BiasEye: A Bias-Aware Real-time Interactive Material Screening System for Impartial Candidate Assessment

In the process of evaluating competencies for job or student recruitment through material screening, decision-makers can be influenced by inherent cognitive biases, such as the screening order or anchoring information, leading to inconsistent outcomes. To tackle this challenge, we conducted interviews with seven experts to understand their challenges and needs for support in the screening process. Building on their insights, we introduce BiasEye, a bias-aware real-time interactive material screening visualization system. BiasEye enhances awareness of cognitive biases by improving information accessibility and transparency. It also aids users in identifying and mitigating biases through a machine learning (ML) approach that models individual screening preferences. Findings from a mixed-design user study with 20 participants demonstrate that, compared to a baseline system lacking our bias-aware features, BiasEye increases participants' bias awareness and boosts their confidence in making final decisions. Finally, we discuss the potential of ML and visualization in mitigating biases during human decision-making tasks.


INTRODUCTION
The process of material screening during admissions plays a vital role in the intricate decision-making process for both college enrollment and corporate recruitment. Typically, this process involves independent reviews of various segments of the applicant's materials, resulting in a multidimensional assessment of their qualifications. Subsequently, reviewers record key points on a decision sheet for each application [55].
Application materials encompass a diverse range of documents, including personal resumes, additional certifications, and letters of recommendation, among others. Given the substantial volume of applications, various automated techniques have emerged to assist in systematically and efficiently extracting and storing information. These include academic explorations [25,54] and commercial solutions such as Daxtra and Bello AI. In terms of material screening, computer programs can provide a more objective and consistent assessment method based on predefined criteria [34]. They are also employed to achieve diversity in candidate selection [26]. However, these automated methods cannot comprehensively evaluate an applicant's personality and potential, nor can they fully grasp the complexities of background information. Human reviewers, on the other hand, excel at flexible adaptation [32]. Still, they are susceptible to cognitive biases stemming from perceptual illusions, false memories, logical fallacies, and cognitive errors [32]. These biases are inherent in human perceptual and intuitive decision-making processes. While efforts can be made to identify and mitigate these biases, they cannot be entirely eliminated [32]. Furthermore, cognitive biases can be exacerbated by factors such as decision fatigue [44] and choice overload [11].
During material screening, reviewers suffer from cognitive biases stemming from several challenges. These challenges also underscore the difficulty of raising awareness about and mitigating these biases in the decision-making process. First, the lengthy and intermittent screening process can lead to recency bias [50], as memory of earlier assessments fades over time, and decision-making criteria can be erratic (I1). Second, certain attributes of applicants can trigger the "halo" or "horns" effect, hindering reviewers from providing unbiased assessments of other traits [24]. For instance, during the initial assessment of academic factors, reviewers might encounter an outstanding achievement, like a perfect math grade. This initial impression can lead to an anchoring bias [57], where reviewers may expect equally exceptional performance in other areas, potentially leading to a biased evaluation of overall aptitude (I2). Third, balancing multiple admission goals, including inclusivity and selectivity, is challenging due to memory limitations and cognitive workload. Reviewers may fall prey to the contrast bias [49], where their judgments are influenced by the scores given to adjacent applicants (I3). Lastly, the aforementioned challenges necessitate inevitable revisions in the material screening process. Reviewers often need to manually revise scores by reopening applicant pages, which can be challenging due to memory issues and confirmation bias [8] when revisiting applications consciously (I4).
Artificial Intelligence (AI) approaches, while incapable of fully replacing human decision-making in college admissions, serve as valuable tools in addressing and mitigating various cognitive biases. Previous studies in this domain can be classified into several key categories based on the life cycle of bias [15]: 1) Prevent. Preventative training approaches [10,23] aim to explicitly raise awareness of bias, although they can impose a significant cognitive burden on users. Procedural interventions, on the other hand, integrate bias awareness into the decision-making workflow by enhancing information transparency [66] or providing relevant information to reviewers. 2) Discover and 3) Locate. Researchers have developed models to detect biases in real time [22,38,61] and communicate these biases to users through visual elements [40,63], based on the definitions of different cognitive biases. 4) Mitigate. Mitigation strategies and algorithms can be introduced based on machine learning methods [1,22,50] or visual approaches [13,52,53,64], offering promising avenues to reduce cognitive bias. However, existing modeling methods often target specific defined biases, neglecting the interaction between cognitive biases [32] (research gap RG1). For example, anchoring bias from the earliest applications and recency bias from the most recent applications may affect subsequent decisions in the same or opposite directions. Regardless of the type of bias, they can lead to inconsistent decision outcomes, as illustrated in Figure 1. In college admissions, these inconsistent screening results may conflict with the principle of individual fairness [22], where individuals may apply different criteria at different stages of a decision task, resulting in instances with similar characteristics being treated disparately. Previous studies on fairness and diversity in college admissions [26,34] primarily focus on the rationality of final admission outcomes, overlooking individuals' inconsistent outcomes and the individual
material screening process (RG2). Regarding visualization approaches, previous research [52,53,55] has demonstrated the potential of visual and interactive strategies to enhance human decision-making theoretically. Nonetheless, the integration of AI methods and visualization strategies to address cognitive biases has been infrequent, and there has been limited assessment of their combined effectiveness in practical applications (RG3).
To explore the factors contributing to inconsistent decision-making outcomes and the needs of reviewers for a feasible screening system, we conducted interviews with seven experienced reviewers from various academic disciplines at local universities. Based on six findings obtained from these interviews (subsection 3.2), we identified four primary challenges concerning four cognitive biases (introduced as C1-C4 in subsection 3.3). In light of our literature review and the identified challenges, in subsection 3.4 we devised a four-step pipeline, PREVENTING → DISCOVERING → LOCATING → MITIGATING (RG1), applicable to inconsistent decision-making that results from any cognitive bias, along with five essential design requirements for developing an effective system. Subsequently, we conceptualized and developed BiasEye, a bias-aware real-time interactive material screening visualization system. BiasEye serves the purpose of prompting, tracking, and scrutinizing individual decision-making (RG2) during the screening process in accordance with the four-step pipeline. The system's backend employs ChatGPT-4 to extract features from application materials and models individual screening preferences through a machine learning (ML) approach (RG3). On the frontend, BiasEye offers a side view that visualizes statistics for a group of applications, with each application being highlighted, as well as a summary page for retrospective decision inspection and adjustment. To assess the utility and effectiveness of BiasEye, we conducted a mixed-design user study involving 20 participants (RG3). The study provided strong support for the enhanced usefulness and effectiveness of BiasEye compared to a baseline system and any combination of the baseline system with the addition of the side view or a summary page. Notably, BiasEye helped participants implicitly reduce their inconsistent screening results without introducing or suggesting cognitive bias explicitly. Although the additional design
elements increased cognitive load, participants reported increased confidence in the perceived reasonability and consistency of their screening results. Furthermore, the system aided participants in better establishing their evaluative criteria, resulting in more concentrated scoring for high-quality applicants within the same group. Additionally, our observations indicated that BiasEye facilitated participants' understanding and explanation of the model predictions. The presence of convincing evidence played a crucial role in their final level of trust in the system's predictions. Building upon our findings, we put forth several design implications for future developments in material screening systems.

Figure 1: An illustration of how the contrast bias emerges in sequential material screening tasks. In such scenarios, the condition of adjacent application materials can influence a reviewer's assessment, resulting in reviewers making inconsistent judgments about the same application X under varying conditions.
• In a formative study with seven participants, we identified challenges leading to cognitive biases in material screening and proposed a four-step pipeline to address them.
• We developed BiasEye, an interactive screening system that models decision preferences and mitigates bias.
• A user study with 20 participants evaluated BiasEye's usability, effectiveness, and impact on behavior, workload, and confidence in screening outcomes.

RELATED WORK

Material Screening During the College Admission Review Process
In college admissions, the holistic review approach has been widely recognized and explored across various domains [36,56]. A critical component of this process is material screening, which occurs after application submission and precedes the committee meeting. Talkad Sukumar et al. [55] conducted an in-depth study on the holistic review process employed by American universities, with a specific focus on aspects related to human-computer interaction and technical support. As Sukumar et al. described, application reviewers are entrusted with a pivotal phase known as Material Screening, where reviewers draw upon their expertise and apply a predefined set of criteria, aligned with the university's mission and objectives, to evaluate applications. This comprehensive evaluation encompasses a wide array of factors gleaned from the materials submitted by applicants, including a student's high school background, family history, encountered challenges, as well as both academic and non-academic achievements, such as community service and special talents [55]. The material screening process is inherently subjective and intricate. It requires reviewers to assess applicants within the broader context of their individual backgrounds and life experiences. Rather than following a rigid, predefined protocol, reviewers rely on flexible personal heuristics; however, such subjectivity can inadvertently introduce systematic errors or biases [57] such as anchoring and confirmation bias [55]. This study (subsection 3.3) explores how four task challenges associated with four prominent biases affect the screening process and examines the tools and methodologies employed by reviewers, providing valuable insights into this crucial stage of college admissions.

Human Bias Detection and Mitigation
The college admissions screening process has low validity, limiting the ability to discern patterns and develop accurate intuitions and making experts prone to cognitive biases like anchoring bias [12,46,48,58], the attraction effect [19], and confirmation bias [8]. These biases have been extensively studied and categorized in comprehensive taxonomies [20,43,45,61]. Additionally, research shows that the order of presenting the same information can significantly influence decision-making [1], and recent personal decisions can serve as anchors, leading to errors or inconsistencies when reviewing the same case [22]. Aligned with [22], we advocate for individual fairness, ensuring similar individuals are treated equitably, while extending beyond addressing anchoring bias. In this study, we use "inconsistent" to describe situations where individual fairness is violated within the material screening process. Detecting and mitigating cognitive bias is crucial in decision-making processes, and previous work falls into four distinct categories: 1) Preventing. Some studies [10,23] have focused on prevention by utilizing training approaches to raise awareness and discourage biased heuristics. However, relying solely on prior knowledge may not effectively mitigate biases and can impose cognitive burdens on users [10,23]. Procedural interventions integrate bias avoidance into workflows without explicitly highlighting biases, such as increasing information transparency [66] and providing more relevant information to assess applicants, thereby improving the retrievability of relevant instances [55]. 2) Discovering. Researchers have used machine learning and visual environments [43,50] to detect human biases, and some have defined and measured bias indicators [60,61]. This category is closely associated with the next: 3) Locating. Studies such as [63] and [40] visualized bias indicators within situational or peripheral views to pinpoint the source of bias. Echterhoff et al.
[22] captured a reviewer's anchoring state using a probabilistic model to retrospectively locate biased decisions. 4) Mitigating. Akl et al. [1] developed strategies to reduce order effects and enhance decision-making based on probability models. Visual methods, such as design spaces [64] and simple visual representations [13], have been proposed to mitigate cognitive bias. Research [52,53] has demonstrated that implementing visualizations in the review process can automatically address cognitive biases, alleviating user concerns.
In this study, drawing inspiration from prior research, we have integrated a four-step pipeline into our material screening system. First, we present supplementary information and statistics related to applications to prevent cognitive bias. Next, employing machine learning techniques, we create dynamic models of real-time individual decision preferences based on a user's historical choices. Through our visualization design, users can discover and locate any inconsistencies in their decisions, ultimately helping them mitigate these inconsistencies conveniently.
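As a concrete illustration of the DISCOVERING and LOCATING idea, a minimal sketch can flag potential individual-fairness violations by comparing the scores of applicants whose feature vectors are close. All names and thresholds below are hypothetical illustrations, not BiasEye's actual preference model.

```python
from math import dist

# Illustrative sketch: applicants with similar feature vectors should
# receive similar scores; large score gaps between close neighbours are
# flagged as potentially inconsistent decisions.
def flag_inconsistencies(applicants, feature_eps=0.15, score_gap=1.5):
    """applicants: list of (id, feature_vector, score). Returns flagged id pairs."""
    flagged = []
    for i, (id_a, feat_a, score_a) in enumerate(applicants):
        for id_b, feat_b, score_b in applicants[i + 1:]:
            similar = dist(feat_a, feat_b) <= feature_eps   # Euclidean distance
            divergent = abs(score_a - score_b) >= score_gap
            if similar and divergent:
                flagged.append((id_a, id_b))
    return flagged

reviews = [
    ("A01", (0.90, 0.80), 9.0),
    ("A02", (0.88, 0.82), 6.5),  # near-identical profile to A01, very different score
    ("A03", (0.20, 0.30), 4.0),
]
print(flag_inconsistencies(reviews))  # → [('A01', 'A02')]
```

A real system would track such flags incrementally as each score is submitted, so inconsistencies surface during screening rather than only at final review.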

AI-Enhanced Approaches for Material Screening and Holistic Review Support
Material screening serves as the crucial initial step in assessing a candidate's qualifications. To enhance efficiency and fairness, various methods have been developed to optimize procedures such as the holistic admission process [31] and information ordering [2], or to automate particular tasks such as resume screening [47], assessment [34], and information extraction [30,35]. Natural Language Processing (NLP) [28] has also been used to detect and correct resume errors [41] and to conduct rating classification while reducing human bias [3].
Although automated screening can mitigate human bias, concerns about potential discrimination stemming from biased data or algorithms, including racial discrimination, have been raised [16-18,42]. Initiatives like FairCVtest [? ], MANI-Rank [9], and Gilbert et al.'s human-centered AI tool [26] aim to address these issues. However, it is important to note that while these methods automate parts of application material processing, they may not fully capture human review patterns or contextual nuances, limiting their use in holistic admissions reviews. Our approach uses machine learning as a supportive tool for human decision-making, adapting to individual reviewer preferences to personalize bias mitigation while retaining the final decision in the hands of the human reviewer.
Several software platforms, such as Slate, Kira Talent, and Submittable [51], support holistic review processes. The American Association of Medical Colleges (AAMC) also offers tools and principles for holistic review [4]. Additionally, the College Board and Education Counsel jointly published a guide [14] that includes a diversity metrics dashboard. Metoyer et al. [39] explored group decision-making and integrated visual storytelling support into collaborative review for transparency and rigor. While these efforts focus on addressing cognitive bias and human decision-making in holistic admissions, our study centers on the material screening process prior to committee meetings. We aim to enhance bias awareness and promote self-reflection through machine learning and visualization in digital applications, building models of reviewers' personal decision preferences.

FORMATIVE STUDY
This study aims to help reviewers deal with inconsistent review decisions caused by cognitive biases. To achieve this, we conducted a formative study to understand reviewers' current practices and needs. These insights inform the design requirements for a system tailored to this context.

Participants and Procedure
To comprehensively understand the current state of the material screening process, the challenges faced by reviewers, and their expectations for screening systems, we conducted semi-structured interviews with seven experienced reviewers. The participants, with a mean age of 32.6 years (standard deviation 13.1), included four males and three females, offering diverse perspectives. Our objectives were twofold: to explore current practices and challenges in material screening, and to identify strategies to mitigate cognitive biases while enhancing efficiency and satisfaction. Interviewees represented various roles, including admissions officers, material reviewers, and interviewers, spanning academic and professional backgrounds in fields like Computer Science, Industrial Design, Entrepreneurial Finance, and FinTech. We developed the interview script through informal discussions with admissions officers and reviewers. As outlined in Table 1, participants discussed their screening procedures and experiences, and shared views on four cognitive biases: anchoring bias, recency bias, contrast bias, and confirmation bias, along with coping strategies and specific requirements within each scenario.
We used Braun and Clarke's six-phase thematic analysis framework [27] to analyze the interview data. The analysis involved two researchers proficient in qualitative research methods. One researcher performed the initial coding of the data, while the other meticulously reviewed the codes to ensure accuracy and completeness. Through iterative discussions, the two authors first reached a consensus on the summarizing statements, resolving potential ambiguities or conflicts. Next, they collaboratively identified six screening findings, which subsequently gave rise to four key challenge themes, discussed in subsection 3.2 and subsection 3.3, respectively. These insights informed the derivation of five design requirements, forming the foundation for the four-step strategy explained in subsection 3.4.

Findings about the Current Material Screening Process
This section presents six key findings from our interviews about the current material screening process, comparing them with the findings in [55].

Finding 1: Multiple rounds of material screening. Material screening has become more complex and time-consuming, with universities adopting a multi-round approach (E1, E2, E5), differing from the simplified approach in [55] where one reviewer was assigned per applicant. Moreover, reviewers encompass a spectrum of experience levels, ranging from senior assistant students acting as junior reviewers to professors serving as expert reviewers in each evaluation cycle. This approach achieves the dual objective of maintaining selectivity and inclusivity simultaneously. As E1 noted, "A significant number of applications exist, and junior reviewers should screen out the underperforming ones, thus allowing expert reviewers to focus on the more competitive submissions." Applicants undergo multiple reviews leading to interviews and committee meetings to finalize admissions.
Finding 2: Multiple reviewers in each round. To mitigate the impact of personal preferences, each applicant is assigned to reviewers from various departments, and their scores are averaged to determine the effective score (E2, E4, E6, E7). According to E6, "Reviewers possess their own preferences, and enabling reviewers with diverse backgrounds to assess the same applicants aligns with the objective of achieving a more diversified admissions process." Similar to [55], reviewers primarily handle applications from their respective or familiar regions, but not exclusively so.
Finding 3: Diverse admission expectations. As noted in [55], reviewers are tasked with balancing diverse and inclusive admission goals with the school's mission. Moreover, fair assessment is ensured by considering the average scores from at least three reviewers per round, with score distributions expected to follow an assumed normal quality distribution.

Finding 4: Time-constrained, intermittent screening. Reviewers are provided with one to two weeks to autonomously accomplish their screening assignments, usually during breaks in their regular work and study schedules, as outlined in [55]. Reviewers have the flexibility to either assess a few applications daily during their spare moments or allocate a dedicated continuous time block to evaluate all applications (E1-7).
Finding 5: Aggregating multi-dimensional assessments. Candidate assessment involves considering various dimensions such as educational background, academic and non-academic activities, and letters of recommendation [55]. Universities assign weights to each dimension for an overall score, rather than a single cumulative score. Furthermore, the admissions office provides a list of competitions and awards for seamless integration into the scoring system; as E4 mentioned, "This process has been automated recently as part of the system iteration."

Finding 6: Outdated material screening system. As discussed in [55], existing material screening systems are predominantly representational and lack interactivity. Reviewers navigate a list of applications, each with bundled PDF materials and a digital decision sheet for scoring and commenting. The application list shows screening progress, scores, and a submission button. While these systems provide basic functionalities like electronic storage and accessibility, they lack advanced features.

Challenges in Material Screening Process
In this section, we explore each challenge theme (C1-C4) by examining the fundamental characteristics (Findings 1-6) of the material screening process, helping us identify potential cognitive biases in this phase.
C1: Balancing workload and fairness in college admissions screening. Despite the need for a holistic approach in college admissions (Finding 3), the high volume of applications often restricts the time and energy reviewers can dedicate to each student, hindering thorough exploration and deliberation (E1, E3, E4, E6). The automated awards-to-score approach in Finding 5 may reduce some workload, but not all awards are listed, and subjective judgment based on experiential knowledge remains necessary. As E7 stated, aligning average students with the score distribution requirements in Finding 3 is challenging: "How do you come up with the boundary for those average students? It's a bit tricky, and honestly, I didn't know it right from the beginning." The contrast bias [49] is particularly evident with intermediate qualifications: an outstanding applicant can overshadow others, and a series of subpar materials may lead to higher scores for an average applicant [22].
C2: The screening procedure can be quite time-consuming and frequently intermittent. As highlighted in Finding 4, the screening task is susceptible to interruptions and places a substantial memory burden on reviewers due to their constrained time (E4, E6) and the fragmented nature of their personal schedules (E1, E2, E3). Moreover, the influence of recency bias [50] prompts reviewers to base their decisions on applicants they've recently assessed (E5), resulting in a fluctuating personal evaluation criterion.
C3: Reviewers might be susceptible to the allure of the halo effect. As Finding 5 and [55] suggest, an applicant's academic performance can anchor assessments of other dimensions (E1, E3, E4), validating the anchoring bias [57]. Furthermore, this anchoring effect can manifest positively or negatively in various aspects. For instance, E3 remarked, "This student possesses extensive experience and stands out among applicants. Excellent! I'm inclined to award extra points in every dimension." Conversely, E4 expressed doubts, stating, "Did his parents ghostwrite this self-introduction? Some phrases appear to be readily available online, suggesting a lack of sincerity, which raises concerns about the other achievements." Anchoring bias can subtly influence reviewers using a heuristic approach to decision-making, resulting in unintentionally inconsistent outcomes.
C4: Reviewers struggle with inconvenient systems and lack guidance when making score adjustments. Not discussed in [55], reviewers tend to assign lower initial scores [5] out of caution, owing to an incomplete understanding of the applicant pool. As E7 expressed, "I acknowledge that this may seem unfair to students in the front. I'll proactively make adjustments, although I can't guarantee them." As screening progressed, decision fatigue led to declining decision quality and a preference for expedient or mean-score heuristics (E4). Inconsistent outcomes (C1-C3) necessitated revisions, with reviewers revisiting decisions repeatedly and verifying them before final submission. E1 emphasized iterative score revisions to ensure fairness, stating, "It's essential to make adjustments, especially when more competitive performances are noticed further back. I need to lower those high scores at the front." However, this was exhausting due to unreliable memory and the outdated system (Finding 6). E6 suggested making retrospective assessment more intuitive and optimizing interaction beyond the current "click to display" method.

A Four-Step Pipeline and Design Goals
Drawing from relevant research and interview insights, we present a four-step pipeline that addresses the four challenges and provides a foundation for the system design goals D1-D5.
Step 1: Preventing. Humans may not consistently excel at repetitive tasks [21], so enhancing screening quality, especially addressing C1, involves reducing the reviewer's workload. This means focusing on automating routine tasks, allowing reviewers to devote their attention to more complex subjective assessments (D1). To tackle C1, D1 includes preprocessing and gathering necessary information for screening applications, thus enhancing the retrieval of instances related to the availability heuristic [55]. Furthermore, automating repetitive judgments and actions through intuitive representation and simplified interaction is a practical strategy for C4.
Step 2: Discovering. Constraints of the screening procedure and human cognitive processes can lead to biased decisions (C1-C3) [32]. However, participants were either unaware of or underestimated the impact of bias in their assessments (E4, E7). With subjective criteria involving intangible, shifting factors, an objective approach is needed to help recognize and rectify irrational behavior. Our system aims to facilitate reviewers' understanding and management of the screening process (D2), as well as to explicitly reveal screening preferences to uncover potential inconsistencies (D3).
Step 3: Locating. While the initial discovery step offers an overview of the screening process, in-depth analysis is crucial to implement targeted strategies addressing inconsistencies from C1-C3. D4 involves examining bias tendencies and evaluating specific bias instances. Our system should provide transparent, comprehensive information for multifaceted material comparisons. Through interactive visualization, reviewers can identify inconsistent outcomes and make informed judgments, promoting fairness and objectivity in screening.
Step 4: Mitigating. The final key step in bias mitigation is score modification. The current system, mainly representational, lacks interactivity, burdening reviewers physically and cognitively when adjusting decisions (C4). Additionally, comparing numerous similar candidates to arrive at reasoned scores is challenging (E7). To address C4, our system should enable quick adjustments and provide reasonable score recommendations (D5) to ease reviewers' bias concerns. Meanwhile, interactive visualization is a promising way to enhance assessment efficiency and effectiveness.

BIASEYE
In line with design goals D1-D5, we present BiasEye, a real-time interactive system aimed at assisting reviewers in preventing, discovering, locating, and mitigating inconsistent decision-making. Implemented using the Flask and Vue.js frameworks, it leverages Element Plus components and D3.js [6] for visualization. BiasEye consists of three pages: 1) Student List, displaying assigned applications and screening progress; 2) Assessing, showing extracted information and original PDF materials for application assessment; 3) Summary, offering retrospective bias-aware score inspection and revision through the Screening Sheet, the Comparison view, and the Ex-situ Table. All three pages share a Statistical view accessible via header navigation (Figure 2-a). Potential usage methods and scenarios for addressing inconsistent outcomes are explored in subsubsection 6.2.1.

Screening Sheet
Aligning with the admission committee's criteria, the Screening Sheet includes a Basic Information section and several screening sections, each with a unique color. In addition to the score and comment components of the original decision sheet mentioned in Finding 6, each section also showcases structured entries extracted from resumes and includes a box plot showing statistical data for the assigned scores.
As depicted in Figure 3, we first convert PDF files into TXT format and filter out resumes with incomplete or inaccurate information, ensuring the quality of our analysis. To extract information, we explored models and tools like CNN-BiLSTM-CRF [35] and pyresparser, but these had suboptimal performance due to diverse resume formats and limited training samples. Consequently, we fine-tuned ChatGPT-4, implementing error-correction code and human verification for precision and consistency. Despite limitations, such as incomplete extraction of low-probability information with limited training data, this tool was effective for resume information extraction. Finally, as depicted in Table 2, all raw text was extracted and structured into JSON format, encompassing five sections: Basic Information, Educational Background, Competition, Honors, and Extra Activity. Letters of Recommendation (LoR) and Personal Statements (PS) are not displayed in this sheet, but users can access these original files directly from the Assessing page.
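To make the target structure concrete, the sketch below produces the kind of five-section JSON record this extraction step yields. The heading-based parsing here is a simplified, hypothetical stand-in for illustration only; the actual pipeline relies on a fine-tuned ChatGPT-4 plus error correction and human verification.

```python
import json

# Toy sketch of the target JSON schema only -- not the paper's actual
# extraction method. Each recognized heading opens a section; subsequent
# non-empty lines become entries of that section.
SECTIONS = ["Basic Information", "Educational Background",
            "Competition", "Honors", "Extra Activity"]

def structure_resume(raw_text):
    record = {section: [] for section in SECTIONS}
    current = None
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in SECTIONS:        # a heading opens a new section
            current = line
        elif current is not None:   # everything else becomes an entry
            record[current].append(line)
    return record

sample = """Basic Information
Name: Alice
Competition
National Math Olympiad, Silver"""
print(json.dumps(structure_resume(sample), indent=2))
```

In practice, the LLM-based extractor handles free-form resumes whose section boundaries are far less regular than this heading-per-line assumption.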
Considering D1 in the PREVENTING step and D4 in the LOCATING step, we incorporate the Screening Sheet, with user-friendly interactivity, into the Summary page. This allows reviewers quick access to concise information about the selected applicant. While some details may be missing compared to the original PDF, the sheet provides ample cues. If more information is needed, reviewers can switch to the Assessing page to examine the PDF.

Statistical View
In response to D1, we design the Statistical view (Figure 2-A) on BiasEye's left side. This view presents global statistics for the current application group, visualizing 12 key indicators. These include school and normalized student GPA rankings, competition and honor counts at various levels, and publication counts with corresponding conference/journal levels. Each indicator (Figure 4) uses a box plot to convey central tendency and dispersion, a density plot to offer detailed distributional insight, and scatter dots to depict the currently selected students, offering an overarching perspective that aids PREVENTING recency and contrast bias. For each section, we defined a set of significant attributes, denoted as A = {a_1, a_2, ..., a_n}, based on feedback obtained during the formative study. These attributes serve as straightforward proxies for human decision-making preferences, as outlined in Table 3. On one hand, most of these attributes are derived from features extracted directly from entries within the JSON file. Numerical and quantitative features, such as the count of competitions at each level, can be readily obtained from the respective entries. Some features necessitate a text classification step before quantitative calculation, such as determining whether an applicant served as a manager or a participant in a project. To facilitate this, we presented input and output samples to ChatGPT and guided it through the classification process, providing explanations along the way; this approach aimed to encourage ChatGPT to engage in a more deliberate thought process. On the other hand, two attributes, school ranking and publication ranking, were derived from additional information. This addresses D1 and aims to streamline the information search process, ultimately reducing the reviewer's workload. The school ranking is assigned a label from 1 to 200, while the conference/journal level is categorized from A to D, where 'D' signifies 'unknown'.

Ex-situ Table
As an
extra enhanced version of the Student List table, the Ex-situ Table incorporates additional visualizations and interactive features to address D2, D4, and D5. It provides an overview and facilitates score modifications, displaying the application ID, applicant name, and section durations, which are calculated from the time difference between two consecutive scoring events. Hovering over a stacked bar (Figure 5) in the 'Time' column reveals specific time values. Clicking a row in the table updates the Screening Sheet and highlights the corresponding application glyph in the Comparison view. The table dynamically displays the corresponding section column based on the selected section, enabling direct score modification for D5. Additionally, the Ex-situ Table offers an interface that employs a machine learning method, specifically Ranking SVM, to help users DISCOVER (D2) inconsistent decision outcomes for each screening section. Through the use of a slider and checkboxes, users can select a specific number of assessed applications as trusted training samples; clicking the button activates the Ranking SVM for analysis.
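The per-section durations shown in the 'Time' column can be derived from the scoring log as gaps between consecutive events. The sketch below assumes each gap is attributed to the section scored at the later event; the paper does not state BiasEye's exact attribution rule.

```python
def section_durations(events):
    """Aggregate per-section time from an ordered scoring log.

    events: list of (timestamp_in_seconds, section_name) tuples,
    sorted by time. The gap between consecutive events is attributed
    to the section of the later event (an assumption of this sketch).
    """
    totals = {}
    for (t_prev, _), (t_curr, section) in zip(events, events[1:]):
        totals[section] = totals.get(section, 0) + (t_curr - t_prev)
    return totals

# A reviewer opens an application, then scores three sections.
log = [(0, "open"), (30, "EB"), (45, "Com"), (80, "Ho")]
durations = section_durations(log)
```

Stacking the resulting per-section totals for each applicant yields the bars rendered in the 'Time' column.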

Comparison View
Inspired by Podium [62], we employed Ranking SVM [33] to automatically infer attribute weights from user-assigned application scores. This approach serves two purposes. First, it helps reviewers examine their individual screening priorities and preferences, providing insight into how personal biases and emphases may affect their assessments. Second, Ranking SVM forecasts future review tendencies using past records, indicating potential biases. Its low computational cost enables real-time monitoring, allowing reviewers to make timely adjustments and evaluate the appropriateness of modifications.
Derive constraints. Ranking SVM optimizes a loss function involving pairwise constraints based on the Support Vector Machines (SVM) framework. We constructed a training set for the Ranking SVM model using a subset of n (n > 6) user-selected assessed applications, each assigned a score s. We form pairs of data points (q_i, q_j) with a label y: if s(q_i) < s(q_j), we set y = 1; otherwise, y = -1. For all pairs i, j ∈ {1, ..., n} where i ≠ j, we generated constraint tuples based on this criterion and treated all constraints as soft constraints.
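The constraint construction above can be sketched as follows; a linear SVM trained on the resulting difference vectors then yields the attribute weight vector. Skipping equal-score pairs is a simplification of this sketch, since the paper keeps all constraints as soft constraints.

```python
from itertools import combinations

def pairwise_constraints(samples):
    """Build Ranking SVM training pairs from scored applications.

    samples: list of (attribute_vector, score) tuples.
    Returns (difference_vector, label) pairs: label 1 when the first
    item's score is below the second's, -1 when above. Equal-score
    pairs are skipped here (a simplification for this sketch).
    """
    constraints = []
    for (x_i, s_i), (x_j, s_j) in combinations(samples, 2):
        if s_i == s_j:
            continue
        diff = [a - b for a, b in zip(x_i, x_j)]
        constraints.append((diff, 1 if s_i < s_j else -1))
    return constraints

# Three applications: (attribute vector, human score).
samples = [([3.8, 2], 4), ([3.0, 0], 2), ([3.5, 1], 3)]
pairs = pairwise_constraints(samples)
```

Feeding these (difference, label) pairs to any linear classifier recovers a weight vector whose sign pattern reflects the reviewer's ranking preferences.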
Calculate the ranking and transfer to score. After training, we obtained a weight vector w for the attributes to rank all the data items. We computed the individual dot products of the weight vector w with each application's attribute vector to derive its ranking score.

To address D1, we developed a visual glyph (Figure 6) for comparing human scores and model predictions. Each glyph corresponds to an applicant, with the number denoting the application ID, the outer ring encoding the human score, and the inner ring encoding the prediction. A linear color scheme is used for both rings, facilitating rapid identification of applications with divergent scores. The ID color indicates whether the human score is higher than, lower than, or close to the prediction, highlighting inconsistencies in human scores and their direction. To LOCATE (D4) anomalies among similar applications, glyph positions are determined using the t-SNE [59] method based on the attributes of the selected section, ensuring similar applications are placed closer together. Solid dots represent the high-dimensional centers of all applicants who received the same human score; they follow the same color scheme as the glyph rings and are connected from lowest to highest score (D2). Furthermore, to provide visual aid for D4, hovering over a glyph or center highlights applicants with the same human score.
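Once a weight vector is learned, prediction reduces to dot products. The linear rescaling into the 1-5 human scoring range below is our assumption; the paper only states that rankings are transferred to scores.

```python
def predict_scores(weights, attribute_vectors, score_range=(1, 5)):
    """Score applications with a learned Ranking SVM weight vector.

    Each raw ranking value is the dot product w . x; raw values are
    then linearly rescaled into the human scoring range (the exact
    rescaling rule is an assumption made for this sketch).
    """
    raw = [sum(w * a for w, a in zip(weights, x)) for x in attribute_vectors]
    lo, hi = min(raw), max(raw)
    span = (hi - lo) or 1.0  # avoid division by zero when all raw values tie
    s_lo, s_hi = score_range
    return [s_lo + (r - lo) / span * (s_hi - s_lo) for r in raw]

# Hypothetical two-attribute weights applied to three applications.
preds = predict_scores([0.6, 0.4], [[0, 0], [1, 1], [2, 2]])
```

Rescaled predictions share the human 1-5 range, which is what lets the inner and outer glyph rings use one color scale.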

USER STUDY
After obtaining institutional IRB approval, we conducted a user study involving 20 participants with mixed backgrounds. The primary aim of this study was to assess the effectiveness of our bias-aware design. To achieve this, we addressed three key research questions (R1-R3) through our evaluation process.
• R1: How are the usability and effectiveness of the bias-aware system in material screening?
• R2: How will participants interact with and be affected by the bias-aware system in material screening?
• R3: How will participants trust and collaborate with the ML method?
5.1 Experiment Setup
5.1.1 Dataset. We obtained IRB approval for data collection and used a dataset from a local university's information science master's program. Graduate admission, similar to college admission, emphasizes merit and alignment with the institution's mission. We randomly selected two groups of 40 complete applications (excluding incomplete ones) for a preliminary trial and a formal experiment. Each application included a resume, academic transcripts, a personal statement (PS), and up to two letters of recommendation (LoR), all in PDF format. To ensure the experiment could be completed within 1.5 hours, we retained only the resume, transcripts, and certificates. Notably, these materials were from past admission interviews, and we had only the raw PDF files, making it impossible to verify results against reliable ground truth. Additionally, we anonymized identifiable details like names and photos.

Baseline System and Control Conditions
We adopted a two-pronged approach to assess the effectiveness of our system. First, we used a between-subject design to evaluate the Statistical view, randomly dividing participants into two groups: Group A used the baseline system, and Group B used BiasEye. Both systems consisted of three pages (Figure 2-1, 2, 3), but the baseline system lacked the Statistical view and the publication level in the Screening Sheet (Figure 13). Both systems were hosted on a web server, accessible to participants via public links. Second, we used a within-subject design to evaluate the Summary page in two stages. In stage I, participants could only use the Student List and Assessing pages. In stage II, participants could further adjust their decisions using the entire system.

Participants
We recruited 20 participants (P1 to P20): 14 males and 6 females. Among them, 6 held bachelor's degrees, 12 held master's degrees, and 2 held Ph.D.s. Participants were evenly divided into the experiment (B) and control (A) groups based on demographics (Table 4). Before the formal experiment, all participants signed a confidentiality agreement and became familiar with the training program and the department's mission. Special attention was given to those without prior relevant experience to ensure they understood the screening expectations. Their participation was incentivized by performance-based compensation.

Task and Procedure
5.3.1 Task. We simulated a real-world material screening scenario for the user study. Participants were instructed to act as students in a Human-Computer Interaction (HCI) laboratory, tasked with preliminarily reviewing 40 admission applications on behalf of their time-constrained professor. Participants were responsible for considering multiple factors such as personal backgrounds, experiences, abilities, and the lab's requirements. Their anonymous screening outcomes would be combined with others' to determine the final screening results.
To fulfill this task, participants: 1) assigned scores to each application in four sections chosen based on the actual department criteria: Education Background (EB), Competition (Com), Honor (Ho), and Extra Activity (ExA); 2) were prohibited from discussion and communication; 3) were not required to consider score weighting within each section; and 4) were encouraged, but not forced, to aim for an average score of 3 in each section. Additionally, online references, including school rankings, conference and journal rankings, and a formal document listing the levels of college student competitions, were provided for assistance. Before the study, participants signed confidentiality agreements and completed a pre-task questionnaire collecting demographics. We introduced the experimental task and its objectives in a comprehensive manner; rather than explicitly disclosing the focus on cognitive bias, we emphasized the core principle of individual fairness and underscored the gravity of inconsistent outcomes. This approach ensured that participants remained unaware of the precise nature of our study. Next, we introduced the system corresponding to each participant's assigned condition in stage I and provided a set of toy trial materials for familiarization. During Phase I, participants were allotted 50-70 minutes to complete the task as consistently as possible, then submitted their results and filled out an in-task questionnaire. The main goal of Phase I was to assess how the introduction of the Statistical view impacts the consistency of participants' decision-making. To address potential residual effects between the two experiments and reduce response bias across the two conditions, we adopted a between-subject design.
Moving to Phase II, we introduced the Summary page and the Ranking SVM model, which learns participants' screening preferences and predicts scores. To directly compare decision-making before and after model intervention, we used a within-subject design independently for both groups. Simultaneously, both groups maintained the between-subject design that included the Statistical view as a variable. Participants were given 20 minutes to revise their outcomes with the assistance of the Summary page. Subsequently, they submitted their results again and completed a post-task questionnaire.
Two of the authors acted as experimenters to ensure smooth progress and provided assistance as needed. The study spanned approximately two hours, with participants receiving USD 12 compensation on average.

Data Collection
We conducted a general quality check for each participant by examining their usage time in Phase I, which started when they began the task and ended at their first outcome submission. One submission from Group B (P20) was rejected due to an extraordinarily short Phase I duration (30 minutes). In addition, one set of scoring log files from Group A (P1) was irreversibly corrupted; we excluded P1's log files and questionnaires from the quantitative analysis but kept the video for qualitative analysis. We ended up with 18 valid responses, 9 per group. All data will be used solely for experimental outcome analysis and will not be shared or disclosed non-anonymously.

Measurement
For both the in-task and post-task questionnaires, we used a 7-point Likert scale (1: Not at all/Strongly disagree; 7: Very much/Strongly agree) and a 10-point scale for workload-related questions to collect participants' feedback on the respective systems and their attitudes toward their own results in different phases of the study. First, in line with the System Usability Scale (SUS) [7], we crafted questions covering: 1) Ease of use; 2) Ease of learning; 3) System satisfaction; and 4) Likelihood of future use. Second, for Self-Evaluation, we designed questions covering: 1) Consistent criteria; 2) Degree of distinction; 3) Fewer revisions; and 4) Perceived efficiency promotion. Third, drawing from the NASA-TLX survey [29], we posed Workload Assessment questions, including: 1) Psychological workload; 2) Physical workload; 3) Time workload; and 4) Level of frustration. Fourth, for System Design, we tailored questions concerning the Statistical view for Group B participants in the in-task questionnaire, and concerning the Ex-situ Table and Comparison view for both groups in the post-task questionnaire, including: 1) Intuitive visualization; 2) Convenience of interaction; and 3) Overall helpfulness. Additionally, we included optional subjective questions for qualitative insights. Participants were instructed to "think aloud" throughout while their screens and audio were recorded. The system documented the section name and score for every modification in scoring logs during both phases for the later quantitative analysis in section 6.

RESULTS AND ANALYSIS
This section organizes quantitative and qualitative results around research questions R1 to R3. Our quantitative analysis, beyond descriptive statistics, employed the Mann-Whitney U test [37] to investigate differences between groups using different systems, and the Wilcoxon signed-rank test [65] to evaluate disparities across phases within groups using the same system. For the qualitative analysis, one author transcribed participants' screen recordings, capturing system usage and reactions to potentially inconsistent decisions. Two authors then coded these transcriptions using thematic analysis [27], with specific examples included in this paper.
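For readers unfamiliar with the between-group test, the U statistic counts, over all cross-group pairs, how often one group's value is smaller, with ties counting one half. The sketch below is only illustrative; a real analysis would use a library routine that also computes the p-value, such as scipy.stats.mannwhitneyu.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for two independent samples.

    Counts pairs (a, b) with a < b, adding 0.5 for ties. The p-value
    is omitted here; use a statistics library for real analyses.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a < b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Toy example: questionnaire ratings from two participant groups.
u_ab = mann_whitney_u([2, 3, 3], [5, 6, 7])
```

A small U for one direction (and correspondingly large for the other) signals that the two groups' ratings barely overlap.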
6.1 RQ1. How are the usability and effectiveness of the bias-aware system in material screening?
As shown in Figure 8, the questionnaire results present participant ratings of system usability at various stages and with different systems.
When comparing the Phase 1 data for both systems, we observed that the BiasEye system did not lead to a significant increase in 'ease of use' or 'ease of learning'. However, it did demonstrate a substantial increase in 'satisfaction' (U = 3.5, p < 0.01) and 'future use' (U = 1.5, p < 0.001).
Comparing data within the same system across phases, we noticed that the introduction of the Summary page had a significant impact. Specifically, it led to a decrease in 'ease of learning' for both systems (W = 0.0, p < 0.05 in Baseline; W = 0.0, p < 0.05 in BiasEye). Furthermore, it significantly enhanced 'satisfaction' (W = 0.0, p < 0.05) in the case of the Baseline system. However, there were no significant changes in 'ease of use' or 'future use' for either system.
Next, we evaluate the efficacy of the BiasEye system using the data collected from participants during the actual scoring process. Our analysis revealed the following two key findings.
Finding 7: The Statistical view and additional information facilitate participants in raising their awareness of bias during the process and proactively reducing inconsistencies in decision-making. Our findings stem from an examination of participants' interactions with the system, focusing on instances where they adjusted their initially assigned scores. The statistical analysis of score revisions during Phase I and Phase II is presented in Figure 9(a) and Figure 9(b), respectively.
In Figure 9(a), it is apparent that participants using the BiasEye system displayed significantly higher average frequencies of score revisions for the EB (U = 522, p < 0.01), Ho (U = 539, p < 0.01), and Sum (U = 389, p < 0.001) categories compared to those using the Baseline system. As there was no machine learning intervention in Phase I, participants adjusted their decision outcomes relying on personal judgment. The transcripts indicate that participants recognized the inconsistency in their initial decisions, and their perception of this inconsistency became more pronounced and less ambiguous. Participants demonstrated the ability to discern candidates with varying qualifications more swiftly and accurately.
To reinforce this observation, we plotted in Figure 9(c) the average number of score changes against the applicant sequence and fitted a linear function. As depicted, as the number of scored students increased, both groups experienced a decline in the frequency of revisions. This aligns with the expectation that participants' evaluation criteria improve and stabilize over time. Notably, the fitted line for BiasEye users is consistently higher than the Baseline's, suggesting that the proposed system generally increased the number of revisions rather than being driven by outliers. The results from Phase II further substantiate the notion that the BiasEye system contributes to the decrease of inconsistent decisions. As illustrated in Figure 9(b), participants using the BiasEye system exhibited significantly lower frequencies of revisions in the Com (U = 1,086, p < 0.01), ExA (U = 1,028, p < 0.01), and Sum (U = 1,181, p < 0.001) categories. Despite being exposed to more comprehensive global information in Phase II, participants using the BiasEye system had already mitigated most inconsistent decision outcomes during Phase I, decreasing the need for additional score revisions. P10 explicitly stated, "Without those charts on the left (Statistical view), it can be kind of hard for me to tell the difference between the different application levels because the scores start to blur together. Having those charts really makes a difference for me."

Finding 8: Participants using the BiasEye system exhibit more concentrated scoring for high-quality applicants, resulting in fewer instances of inconsistent outcomes. While cognitive bias can play a role, it is important to recognize that different reviewers may hold varying opinions about an application. Existing literature, as mentioned in Coleman et al. [14], emphasizes the use of "interrater reliability" to ensure the effectiveness and consistency of screening decisions. One way to assess this is through "composite reliability", as outlined by Coleman and colleagues [14], where a group of reviewers score within an acceptable range.
To evaluate whether the Statistical view in the BiasEye system helps mitigate screening inconsistencies, we compared the screening outcomes for high-quality applications in Phase I at different score levels (assuming equal section weights) in both Group A and Group B. We took the intersection of the results from both groups to ensure consistency. The outcomes are presented in Figure 10, where each bar represents the number of applications receiving a specific score. Here, we denote the number of compared applications as N and measure the kurtosis of each histogram as K.
Our observations indicate that Group B, using the BiasEye system with the Statistical view, exhibits more centralized outcomes, reflected in the higher kurtosis value (K). A similar trend is observed when comparing Phase I to Phase II, regardless of whether the Baseline or BiasEye system was used; this underscores the effectiveness of the Summary page in mitigating inconsistencies, as seen in Figure 11. It is worth noting that the phenomenon in Figure 11 is less pronounced because the comparison is based on an intersection, which excludes a significant portion of adjusted applications in Phase II.
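The kurtosis measure used here can be computed from raw scores as the fourth standardized moment. The exact estimator the paper used is not stated, so the population (non-excess) form below is an assumption.

```python
def kurtosis(values):
    """Fourth standardized moment m4 / m2^2 of a sample.

    Higher values indicate a more peaked, concentrated histogram
    (a normal distribution gives about 3). The population, non-excess
    form used here is an assumption about the paper's exact estimator.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2

# Scores concentrated around 3 versus scores spread evenly over 1-5.
k_concentrated = kurtosis([3, 3, 3, 3, 1, 5])
k_spread = kurtosis([1, 2, 3, 4, 5])
```

Consistent with the analysis above, the concentrated histogram yields the higher kurtosis value.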
6.2 RQ2. How will participants interact with and be affected by the bias-aware system in material screening?
Building on the analysis methods outlined in section 6, we first explore how participants engage with our bias-aware designs to fine-tune their decision outcomes. We then present findings on how these designs affect participants' cognitive workload and their self-evaluation of the decisions made.
6.2.1 Usage pattern. This section presents observations on how participants used our system design in a systematic four-step process aimed at preventing, discovering, locating, and mitigating inconsistent decision outcomes.
Step 1. Preventing. Video transcriptions show that all Group B participants employed the Statistical view for insight into application materials and for positioning applicants, and considered publication level as a screening criterion. Conversely, most Group A participants (7 out of 10) tried to check the given references, but nearly all of them (6 out of 7) stopped after about 20 applications, and the remaining three participants ignored this feature entirely. These observations support our design motivation to enhance information transparency and accessibility.
Step 2. Discovering. Participants predominantly employed two categories of methods to identify inconsistencies in their decision outcomes on the Summary page. First, a minority of participants (3 out of 19) inspected exceptions in the time allocation shown in the Ex-situ Table. Second, the majority of participants (17 out of 19) used the back-end model to help them discover potential anomalies through prediction scores. They selected trusted samples for back-end training through three distinct approaches:
• Most participants (13 out of 19, including 7 from Group A) directly chose applicants falling within a specific range based on the screening order. This range deliberately excluded the initial 5-10 applicants, as participants perceived their screening criteria to be either more lenient or stricter for this subset. This observation suggests that cognitive bias cannot be entirely eliminated, even with the aid of statistical and supplementary information.
• A subset of participants (5 out of 19, including 2 from Group A) manually selected representative applicants from each score category (1-5) using checkboxes.
• Participant P9 employed an unconventional approach that exceeded our expectations in sample selection. Initially, P9 included all applicants in the first round of training and then chose applicants exhibiting consistency between the model's predicted scores and human scores as the final samples for a second round of training. In the video, P9 mentioned being unfamiliar with machine learning but believed this approach could help identify samples representative of his scoring criteria. We observed an increase in the number of consistent outcomes after the second round of training, although this may have occurred by chance. Exploring whether repeating such operations could lead to convergence and automate the process is an intriguing topic for future research.
Step 3. Locating. Building upon the methods outlined in Step 2, participants employed specific strategies to identify anomalies. This process can be categorized into two distinct approaches. First, a minority of participants (3 out of 19) directly scrutinized applicants who received either inadequate or excessive time allocations, classifying them as cases of oversight or difficulty in decision-making, respectively. Second, participants who utilized the back-end model (17 out of 19) employed two primary methods to pinpoint applicants with potential inconsistencies:
• Five participants harnessed the sorting function within the Ex-situ Table. They initially sorted the table based on the columns labeled 'EB/Com/Ho/ExA' or 'Mitigate'. Their focus was directed toward applicants where the order of predictions/human scores contradicted the ascending or descending order of human scores/predictions.
• Twelve participants identified potential inconsistencies by observing the ID color and assessing the variance between the two rings of a glyph. When confronted with multiple anomalies marked in blue or red, participants developed distinct patterns of focus: i) a majority (7 out of 12) concentrated on applicants displaying a high discrepancy between the two rings, a preference influenced by their personal perception; ii) two participants focused on identifying lower/higher scores within an overall trend of higher/lower scores; iii) three participants searched for inconsistencies within the pool of applicants who had received high human scores.
These patterns of focus shed light on participants' expectations of generating rational screening outcomes.
Step 4. Mitigating. Participants accessed the Screening Sheet of the corresponding student on the Summary page by clicking rows within the Ex-situ Table. They employed various strategies to adjust the assigned scores, including: (1) comparing the applicant's score with those who received the same score; (2) comparing the applicant's score with individuals who had similar predicted scores from the model; (3) comparing the applicant's score with students positioned closely in the Comparison view; (4) relying entirely on, or taking into consideration, the model's recommendations; (5) referring to the keywords listed in the notification card to understand the model's rationale and checking whether any relevant features were overlooked during Phase I; (6) assessing the model's performance based on keywords and the distribution of ID colors in the Comparison view to determine whether further examination of potentially inconsistent applications was necessary. The sixth strategy is particularly relevant to the issue of trust in the model, and our related findings are presented in subsection 6.3. These strategies underscore the adaptability of our system design, accommodating the diverse usage habits and preferences of individual users while achieving the goal of mitigating inconsistent decision outcomes.

6.2.2 Effects on participants' cognitive workload. In this section, we employ questionnaire data to assess the variations in workload among participants when comparing the Baseline and BiasEye systems. The results are visually presented in Figure 12(a). During Phase 1, BiasEye significantly reduced psychological (U = 70.0, p < 0.01) and time workload (U = 72.0, p < 0.01) compared to the Baseline. Transitioning from Phase 1 to Phase 2, the introduction of the Summary page resulted in significant reductions in both psychological (W = 0.0, p < 0.05 in Baseline; W = 0.0, p < 0.05 in BiasEye) and physical workload (W = 0.0, p < 0.05 in Baseline; W = 5.5, p < 0.05 in BiasEye) for both systems. Participants did not report significant changes in time workload or feelings of frustration.

6.2.3 Effects on participants' self-evaluation. In Figure 12(b), we present the differences in self-evaluation between the Baseline and BiasEye systems. During Phase 1 of the experiment, participants using BiasEye reported more consistent criteria (U = 8.0, p < 0.01) and better distinction among applications (U = 11.5, p < 0.01) compared to the Baseline group. Additionally, BiasEye significantly improved perceived screening efficiency (U = 6.0, p < 0.01). Moving on to Phase 2, the introduction of the Summary page had a notable impact on both groups: it enhanced criteria consistency (W = 0.0, p < 0.05 in the Baseline group; W = 3.0, p < 0.05 in the BiasEye group) and improved the distinction among applications (W = 0.0, p < 0.05 in the Baseline group; W = 0.0, p < 0.05 in the BiasEye group). However, it is important to note that only the Baseline group reported a significant increase in efficiency.
6.3 RQ3. How will participants trust and collaborate with the ML method?
Through a qualitative analysis of video transcripts, we identified varying levels of trust among participants in the suggestions provided by the model. This trust, in turn, influenced their collaborative interactions with the machine-learning-supported assistant system. Among the 19 participants in our study, only two opted not to use the model; the remaining 17 all made revisions based on the model's recommendations. It is important to note that participants retained ultimate decision-making authority over the screening results: they determined whether to accept, refer to, or question the prediction scores of an application, integrating their own understanding of the application materials. The machine learning method served as a supplementary tool, offering a clear and expedited path to identify inconsistencies within specific applications. More specifically, our study revealed the following findings about participants' trust in and collaboration with the ML method.

Finding 9: Participants' trust in the model's performance is screening-section-independent. Participants' lack of trust in the model's performance in one section did not influence their trust in other sections. For instance, P6 remarked, "The model's predictions in the Ho section are not accurate, but it does help me identify many incorrect scores in the EB section." Similarly, P11 expressed, "I'm quite confident in how the model handles quantitative data, but the content in the ExA section encompasses various elements, and I doubt the model could comprehend my criteria."
Finding 10: Participants generally attempted to comprehend the rationale behind model predictions, but success was not guaranteed. Participants often inferred the reasons behind the model's predictions by examining various factors, including the prediction itself, attributes with significant weights displayed on the Notification Card, and raw information about each student. These inferences ranged from grasping the overall logical reasoning of the model to providing individual explanations for specific application predictions. For example, P3 remarked, "Attributes on the notification involve scores of the English proficiency test (CET) and school ranking, but the model thinks I gave high scores for many applications... um, it is sensitive to CET scores; I care less unless the CET score is under 500." Conversely, according to P13, "The model scores 2, but he comes from an experimental class at a university, with an understandably low ranking that the model might have overlooked. I'm sticking to my opinion." As the system did not explicitly specify the concrete attributes contributing to each application's prediction, there were instances where participants found it challenging to make successful inferences, leading to comments such as "I can't understand," as noted by several participants.
Finding 11: Participants tended to question the rationality of their decisions when there was a significant disparity between the predictions and human scores/expectations. Participants' awareness of these differences stemmed from two primary sources. On one hand, it was influenced by the overall color trend of the ID text in the Comparison view, as noted by P16, who mentioned, "There are many red colors, and I am overwhelmed." On the other hand, participants observed discrepancies between the human scores they assigned and the predicted scores for each application. P8 remarked, "The prediction is around 5 points, but why did I only give 1 point? Although I don't quite understand why it scores 5, I decide to increase the score a bit." Despite being informed at the beginning of the experiment that the model's predictions may be inaccurate, participants still exhibited a degree of blind belief in and reliance on the model, particularly when they felt uncertain about an application. As P8 questioned, "I am struggling with this score, or should I listen to the model?" Nevertheless, it is worth noting that the proposed system mitigated confirmation bias to some extent by encouraging participants to engage in a second round of deliberation.
Through an analysis of the video transcripts, we also identified various factors that influenced participants' trust.
Factor 1: The consistency between keywords and participants' perception of decision criteria. The attributes listed on the Notification Card served as the initial point for participants to grasp the model's functioning. When these attributes did not align with the participants' preconceived criteria, it led to doubts regarding the model's predictions. For instance, P11 exhibited skepticism towards the attributes in the ExA section. Upon identifying discrepancies and disagreeing with the predictions for three applications in the Comparison view, P11 promptly cross-referenced and verified the scores of multiple applications in the Ex-situ Table independently. We observed that inconsistent perceptions could also arise from differences in how attributes were categorized and participants' mental frameworks. For instance, competitions were initially categorized into different subjects within the Com section. However, participants were often unaware of the distinctions between different subject areas within competitions, particularly for competitions that were rarely mentioned in the materials and thus not well-remembered or paid attention to.
Factor 2: The significance of differences between predictions and human scores. Participants exhibited strong trust in prediction scores that closely matched or were consistent with human scores. None of the participants actively sought applications with gray-colored IDs in the Comparison view or those where human scores and predictions were sorted in the same order in the Ex-situ Table. As P9 stated, "Both the model and I agree with these scores, so there's no issue at all." The greater the difference between the human score and prediction, the more likely it was for inconsistencies in applications to exceed the participants' threshold and capture their attention. For instance, P6 commented, "Differences less than one are not an issue. I'll check the others that had larger score differences." Conversely, the overall color trend of IDs in the Comparison view also influenced participants' trust in the model's predictions. P12 mentioned, "There are many gray ones, so I believe that the model has learned well." It's essential to note that this factor does not contradict the phenomenon of self-doubt arising from higher/lower color trends mentioned in Finding 11.
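The attention threshold this factor describes can be sketched as a simple flagging rule. The sketch below is illustrative rather than BiasEye's actual code: the function name and color labels are our assumptions, and the 1-point default merely mirrors P6's comment.

```python
def flag_inconsistency(human_score: float, prediction: float,
                       threshold: float = 1.0) -> str:
    """Color an application ID by human-model disagreement.

    "gray" marks agreement; "red" flags a gap large enough to warrant
    a second look. The 1-point default mirrors P6's comment and is
    illustrative, not the threshold the system actually uses.
    """
    return "red" if abs(human_score - prediction) > threshold else "gray"
```

Under such a rule, the "overall color trend" P12 and P16 reacted to is simply the proportion of red versus gray IDs across all applications.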
Factor 3: The presence of sufficient evidence for confirmation and trust. Participants were more inclined to trust predictions when they discovered ample evidence to support them. This evidence could be gathered by verifying whether key information in an application had been overlooked or by making comparisons between multiple applications. For example, P4 admitted, "It's my fault. I didn't pay attention to the Mathematical Contest In Modeling just now." In a similar vein, P11 commented, "Compared to other applications that meet my expectations of four, this application is indeed slightly worse. I will follow the model and adjust it to three." Differences in how participants and the model interpreted the same information sometimes hindered their trust in the predictions. For example, P13 from Group A, which did not have access to the Statistical view, questioned, "Why did the model give him a score of four when he's from an average university and his GPA is not at the top level?" Subsequently, the participant referred to supplementary materials and found that the university was ranked around the top 50, which is considered quite good. This incident highlights how human judgment can be influenced by personal experiences, potentially leading to biases, such as the availability heuristic [57] and confirmation bias, which makes individuals ignore objective truths. Notably, this issue was not observed among participants in Group B (BiasEye), as the Statistical view provided valuable evidence.
Factor 4: Participants' intrinsic perceptions of machine learning. Participants' intrinsic beliefs about machine learning significantly influenced their trust in the system. For instance, P7 expressed confidence, stating, "The system is definitely more accurate than I am." Similarly, P15 held the view that, "Machines don't get tired; they have no blind spots in attention." Conversely, some participants like P4 were more skeptical, stating, "I've learned about machine learning algorithms. If some attributes do not appear in the selected samples, it cannot learn them." P18 also struck a balance, noting, "I believe that machine learning can assist me, but I'm aware it has limitations too. I won't blindly follow it." These inherent perceptions of machine learning played a pivotal role in shaping participants' trust.

DISCUSSION AND LIMITATION
In this section, we extract future design considerations DC1∼4 (subsection 7.1) from our analysis results and questionnaire feedback. We also explore potential generalizations of our findings to other domains in subsection 7.2 and reflect on the limitations of our work in subsection 7.3.

Design Consideration
DC1: Improve the interactive capability of the system. Participants appreciated BiasEye's interactive features in our study, such as real-time score box-plot updates, highlighting the current application in the Statistical view, and quick navigation between Screening Sheets, which alleviate their workload. A bias-aware intelligent interface for decision-making should seamlessly incorporate interactive functionality, enabling users to devote more cognitive resources to thoughtful judgment. This integration is essential for encouraging users to actively address biases in decision outcomes. Additionally, such systems should gather and present more contextual information to support well-informed decisions. Our study revealed that certain participants in group A, like P5 and P13, infrequently referred to supplementary materials and were influenced by personal experiences, leading to inconsistent screening results. To alleviate the impact of inadequate or incorrect memory and perception, a recommendation is to implement dynamic annotations within the interface. These annotations could include hyperlinks to pertinent information such as school, major, competition details, and data on past admitted students. If this information could be aggregated, the interface might visualize a comparison between individual and collective data. Consequently, instead of facing unfamiliar and ambiguous perceptions, users could swiftly grasp relevant information.
DC2: Simplify views and visual designs. The design of visualization and functionality should prioritize intuitiveness, avoiding the need for complex computer expertise and minimizing the learning curve. In our study, participants acknowledged the attractiveness of glyphs but found their placement lacked meaning, as highlighted by P6 and P11. The process of visualized dimensionality reduction added cognitive demands and had the potential to cause misunderstanding. Interestingly, the Ex-situ Table view was deemed more user-friendly than the Comparison view, leading participants to prefer a format combining glyphs with a table presentation. As a result, future interface designs could incorporate tables with multiple straightforward mini-charts, offering a more effective way for users to understand data without increasing cognitive load. Additionally, for complex decision tasks like material screening, it remains uncertain whether a multi-view visual analysis strategy is a more effective option.
DC3: Enhance machine learning with human guidance. Our observations unveiled that pre-specified model training attributes approximated only a limited subset of participants' personal criteria. Despite some commonalities, each participant had unique focus areas. A universal model struggled to differentiate individual applications based on specific criteria and often misclassified similar applications due to attribute redundancy. While more intricate models and comprehensive attributes could align better with actual screening criteria, there exists a trade-off between a perfect fit and real-time response. AI methods may not be as proficient or accurate as domain experts in verifying applicants' contributions and identifying potential exaggerations. Moreover, AI faces challenges in acquiring contextual knowledge, such as how personal experiences are influenced by socioeconomic and geographic disparities. Implicit discrimination may be hidden in the superficial quantification of applicants based on factors like SAT scores and academic awards. To address the limitations of ML methods, future intelligent screening systems should adopt "human-in-the-loop" approaches. Specifically, the interface can allow model training on customized attributes, correction of deviant models, and special marking and score locking of outliers (e.g., students at risk of fraud or those considered deserving of preferential treatment).
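The "score lock" proposed here could be realized with a small per-application record that exempts marked outliers from model retraining. The sketch below is a hypothetical structure, not BiasEye's implementation; the class and field names are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class ScreeningRecord:
    """One application's screening state (hypothetical structure)."""
    app_id: str
    score: float
    locked: bool = False  # marked outlier: exempt from model correction
    note: str = ""        # e.g. "fraud risk" or "preferential case"

def trainable_records(records):
    """Keep only unlocked records for model (re)training, so specially
    marked outliers cannot skew the learned screening criteria."""
    return [r for r in records if not r.locked]
```

Filtering locked records out of the training set keeps the human's deliberate exceptions from being interpreted by the model as ordinary screening preferences.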
DC4: Acknowledge the constraints of AI assistance techniques. The majority of participants (11 out of 18, with 5 not providing a response) acknowledged that automated information extraction improved retrieval efficiency and reduced their workload. Additionally, they found that the ML method assisted in addressing inconsistencies in screening decisions. However, it was also observed that participants tended to rely heavily on AI support methods, particularly the ML predictions. Considering the limitations of ML methods mentioned in DC3, human-machine collaboration strategies should be devised so that AI complements human decision-making, rather than allowing unchecked dependence on algorithms. In this context, future intelligent interfaces should discourage the outright use of AI in initial decision-making, instead guiding users to adopt recommendations with adequate understanding. For example, the system can pop up temporary windows that declare the limitations of the AI method, inquire about users' confidence in their personal judgment versus the AI prediction, and encourage users to assess the consistency of their judgment with AI recommendations.

Generalizability
Tasks such as corporate hiring, fund applications, and scholarship selections often require the evaluation of numerous multidimensional and multi-modal materials. These tasks commonly face different cognitive biases, resulting in inconsistent outcomes and affecting individual fairness. BiasEye is flexible and can be tailored by modifying the necessary attributes and algorithms to meet the specific requirements of a task. In our user studies, the simple Ranking SVM demonstrated encouraging results in assisting with bias mitigation. We are also interested in exploring more advanced approaches, such as neural networks, capable of capturing complex reasoning processes to further improve the effectiveness of bias mitigation.
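To illustrate the kind of model referred to here, a linear Ranking SVM can be sketched in a few lines of NumPy: pairs of applications with different human scores become difference vectors that the learned weight vector should separate with a positive margin. The sub-gradient training loop, toy attribute vectors, and hyperparameters below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fit_ranking_svm(X, scores, lr=0.1, epochs=200, C=1.0):
    """Train a linear Ranking SVM by sub-gradient descent on the hinge
    loss over pairwise difference vectors: whenever application i is
    scored above application j, the model should give x_i - x_j a
    positive margin."""
    diffs = np.array([X[i] - X[j]
                      for i in range(len(scores))
                      for j in range(len(scores))
                      if scores[i] > scores[j]])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        violated = diffs[diffs @ w < 1]           # pairs inside the margin
        w -= lr * (w - C * violated.sum(axis=0))  # regularizer + hinge term
    return w

X = np.array([[0.9, 0.2], [0.4, 0.8], [0.1, 0.1]])  # toy attribute vectors
s = np.array([5, 3, 1])                              # human scores
w = fit_ranking_svm(X, s)
ranking = X @ w   # larger value = the model prefers the application
```

Because the model only learns from score orderings rather than absolute values, it can fit an individual reviewer's preferences even when different reviewers use the score scale differently.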

Limitation
This study primarily evaluates our bias-aware screening system design, excluding information extraction as an experimental condition. However, it's important to note that data extraction and classification models can introduce errors, highlighting the need for better document organization in application submissions. We recommend institutions implement formal systems for collecting structured personal information alongside documents, which can improve screening system design and functionality. Additionally, BiasEye relied on quantitative attributes for prediction, potentially missing nuanced human screening preferences, especially for indicators like project content and quality. To address this, exploring specialized language models or textual information extraction features may enhance learning and prediction, particularly in detecting biases in PSs and LoRs. Future systems could also simplify screening through content analysis for categorical comparisons of applicants. Lastly, due to constraints, we conducted a controlled in-lab study with senior students, not directly comparable to expert admissions reviewers. We plan to pursue a field study after further system optimizations.

CONCLUSION AND FUTURE WORK
This study introduces BiasEye, a specialized interactive system designed to address, detect, and mitigate potential biases in real-time screening processes. BiasEye provides users with clear global views of information, aiding in fair screening criteria formulation. It also helps identify biases by comparing actual rankings with model-predicted ones, offering immediate means for adjustment. Results from a user study show that BiasEye significantly improves reviewers' decision-making by visualizing potential biases, suggesting its value across screening tasks. Future improvements may involve advanced machine learning algorithms and broader domain applications, including enterprise and government contexts. BiasEye's development could inspire more tools for impartial decision-making and bias reduction.

Figure 4: A visualized indicator of the Statistical view.

Figure 5: A stacked time bar Ex-situ Table.
(w) with each data item (x_i), resulting in an intermediate variable denoted as f(x_i) = w · x_i = Σ_{j=1}^{m} w_j · x_{ij}, where x_{ij} represents the attribute value in the selected section. Subsequently, we mapped the values of f to the interval [min(s) − 0.5, max(s) + 0.5], preserving two decimal places to enhance transparency and facilitate explanation. This mapping yielded the prediction score s′, with the condition that s′_i = 0 if f(x_i) = 0. The prediction score is displayed in the 'Mitigate' column, and a notification appears in the bottom right of the page, listing the top k (= 10) significant model attributes and training application IDs. Users can LOCATE (D4) inconsistent decision outcomes by comprehensively comparing these predictions with their scores and cross-referencing this information with the significant attributes and original data.
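A minimal sketch of this rescaling step, under our reading of the formula above: raw outputs f(x) = w · x are linearly mapped into [min(s) − 0.5, max(s) + 0.5] and rounded to two decimals. The function name and the rule that a raw output of zero maps to a score of zero are assumptions, not the authors' code.

```python
import numpy as np

def map_predictions(raw, human_scores):
    """Rescale raw model outputs f(x) = w . x into the interval
    [min(s) - 0.5, max(s) + 0.5], rounded to two decimals.
    The special case "raw output 0 stays 0" is our reading of the
    zero condition in the text."""
    raw = np.asarray(raw, dtype=float)
    s = np.asarray(human_scores, dtype=float)
    lo, hi = s.min() - 0.5, s.max() + 0.5
    scaled = lo + (raw - raw.min()) * (hi - lo) / (raw.max() - raw.min())
    scaled = np.round(scaled, 2)
    scaled[raw == 0] = 0.0
    return scaled
```

Keeping the predictions on (slightly more than) the human score scale makes side-by-side comparison in the 'Mitigate' column straightforward.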

Figure 6: Design of glyphs in the Comparison view. To address D1, we developed a visual glyph (Figure 6) for comparing human scores and model predictions. Each glyph corresponds to an applicant, with the number denoting the application

Figure 7: Procedure of user study.

Figure 9: Differences in Revision Among Participants in Different Groups. (a) Differences in revision behavior during Phase I. (b) Differences in revision behavior during Phase II. (c) The average number of score modifications varies throughout the screening process in Phase I, with the horizontal axis representing application ID. The error bars indicate standard errors. (ns: p < .1; *: p < .05; **: p < .01; ***: p < .001; ****: p < .0001)

Figure 10: Differences in the score distribution between Baseline and BiasEye systems in Phase I.

Figure 11: Differences in the score distribution between Phase I (P1) and Phase II (P2).

Table 1: Interview with expert reviewers.
Do you think the review process is affected by time and memory? [recency bias]
(3) Do you think the sequence order of applications may affect your assessment? [contrast bias]
(4) Did you objectively assess the shortcomings of applications when you have had a favorable impression of them? [confirmation bias]
Expectation: What functions do you want to add or improve to the current screening/interview system?
committee manages a large volume of applications by randomly distributing and sequencing them. As noted by E3, E4, and E7, reviewers are instructed to target a suggested mean score, mitigating aggregation errors. Finding 4: Flexibility in reviewer work schedules.

Table 2: Structured information entries from resumes in JSON format; the extra activity section is divided into three subcategories. *: The corresponding entries represent the content of each record in that section.

Table 3: Attributes for each screening section; four sections share the attributes of School Rank and Student Rank. #: The corresponding attributes represent the quantitative outcomes following aggregation. *: Indicates that the attribute has been normalized.

Table 4: Demographic information of participants. "Experienced" means one has prior involvement in relevant screening assistance scenarios encompassing over 20 applications. Group A uses the Baseline system; group B uses the BiasEye system.