Enhancing Interface Design with AI: An Exploratory Study on a ChatGPT-4-Based Tool for Cognitive Walkthrough Inspired Evaluations

This paper introduces CWGPT, a ChatGPT-4-based tool designed for Cognitive Walkthrough (CW) inspired evaluations of web interfaces. The primary goal is to assist users, particularly students and inexperienced designers, in evaluating web interfaces. Our tool, operating as a conversational agent, provides a detailed evaluation of a user-specified task by intelligently guessing the subtasks and actions required to accomplish it, answering the standard CW questions, and offering helpful feedback and practical suggestions to improve the usability of the analyzed interface. For our study, we selected a group of web applications designed by students from a Web and Software Architecture course. We compare the outcome of the CWs we executed on ten web apps against the corresponding CWGPT analyses. We then describe the study we conducted involving five author-students to assess the tool's efficacy in helping them recognize and solve usability issues. In addition to introducing a novel adaptation of ChatGPT, the outcomes of the described experience underscore the promising potential of AI in usability evaluations.


INTRODUCTION
In the rapidly evolving field of Human-Computer Interaction (HCI), integrating Artificial Intelligence (AI) offers groundbreaking opportunities for enhancing user experience and interface design. This paper presents CWGPT, a conversational AI tool based on ChatGPT-4, designed to assist in the usability evaluation of web interfaces, providing users with an expert usability evaluation inspired by the well-known Cognitive Walkthrough method.
We address two research questions:
• RQ1: Can ChatGPT-4 be effectively leveraged to build a tool to evaluate web interface usability?
• RQ2: If so, is this tool beneficial for novices in interface design?
Regarding RQ1, we executed Cognitive Walkthroughs (CWs) on ten selected web applications and compared the outcomes with the analyses conducted by CWGPT. The web apps were selected from the exam projects submitted by the students of a Web and Software Architecture course, who were asked to develop a social web app for sharing photos. The results indicated the potential of ChatGPT-4 in effectively conducting CWs. To explore the potential benefits of CWGPT for novices approaching interface design, we conducted an exploratory user study involving five author-students. Given the involvement of only five users, we recognize that our research yields qualitative insights rather than statistically significant quantitative results, and we argue it serves as a foundation for further investigation into the promising role of AI-based tools in interface design and validation processes.

RELATED WORK
Our work is inspired by the Cognitive Walkthrough, a review process in which a group of experts evaluates a design aspect in the context of one or more specific tasks. In general, the input to a walkthrough session includes an interface (whether a description, working prototype, or series of screenshots), a task scenario, assumptions about the user base and context of use, and a sequence of actions that should be performed to complete the designated task [10]. As the process requires experts and detailed reviews, it is time-consuming and expensive. To the best of our knowledge, no other works currently exploit Large Language Models (LLMs) to perform usability evaluation in the spirit of cognitive walkthroughs.
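As a concrete illustration, the inputs of a walkthrough session described above can be grouped as a small record; the field names and example values in this sketch are ours, for illustration only, and are not part of the method's definition.

```python
from dataclasses import dataclass

@dataclass
class WalkthroughInput:
    interface: str          # description, working prototype, or screenshots
    task_scenario: str      # the task under evaluation
    user_assumptions: str   # assumed user base and context of use
    actions: list[str]      # action sequence that completes the task

# Hypothetical example, mirroring the setting used later in the paper.
example = WalkthroughInput(
    interface="screenshots of a photo-sharing web app",
    task_scenario="Upload a photo",
    user_assumptions="novice users on a desktop browser",
    actions=["Click 'Upload'", "Select a file", "Confirm", "Close the dialog"],
)
```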
The use of LLMs to imitate human interactions with interfaces is an emerging research field. In particular, in automatic Graphical User Interface (GUI) testing, researchers are experimenting with LLMs to obtain realistic feedback from application testing. They let the model use the application while annotating its comments and the line of reasoning it follows [8,14]. The use of LLMs in UI layout generation is also noteworthy: here the model helps identify target users and their needs and constructs basic interfaces for the tasks [2,3]. In particular, York [15] presented a work that experiments with LLMs to help novices and student designers develop ideas, designs, and code to begin their projects. Moreover, Schmidt [11] discussed the transformative impact of Generative Artificial Intelligence (GenAI) and LLMs, emphasizing their capacity to expedite multiple stages within the development lifecycle of interactive systems.
Regarding the use of ChatGPT, recent works adopted it to simulate how a user would perform a task in a given application [8,12,14]. GPT agents [9] have also been used to obtain feedback on different aspects of usability [14]. These works test Android mobile applications; rather than the actual graphical interface, they take as input a natural language description of the application derived from Android's hierarchy files.
In more detail, Liu et al. [8] used ChatGPT-3 to discover tasks automatically, performing activity coverage tests and bug detection, and showed significant improvements in these three tasks compared to the baseline models [6,7]. Wen et al. [12] used GPT to test the completion rate of the authors' designed tasks. The work highlights the limitations of using natural language descriptors of GUIs, as GPT failed to complete tasks involving unnamed elements, such as checkmark buttons and search boxes. Finally, Yoon et al. [14] presented a GUI testing tool that uses different GPT agents to perform different roles: planner, actor, observer, and reflector.
Ultimately, these works are limited by the natural language descriptor of the GUI: depending on the descriptor employed, some elements might not be described satisfactorily, making comparisons between tools unfair. On the other hand, researchers responded positively to the comments left by the GPT agents, marking them as helpful for understanding the problems real users face when navigating interfaces. However, the focus of these works is on the capacity of LLMs to perform realistic tests.

CWGPT
Leveraging the builder feature of ChatGPT-4, we developed CWGPT, a specialized conversational agent for facilitating CW-inspired evaluations in web interface design. This agent aims to integrate the analytical capabilities of AI with usability evaluation methodologies. CWGPT is published on ChatGPT and is available through ChatGPT-4. We decided to focus on web interfaces because of their widespread usage, ease of capture through screenshots, and the challenges they pose for novice designers, making them ideal for usability evaluation. Additionally, our access to students from the Web and Software Architecture course provided a readily available source of real-world web interfaces for testing, perfectly aligning with our target user base.
It is important to note that our tool does not autonomously navigate an interface by following the links between views. Instead, it engages users interactively, asking them to perform actions (e.g., clicking a button) to proceed. This approach enables the tool to explore the interface in an informed manner while actively involving users in the evaluation process.
For the development of CWGPT, we utilized OpenAI's GPT builder, a specialized tool crafted for constructing customized GPT-based chat interfaces. This builder functions as a conversational agent itself, enabling customization through natural language interactions. Throughout our development process, we iteratively refined our prompts to achieve the desired behavior for CWGPT. Our experience highlighted the importance of precise prompt engineering, a skill underscored in current literature [1]. Ultimately, we found that linking each new instruction to the preceding one in the prompt yielded the most effective solution to meet our diverse range of requirements. CWGPT initiates each session by requesting from the users a specific task for evaluation, along with a screenshot of the starting interface. Relying on ChatGPT-4's advanced language processing abilities, CWGPT is not confined to predefined prompts, enabling it to process various user inputs, such as requests for assistance, clarification, or supplementary information.
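CWGPT itself was configured conversationally through the GPT builder, and its exact instructions are not reproduced here. As a minimal sketch, the chained-instruction style described above could be expressed against the OpenAI chat API roughly as follows; the model name, prompt wording, and placeholder image are illustrative assumptions, not the tool's actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical reconstruction of the chained instructions, each one
# picking up where the previous one left off.
SYSTEM_PROMPT = (
    "You are a usability evaluator performing a Cognitive Walkthrough of a web "
    "interface. First, ask the user for a target task and a screenshot of the "
    "starting interface. After receiving them, hypothesize the subtasks and "
    "actions needed to complete the task. For each action, ask the user to "
    "perform it and upload a screenshot of the result before continuing. "
    "After the last subtask, answer the four standard CW questions for every "
    "action. Finally, summarize usability strengths and weaknesses and give "
    "practical suggestions for improvement."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable GPT-4-class model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Task: Upload a photo."},
                # Placeholder; a real call needs a base64-encoded screenshot.
                {"type": "image_url",
                 "image_url": {"url": "data:image/png;base64,..."}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```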

CWGPT evaluation process
In the first interactive exploration phase, CWGPT hypothesizes the most probable subtasks and corresponding actions that lead to the completion of the target task (Fig. 1). Users are prompted to execute specific actions for each identified subtask and then provide a screenshot of the interface post-action.
After the execution of all the subtasks, CWGPT starts the evaluation phase. Applying the core principles of the CW method as outlined by Wharton et al. [13], it assesses each action in terms of its effectiveness in helping the user achieve their goals, the visibility and accessibility of the action, the user's ability to associate the action with the desired outcome, and the recognition of progress towards task completion.
Upon concluding both phases, CWGPT assembles a comprehensive evaluation of the interface, including insights into its usability strengths and weaknesses and practical suggestions for enhancements.
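To make the two phases concrete, the following is a minimal sketch of the flow just described. The model calls are stubbed out; in the real tool they are turns of a ChatGPT-4 conversation, and all function names here are hypothetical.

```python
CW_QUESTIONS = [  # the four questions of Wharton et al. [13], paraphrased
    "Will the user try to achieve the right effect?",
    "Will the user notice that the correct action is available?",
    "Will the user associate the action with the desired outcome?",
    "Will the user see that progress is being made toward the task?",
]

def hypothesize_next_action(task, screenshot):
    # Stub: the real tool asks the LLM to propose the next action,
    # or to declare the task complete.
    return None if screenshot is None else "Click the 'Upload' button"

def ask_model(question, action, screenshot):
    # Stub: the real tool asks the LLM to answer on the four-point scale.
    return "Likely yes"

def run_walkthrough(task, screenshots):
    # Phase 1: interactive exploration, one user-supplied screenshot per step.
    trace = []
    for shot in screenshots:
        action = hypothesize_next_action(task, shot)
        if action is None:
            break  # the model judges the task complete
        print(f"Please perform: {action}, then upload the resulting screenshot.")
        trace.append((action, shot))

    # Phase 2: answer the four CW questions for every recorded action.
    for action, shot in trace:
        for q in CW_QUESTIONS:
            print(f"[{action}] {q} -> {ask_model(q, action, shot)}")

run_walkthrough("Upload a photo", ["home.png", "upload_dialog.png", None])
```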

EXPLORATORY STUDY

Comparison of human and CWGPT evaluations (RQ1)
Methodology. We selected ten web applications displaying noticeable usability issues from projects developed during a Web and Software Architecture course. Notably, this course primarily focuses on development, and interface design is outside its scope. All the selected interfaces have been included as supplementary material. The task chosen for evaluation across all interfaces was the standard action of "Upload a photo". Our objectives in this phase were twofold: firstly, to assess the capability of CWGPT in determining the sequence of actions required for task completion, and secondly, to evaluate whether CWGPT identified similar, fewer, or additional usability issues compared to human-conducted CWs. We ourselves conducted standard CWs for each interface. For each action required to complete the task, we answered the traditional four CW questions with one of four answers: 1) Yes, 2) Likely yes, 3) Likely no, 4) No, adding brief comments for non-affirmative answers. In the following, we refer to this phase as the expert- or human-led CWs. Concurrently, we employed CWGPT to evaluate the same set of interfaces. The style of the CWGPT answers is similar, although more detailed and verbose.

Results. CWGPT consistently demonstrated its capability to determine task sequences independently, highlighting its effectiveness in usability evaluation. In one case, it considered the task completed while missing the last action of the sequence ("Click on the close button of the dialog"), terminating the session early.
Regarding the detection of usability issues, we quantified the agreement between the experts' CWs and the results obtained by CWGPT by comparing the answers to the CW questions.
The ten web interfaces analyzed required a total of 32 actions to perform the task "Upload a photo" (an average of 3.2 actions per interface). For each action, a CW requires answering 4 questions, totaling 128 questions. We compared the answers from the human experts with those generated by CWGPT. If both answers are either "Yes" or "Likely yes", or both are "No" or "Likely no", we count the pair as an agreement; otherwise, we count it as a disagreement. Out of the 128 answers, as reported in Table 1, 116 showed agreement, 8 exhibited disagreement, and 4 could not be compared because, on one occasion, CWGPT did not recognize an action and thus did not answer the questions.
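The counting scheme reduces the four-point scale to a binary polarity before comparison. A minimal sketch of it follows, using toy answer pairs rather than the study's actual per-question records.

```python
from collections import Counter

AFFIRMATIVE = {"Yes", "Likely yes"}  # the two negative answers are "No" and "Likely no"

def compare(expert: str, cwgpt: str | None) -> str:
    """Classify one question's pair of expert/CWGPT answers."""
    if cwgpt is None:  # CWGPT did not recognize the action
        return "not comparable"
    same_polarity = (expert in AFFIRMATIVE) == (cwgpt in AFFIRMATIVE)
    return "agreement" if same_polarity else "disagreement"

pairs = [("Yes", "Likely yes"), ("Likely no", "Yes"), ("No", None)]  # toy data
print(Counter(compare(e, c) for e, c in pairs))
# Counter({'agreement': 1, 'disagreement': 1, 'not comparable': 1})
# In the study: 128 questions -> 116 agreements, 8 disagreements, 4 not comparable.
```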
Table 1: Overall agreements between experts and CWGPT answers

Agreement: 116 | Disagreement: 8 | Not comparable: 4

In summary, the results of this experiment phase indicate a clear trend of agreement between CWGPT and human experts. Regarding the quality of the evaluation assessments, CWGPT consistently provided more detailed evaluations than the succinct "Yes" or "Likely yes" responses given by the experts. Additionally, each assessment conducted by CWGPT concluded with a comprehensive summary and practical suggestions for enhancing the interface. Notably, in all 10 instances, we agreed with CWGPT's summaries and suggestions, further highlighting the value of its assessments. On a negative note, among the reasons for disagreement, CWGPT overlooked issues that were evident to human evaluators, or categorized them as "minor improvements". This tendency was particularly noticeable in interfaces heavily reliant on visual design, such as color schemes, layout, or visual hierarchy.

User study of CWGPT (RQ2)
Methodology. In the second phase of our study, we assessed the practical value and acceptance of CWGPT in aiding designers with web interface usability evaluations. Five student authors of interfaces selected from the first phase participated in the experiment; the interfaces were chosen based on author availability rather than issue severity. Each student evaluated the "Upload a photo" task using CWGPT on the interface they had authored. Participants were aged 20-25; four were male and one was female. Although four out of five had some HCI background, they required refreshers on CW specifics. All participants were familiar with ChatGPT.
Each test session, led by an interviewer and an observer, followed a structured procedure:
• Initial Interview: We asked the student their age, HCI background, and familiarity with CWs and ChatGPT.
• Task Execution: The student independently performed the designated task, noting any observed mistakes or issues with the interface.
• CWGPT Think Aloud: The student used CWGPT to conduct a Cognitive Walkthrough of their interface on the task "Upload a photo". After a brief introduction to CWGPT, they were encouraged to use it freely, as they would any other ChatGPT-based chat, ideally without intervention by the interviewer. This session was conducted with the think-aloud method.
• Interface Changes Interview: Upon completion, the student was asked whether they felt their interface needed improvements and whether they agreed with the CWGPT evaluation. If so, the student was asked to describe which interface elements needed improvement.
• Feedback and SUS: To conclude, the student was invited to share their perspective on the utility of CWGPT and their overall experience. Subsequently, they filled out a standard System Usability Scale (SUS) questionnaire [5]; a scoring sketch follows this list.
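For reference, standard SUS scoring maps the ten 1-5 Likert responses to a 0-100 score: odd-numbered items contribute (s - 1), even-numbered items contribute (5 - s), and the sum is multiplied by 2.5. A minimal sketch, with hypothetical responses:

```python
def sus_score(answers: list[int]) -> float:
    """answers: ten 1-5 Likert responses, in questionnaire order."""
    assert len(answers) == 10
    total = sum(
        (s - 1) if i % 2 == 0 else (5 - s)  # 0-based even index = odd-numbered item
        for i, s in enumerate(answers)
    )
    return total * 2.5

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # hypothetical responses -> 85.0
```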

Results. In the initial phase of the test sessions (Task Execution), participants generally exhibited a positive outlook, believing their interfaces did not exhibit significant problems, though they acknowledged general room for improvement. Two participants shared that the task was relatively straightforward and did not reveal any immediate, critical issues. Participants were then asked to use CWGPT to perform a CW evaluation of their interfaces, focusing on the "Upload a photo" task (CWGPT Think Aloud). While they used CWGPT with relative ease, there were some notable observations.
Participants initially displayed some uncertainty in initiating the interaction with CWGPT. Two individuals needed help getting started, and the interviewer suggested they directly request CWGPT's assistance. In contrast, the remaining participants attempted to initiate the process by sharing only part of the required prompt, sending either only the screenshot or only the task. However, they were promptly corrected by CWGPT, which facilitated their understanding of the necessary steps.
Of particular interest, one participant with no prior background in Human-Computer Interaction (HCI) encountered challenges in grasping the essence of the evaluation process, initially struggling with the concept of a "task".Nevertheless, this participant persisted in the conversation led by CWGPT and completed the session.Interestingly, upon reviewing the generated CW and overall evaluation, this student better understood the tool's purpose and utility.
During interactions, CWGPT sometimes showed uncertainty about the appropriate action to take, resulting in unnecessary steps.Participants were surprised but realized these variations were due to unclear design elements needing improvement.
In one specific case, CWGPT instructed the participant to perform an action related to an element that did not exist within the interface. This prompted uncertainty from the student regarding whether they could correct the instruction. However, it was noted that the student was not entirely surprised by this occurrence, understanding that AI can occasionally generate unexpected outputs, or "hallucinations".

After completing the CWGPT session (Interface Changes Interview), all students unanimously recognized the necessity for modifications and improvements to their interfaces. They comprehensively grasped the most pertinent usability issues identified during the evaluation process. Notably, one student expressed a minor disagreement with a suggestion received, deeming it "excessive" and questioning whether a human user would perceive it as a genuine problem. The suggestion, which concerned adding labels to icons that were initially ambiguous for the tool, was validated as standard practice by the experts. In another instance, a student was so impressed with the tool's capabilities that they inquired about its capacity to analyze interfaces of other types and about using it for a personal project.
Finally, participants were invited to share their general feedback and insights regarding their experience with CWGPT (Feedback and SUS). Encouragingly, all students expressed satisfaction with the results achieved through CWGPT. They uniformly stated that the tool was easy to use and adept at identifying relevant issues. Notably, no significant negative feedback emerged during this phase, underlining the tool's overall positive reception.
The trends in the SUS questionnaire results, employed as a standard benchmark, were predominantly positive, particularly for the questions assessing the system's complexity and usability. Due to the limited number of participants, we do not report scores computed with the standard formulas for interpreting the results. The complete questionnaire and responses have been included as supplementary material.

DISCUSSION
Our exploratory study highlights promising aspects of CWGPT's usability in web interface evaluations. Comparing expert-led CWs with CWGPT evaluations, we demonstrated its ability to discern task completion sequences and explore interfaces through screenshots, compensating for its inability to navigate them directly. Usability assessments involving students further emphasized its user-friendliness and effectiveness in identifying issues, reinforced by positive feedback. However, several limitations require discussion.
One concern is CWGPT's effectiveness in handling visual and design aspects, such as color schemes and layout arrangement. For example, unlike the human evaluators, it did not flag small buttons placed in remote corners as issues. Moreover, providing CWGPT with entire-page screenshots, including content below the fold, might lead it to erroneously treat concealed controls as visible, hindering its ability to detect usability issues.
Another concern is dynamic interfaces, which we did not directly assess but which could pose challenges for CWGPT, whose input is limited to static screenshots. Moreover, the interfaces we tested were straightforward, leaving room for more thorough evaluations with more complex, yet still static, interfaces.
Finally, like any AI-based tool, CWGPT may encounter errors due to AI unpredictability, such as "hallucinations" or conversation deviations. Prompts phrased differently than expected might also yield varying responses, potentially deviating from the expected conversation flow. Therefore, it is crucial to acknowledge the inherent susceptibility of AI to errors [4] and to adjust expectations accordingly when using the tool.
In summary, these limitations are important factors to consider when using CWGPT in usability evaluations.While our study provides promising insights, future research should explore these aspects in different settings to understand their implications comprehensively.CWGPT might replace experts in some usability testing aspects, especially where systematic evaluation using CW principles is required and expert availability is limited.Ultimately, the aim is to enhance usability awareness and practices among novices.

CONCLUSIONS AND FUTURE WORK
The proposed conversational agent offers a user-friendly approach to usability assessment, making it particularly advantageous for students and novice designers seeking to enhance their interface design skills. With an exploratory study, we addressed the validity of CWGPT evaluations (RQ1) and the practical utility and acceptance of CWGPT in assisting designers (RQ2).
In future work, our efforts will focus on exploring the applicability of CWGPT beyond web interfaces, such as to mobile apps and other kinds of software interfaces, and on broadening the tool's scope and utility.