Leveraging ChatGPT for Automated Human-centered Explanations in Recommender Systems

The adoption of recommender systems (RSs) in various domains has become increasingly popular, but concerns have been raised about their lack of transparency and interpretability. While significant advancements have been made in creating explainable RSs, there is still a shortage of automated approaches that can deliver meaningful and contextual human-centered explanations. Numerous researchers have evaluated explanations based on human-generated recommendations and explanations to address this gap. However, such approaches do not scale for real-world systems. Building on recent research that exploits Large Language Models (LLMs) for RSs, we propose leveraging the conversational capabilities of ChatGPT to provide users with personalized, human-like, and meaningful explanations for recommended items. Our paper presents one of the first user studies that measure users’ perceptions of ChatGPT-generated explanations while acting as an RS. Regarding recommendations, we assess whether users prefer ChatGPT over random (but popular) recommendations. Concerning explanations, we assess users’ perceptions of personalization, effectiveness, and persuasiveness. Our findings reveal that users tend to prefer ChatGPT-generated recommendations over popular ones. Additionally, personalized rather than generic explanations prove to be more effective when the recommended item is unfamiliar.


INTRODUCTION
Recommender systems (RSs) have gained significant traction in various domains, revolutionizing how users discover and interact with information, products, and services.However, the often highly complex machine learning models used for generating personalized recommendations raise concerns about transparency and interpretability.
The demand for transparency and interpretability in RSs stems from the need to build user trust and confidence.Understanding the reasoning behind recommendations is crucial for users to make informed decisions and for organizations to comply with ethical and regulatory requirements.Consequently, significant efforts have been dedicated to creating explainable RSs [19].
Despite such advancements, there is still a shortage of automated approaches that can deliver meaningful and contextual humancentered explanations.Current methods often fall short in providing explanations that are interpretable, personalized, and capable of addressing individual users' diverse needs and preferences.Thus, there is a growing interest in developing novel techniques that leverage cutting-edge technologies to enhance the quality of explanations provided by RSs.
Several related works have examined the concept of explainable RSs through the lens of human-generated recommendations and explanations [1,2,12].These studies provide insight into how explanations can impact users if they are similar to how people themselves explain things to each other.However, this approach has limitations in terms of scaling for large RSs with millions of users and items.Additionally, it is difficult to ensure the quality of explanations provided by humans [1].LLM-based services like Chat-GPT have demonstrated remarkable natural language processing and generation capabilities, enabling human-like conversational interactions.LLMs have advanced to the point where machine and human-authored texts are indistinguishable for untrained individuals [3].Leveraging these conversational abilities, we propose using ChatGPT to provide users with personalized, human-like, and meaningful explanations for recommended items.
Research on RSs powered by LLMs has started to flourish [11,13,18,20].While many current RSs predominantly rely on user behavior data, LLMs extract and incorporate a wealth of knowledge from large-scale web corpora.This allows LLMs to possess knowledge that enriches behavioral data.For example, an RS powered by an LLM, such as ChatGPT, might suggest the best classic movies of all time, utilizing a zero-shot approach, even in the absence of historical data (ratings, clicks, viewing times) regarding the users' movie preferences.By framing user preference data within task prompts, the inherent knowledge within LLMs can be leveraged to produce personalized recommendations.The reasoning abilities of LLMs can discern user preferences from the context supplied in these prompts.
In this paper, we present one of the first user studies for measuring the users' perceptions of explanations provided by ChatGPT acting as an RS.With 94 participants, we evaluate how ChatGPT's personalized and generic explanations are perceived.We evaluate this for recommendations generated by ChatGPT (based on movies a person liked or disliked) and for a set of random (but popular) recommendations.We also investigate how people evaluate ChatGPT explanations for movies they should avoid (so-called disrecommendations).This research is anchored by the following central research question: "How do users experience and evaluate personalized explanations generated by ChatGPT?" Our results indicate that users tend to prefer ChatGPT's personalized recommendations over random selections of popular movies.Surprisingly, even when ChatGPT bases its explanations on users' movie preferences, they are not perceived as more personalized than generic ones unless the recommendations are random.This insight also extends to the perceived effectiveness and persuasiveness of the explanations.We further explore why these scenarios occur and investigate the interconnectedness between different explanation aspects like personalization, persuasiveness, and satisfaction.

RELATED WORK
Explainable RSs have been a significant research topic in the past [19].However, the recent emergence of LLMs has introduced new possibilities in this field, particularly regarding personalized explanations.We will now highlight some related works on explainable RSs in general and specifically focus on recent literature that utilizes LLMs for recommendation purposes.
Chang et al. [2] introduced a method for providing personalized natural language explanations for recommendations.They achieved this by employing crowd workers who synthesized explanations from movie reviews.The approach was integrated into the Movielens website and was found to be more efficient and effective than personalized tag-based explanations.This study emphasizes the benefits of using natural language over other approaches to provide meaningful explanations.
Evaluating explanation quality has become an increasingly important topic in recent years.Balog and Radlinski [1] proposed evaluating seven explanation goals originally introduced by Tintarev and Masthoff [16], namely effectiveness, efficiency, persuasiveness, satisfaction, scrutability, transparency, and trust.These goals align with the dimensions used by Chang et al. [2] and other researchers in the field.Balog and Radlinsky conducted a study where groups of crowd workers developed and evaluated personalized explanations of recommendations.They discovered that the seven goals are not independent and are often highly correlated, even when explanations were created to reflect a specific goal.Surprisingly, non-personalized explanations are equally as good as personalized ones, suggesting that crowd workers may lack the skills to create explanations for specific goals or that users cannot distinguish between them.
Lu et al. [12], in a user study similar to Balog and Radlinski [1], evaluated the effectiveness of machine-generated versus humangenerated explanations.The results indicated that users were more satisfied with human-generated explanations.Based on these findings, a new approach inspired by human explanations was proposed, resulting in even higher user satisfaction.
While the evaluation of explanations quality has predominantly been conducted in online settings, Li et al. [10] developed an approach for offline evaluation of explanations analogous to traditional evaluation methods for recommendations.The proposed approach formulates explanations as a ranking task, which can be evaluated using conventional ranking metrics such as nDCG and accuracy metrics such as precision and recall.
The emergence of LLMs like GPT-3 and GPT-4 has opened new avenues in RSs.For example, the works of Harrison et al. [7] and Huang et al. [8] employ GPT-4 to generate accurate and contextually relevant recommendations.Wang et al. [17] merges GPT-3.5 and GPT-4 embeddings into reasoning graphs to improve recommendation quality.Shu et al. [15] and Lyu et al. [13] delve into the personalization aspects of LLM-based RS, emphasizing their ability to provide tailored recommendations based on individual user behavior.Our work is particularly inspired by Gao et al. [6], who introduced Chat-REC, a conversational recommender system framework that converts user profiles and user-item interaction history into prompts and leverages ChatGPT to generate recommendations and explanations through natural language interaction with the user.The evaluation of Chat-REC only measures recommendation quality, where the results demonstrate that ChatGPT outperforms many common recommendation approaches regarding precision/recall and nDCG.The paper also outlines how ChatGPT can summarize the preferences of Movielens100k users and leverage those summaries in recommendation prompts.
While some of these works highlight how explanations can be generated, such as Chat-REC, where explanations are embedded as part of the conversation with the user, they do not specifically evaluate the effect of personalized recommendations and explanations generated by the underlying LLM.Zhou and Joachims [22] present one of the pioneering studies moving in this direction.Their study compares the effectiveness of ChatGPT-generated text reviews for movies in a mockup recommendation environment to human-written reviews.Through a survey involving 120 participants, they found no significant differences in how participants ranked movies or rated reviews for unfamiliar movies.However, ChatGPT-generated reviews were favored for movies participants had seen before.They also investigated the specific attributes of the review texts that influenced participants' perceptions of quality.
Informed by the aforementioned literature, we perform a user study in which we compare recommendations generated by Chat-GPT to random (but popular) recommendations and accompany these recommendations with explanations generated by ChatGPT that are either generic or personalized towards the users' preferences, the latter hereafter referred to as user-based explanations 1 .We posit the following research questions to bridge the existing gaps and extend our understanding of the personalized explanations and recommendations generated by ChatGPT: RQ1.How do users value personalized ChatGPT-generated recommendations when compared to random recommendations?RQ2.How do users perceive user-based versus generic ChatGPTgenerated explanations in relation to recommendation methods and explanation goals such as effectiveness, personalization, and persuasiveness?RQ3.Do user-based versus generic explanations work differently for familiar or unfamiliar movies?RQ4.How do explanation goals such as personalization, persuasiveness, and effectiveness relate?
RQ1 is grounded on the works of Harrison et al. [7], Huang et al. [8], and Gao et al. [6], which demonstrate the efficacy of GPT models in generating accurate and contextually relevant recommendations.However, these works do not specifically assess if users experience personalization.RQ1 seeks to fill this gap by measuring the level of personalization users experience in the recommendations.Since our investigation primarily revolves around comparing user-based versus generic explanations, it is key to identify whether users experience personalization from the beginning.This question is not designed to affirm whether ChatGPT is the pre-eminent recommender, but it serves to validate if ChatGPT can deliver a discernibly personalized experience through its recommendations when compared to random selections.RQ1 thus sets the tone for evaluating user-based versus generic explanations in the subsequent research questions.
For RQ2, the inspiration comes from the surprising results of Balog and Radlinski [1], who found non-personalized explanations to be just as good as personalized ones.This contradiction to the commonly held belief underlines the need to explore how users perceive user-based versus generic ChatGPT-generated explanations regarding various aspects such as recommendation method, effectiveness, and persuasiveness.
RQ3 extends the study of Zhou and Joachims [22] to evaluate the effectiveness of user-based versus generic explanations in the context of both familiar and unfamiliar movies, offering a comprehensive insight into the varying effects based on the user's prior knowledge about the recommended items.
RQ4 explores how the different explanation goals and user perceptions relate.This question is inspired by the work of Balog and Radlinski [1], who found strong correlations between explanation goals.However, Balog and Radlinski [1] did not explore these relations in more depth.Similar to earlier work in user-centric evaluation by Knijnenburg et al. [9], we explore to what extent the effectiveness of the explanation and satisfaction of the recommendation will depend on personalization and persuasiveness using path modeling.
To summarize, our outlined research questions are designed to build upon each other, offering a sequential understanding that progresses from establishing the basic effectiveness of ChatGPT in offering personalized recommendations to understanding the depth of its impact on varied contexts and settings involving explanations.

EXPERIMENT DESIGN
The user study took place across three separate batches on the Prolific crowdsourcing platform between June 8-15, 2023.Participants received an average hourly reward of £6.08 and were chosen based on their English fluency and a balanced gender distribution.A total of 94 participants concluded the survey, yielding 564 recommendations.
From the Prolific platform, participants were directed to a user interface we developed to carry out our experiment2 , employing a within-subject design in three stages.First, we requested users to provide a list of six movies, comprising three they enjoyed and three they disliked, to gather their preferences.Subsequently, we used the OpenAI Chat Completions API, which is powered by the GPT3.5-Turbomodel, to generate four recommendations and two disrecommendations, along with explanations, based on their stated preferences.Finally, we applied a questionnaire to collect user opinions about each of the provided recommendations and explanations.Below, we present the details for each step.

Collecting user preferences
We requested each user to provide six movies, equally divided between movies they liked and disliked (see Fig. 1).We integrated the OMDb API3 to assist users in locating the accurate movie title, with the corresponding movie poster displayed for easy verification, limiting the search for movies prior to 2021, due to the knowledge cutoff of GPT3.5-Turbo being September 2021.As soon as the user fills out the questionnaire, the Next button is enabled, and when clicked, the user is redirected to a waiting page while the recommendations and explanations are generated in the backend.

Generating recommendations and explanations
We used the provided preferences to generate recommendations and explanations.Each participant received four recommendations and two disrecommendations.Two of these recommendations and the two disrecommendations were generated by calls to the OpenAI Chat Completions API 4 .The other two recommendations were drawn at random from a pool of 594 movies, curated from the 50 most-rated movies in each genre on IMDb.By randomly recommending movies, we can control for user familiarity.If users were only recommended movies they already knew, it could skew their perceptions based on prior knowledge or experiences.Random recommendations could introduce movies that users might not be familiar with, allowing the research to isolate the effect of the explanation itself on user decisions rather than the user's prior knowledge of the movie.Inspired by Zhang et al. [21], we also show two movies they should avoid and the reasons why.By showing recommendations and disrecommendations, we provide a broader range of feedback opportunities.For instance, while a recommendation could be based on a user's preference, a disrecommendation provides information about what not to watch.This might be equally valuable for users, as avoiding a potentially bad movie experience can be as beneficial as finding a good one.
The prompts used to generate the recommendations and explanations are shown in Fig. 2. We created a base prompt containing the user's answers, as shown in the upper left part of the figure.This prompt will be concatenated with every prompt that requires personalization.The upper right box shows the prompt used to get the four recommendations generated by the GPT3.5-Turbomodel (two positive and two negative).Positive and negative recommendations are requested in the same prompt to improve performance by avoiding an extra request since we did not observe a difference when using two separate prompts.
The lower boxes in Fig. 2 present the prompts used for generating respectively user-based and generic explanations.An explanation is generated for each recommendation and disrecommendation, with an equal distribution of user-based and generic explanations.The only difference between the prompts for generating explanations for recommendations and disrecommendations is the not keyword added for the latter case.Each explanation was generated using a single separate request.
In addition to the text presented in Fig. 2, each prompt had some formatting guidelines, such as a description of how the output was to be formatted and an instruction not to recommend any of the movies given as input.In order to access the OpenAI's API, we used the provided python library openai5 by calling the Chat Completion creation interface (openai.ChatCompletion.create).We set the temperature to 0.0 for better reproducibility and defined no frequency or presence penalty.A full description of the effect of these parameters can be found in the Chat Completion API Reference 6 .
We randomized the presentation order of all four recommendations and two disrecommendations, with the disrecommendations invariably presented last.

Evaluating the Explanations
For each recommendation and disrecommendation, we presented the user with a page (see Fig. 3a) containing the poster and name of the (dis)recommended movie, with the explanation next to it, followed by two blocks of questions about the recommendation and explanation (see Fig. 3b) 7 .After fulfilling the evaluation for one particular (dis)recommendation, the user can go to the next one by clicking the next button.
The first set of questions inquires if participants recognized the movie (familiarity) and whether they would enjoy it (satisfaction).Subsequently, we asked three questions related to the explanations' helpfulness, personalization, and convincingness.Note that helpfulness refers to how effective the explanation was in helping users make informed decisions, while convincingness refers to how persuasive the explanation was in convincing users to consume the item, as defined by Tintarev and Masthoff [16].Hereafter, we will use the standard terminologies 'effectiveness' and 'persuasiveness' instead of 'usefulness' and 'convincingness,' respectively, in line with prevailing literature.To collect these opinions, we used statements with a 5-point Likert-scale answer option (strongly disagree, disagree, neutral, agree, strongly agree).

RESULTS
Our study involves each of the 94 participants providing six sets of responses to the four recommendations and two disrecommendations.In total, our collected data contains responses for 564 recommendations.To analyze our data, we use multilevel linear regression with a random intercept model that accounts for the repeated nature of participant responses.Our results are presented through plots displaying the estimated means from the models.Our initial focus is on the four positive recommendations, followed by a discussion on the two disrecommendations.

Recommendation Satisfaction
For RQ1, we evaluated how participants determined their potential satisfaction with a movie based on two factors: the personalization of the recommendations and their familiarity with the film.Both recommendation methods generated a substantial number of unfamiliar movies, i.e., users indicated that 25% of the ChatGPT-based recommendations were unfamiliar, whereas 49% of the random recommendations were unfamiliar.Our results show that participants found the movie less enjoyable when the recommendation was random rather than personalized by ChatGPT ( = −.53, < .001)and less enjoyable when the movie was unfamiliar rather than known ( = −.78, < .001),as shown in Fig. 4a.We did not find any interaction effect between familiarity and recommendation method.These findings suggest that ChatGPT-crafted recommendations are favored over random (but popular) ones, similarly for both familiar and unfamiliar movies.

User-based vs. Generic Explanations
Regarding RQ2, we compared user-based vs. generic explanations in terms of effectiveness, personalization, and persuasiveness.Our findings suggest that the explanations generated for ChatGPT recommendations are perceived as more effective than the explanations generated for random recommendations ( = .37, < .001).However, we did not observe any significant difference in the effectiveness of user-based versus generic explanations ( = −.10, = 0.37), as shown in Fig. 4b.Personalization was determined by asking participants if the explanations resonated with their movie preferences.Interestingly, we found that user-based explanations, which explicitly mentioned the movie preferences of the participant (see Fig. 8), were not perceived as significantly more personalized than generic ones.We further discuss this finding in Section 5.Only in the case of random recommendations, which typically score lower on personalization ( = −0.80, < .001),user-based explanations were perceived as more personalized than generic ones, as depicted in Fig. 4c and reflected in a significant interaction between the recommendation and explanation type variables in our model ( = 0.49,  < .01).Regarding persuasiveness, explanations were less convincing for random recommendations ( = −.47, < .001),with no significant difference between user-based and generic explanations.For ChatGPT-generated recommendations, user-based explanations appeared to be slightly less convincing than generic ones, as shown in Fig. 4d, though not significant ( = −.27, = .14).

Movie Familiarity Analysis
Regarding RQ3 (cf.Fig. 5), we analyzed if perceptions of effectiveness, personalization, and persuasiveness of explanations differed between familiar and unfamiliar movies.In terms of effectiveness, explanations for unknown movies are less effective in general ( = −.41, < .01),but as is clear from the Fig. 5a, this is mostly for generic explanations.User-based explanations are equally effective in all conditions and thus help quite well for unknown movies, as reflected in the positive interaction between explanation type and familiarity ( = 0.35,  < .05).For personalization, we also observe a main effect of familiarity ( = −.53, < .001),with explanations for unknown movies feeling less personalized in general as can be seen in Fig. 5b.The figure also seems to suggest user-based explanations are less affected by familiarity, but the interaction is not significant ( = .274, = .17).For persuasiveness, we find again that explanations for unknown movies are less convincing ( = −.58, < .05),but now we find a significant main effect of explanation type ( = −.28, < .05),interacting with familiarity ( = .50, < .05):as shown in Fig. 5c, the user-based explanations are somewhat less convincing for known movies but more for unknown movies.Overall, we see a consistent pattern in that user-based explanations, compared to generic ones, seem to work better, mostly in cases where movies are unknown.Having less prior knowledge about the movie, a user-based explanation that actually relates the movie to users' preferences is more influential than if the user knows about a movie.

Path Modeling of Explanation Types and Goals
In RQ4, we asked how our users' different perceptions and experiences relate.We tested a path model in which we try to predict to what extent explanation effectiveness depends on perceptions of personalization and persuasiveness of the explanation, as well as the level of movie satisfaction that users reported.Following the user-centric framework of Knijnenburg et al. [9], we see personalization and persuasiveness as perceptions (Subjective System Aspects: SSA), whereas satisfaction and effectiveness are experience-type constructs (EXP).Our conditions are the objective system aspects (OSAs) that affect SSAs and EXPs.
In line with the work of Balog and Radlinski [1], we find that persuasiveness, personalization, satisfaction, and effectiveness are correlated, and our path model shows in more depth how they relate.Our analysis finds that effectiveness is predicted by satisfaction, persuasiveness, and personalization, with persuasiveness being the strongest predictor.Satisfaction itself goes up with known movies and GPT-based recommendations (as we already showed in the analysis of RQ1) but is also directly affected by persuasiveness and personalization.Consistent with our analysis of RQ2, we find that personalization is affected by the type of explanation, type of recommendation, and their interaction and a small effect of familiarity of the movie (see again Fig. 4b that shows the same patterns) as well as directly by the level of persuasiveness 8 .Similar to our analysis of RQ3, we find that persuasiveness is affected by explanation type and familiarity (and their interaction) and by familiarity with the   movie.Hence, whereas the path model shows the same effects as discussed in the previous sections, it brings insights into how these effects relate.An effective explanation is personalized and persuasive and ideal for a satisfactory movie.We find that persuasiveness is a function of explanation and recommendation type and movie familiarity, and personalization depends on the explanation type, recommendation type, and persuasiveness.Together, this shows when an explanation is effective and how to achieve that.

Disrecommendations Analysis
We also analyzed the two disrecommendations.These disrecommendations were always generated by ChatGPT since random recommendations provide no guarantees that we are recommending movies users should avoid.So, in this case, we look only at the effects of explanation type and familiarity.
For disrecommendations, we find that generic explanations seem more effective than user-specific ones ( = −.32, < .05),with no  interaction with familiarity, as can be seen in Fig. 7a.For personalization, the effect is smaller and not significant.Concerning persuasiveness, we do find that user-specific explanations are somewhat less convincing ( = −.36, < 0.05).Overall, it seems explanations for disrecommendations do not benefit from user-based explanations.However, further analysis is needed to determine whether this effect is not a consequence of ChatGPT's reluctance to disrecommend, given OpenAI's efforts to avoid harmful and biased discussions 9 .

DISCUSSION
Our results highlight several interesting aspects of ChatGPT-based explanations and recommendations.In this section, we discuss the results specifically in terms of user perceptions, popularity bias, and limitations.

The Anatomy of Explanations and its Influence on the User's Perception
Our findings indicate that generic explanations can feel just as effective, persuasive, and personalized as user-based explanations, as noted by Balog and Radlinski [1].However, users tend to prefer user-based explanations when it comes to random and/or unfamiliar recommendations.We manually analyzed some of the explanations generated by the GPT model in search of common writing and/or argumentation patterns that could help us understand this finding.Some examples of user-based vs. generic explanations generated by the GPT3.5-Turbomodel for the three possible types of recommendations (GPT-generated recommendation, GPT-generated disrecommendation, and random recommendation) are depicted in Fig. 8.We highlighted four different types of arguments that we observed being used for generating the explanations.The first two, comparison with liked (light green) and disliked (light red) movies, are predominant in the user-based explanations.In these types of arguments, the model compares features of the recommended movie with the movies users stated their preferences for.The third type, named critic-similar arguments (light blue), are arguments that resemble specialized-critic opinions and hence may induce an authority bias in users, i.e., a psychological inclination to accept information from authority without critical evaluation.These arguments are featured mostly in generic explanations.Lastly, the fourth type of argument (light yellow) briefly describes the movie plot and can frequently be found in both types of explanations.
The finding that generic explanations were perceived as personalized as user-based explanations for ChatGPT-generated recommendations may be explained, to some extent, by the fact that the generic explanations highlighted movie features that were present in the personalized recommendations, which were tailored to the user's preferred movies.Thus, the user's preference signals might have been unintentionally integrated into the generic explanations.This, together with an authoritative tone, may have made generic explanations as appealing as the user-based ones.
We have found that both user-based and generic explanations share a persuasive tone that is conveyed through comparative arguments, such as "if you do this, then you will get that," and the frequent use of a second-person point of view.This observation may explain why we did not observe significant differences in the persuasiveness of the two types of explanations (see Section 4.2).In Fig. 8, we have highlighted the pronoun "you" in bold to underscore its frequent usage.The high occurrence of this type of sentence may be due to the fact that the prompt used to generate the explanation caused the model to speak to the audience or because the model itself is trained to use this type of discourse when interacting with humans.

Popularity Bias in GPT-generated Recommendations
While our primary focus in this paper is not to evaluate the quality of the GPT model's recommendations, we observed that popularity bias seems to impact movies with extremely high or low IMDb ratings, causing them to appear more frequently than expected in both recommendations and disrecommendations.Upon closer examination, we find evidence supporting this observation.There is a positive, albeit weak to moderate, correlation between the frequency of recommendations and the IMDb ratings for the GPT recommendations (Spearman's  = 0.467,  < 3e−5; Kendall's  = 0.383,  < 3e−5).Conversely, we observe a negative, albeit weak, correlation for the disrecommendations (Spearman's  = −0.268, < .05;Kendall's  = −0.205, < .05).No significant correlation was found for the random recommendations.
These findings suggest that the recommendations generated by the GPT model may indeed be influenced by popularity bias, but further analysis is needed to explore this bias's implications on the recommendations' overall performance.

Limitations
While our study has shown significant effects of GPT-generated explanations, the study comes with several limitations.Firstly, the study had a specific focus on the GPT-3.5 model, this was mainly due to the ease of use provided through its API 10 , which has few hardware requirements compared to other LLMs, such as Meta's Llama and its derivatives, since we are not responsible for running the model.This specific focus limits the generalization capabilities of our findings.In future work, we plan to explore the use of other LLMs.
Other limitations arise from the nature of LLMs and ChatGPT specifically and are expected and inherent to this type of technology.When interacting with ChatGPT, the way the prompt is formulated can significantly affect the outcome, even when using its API.For the purpose of our study, we created a standardized set of prompts (see Fig. 2) in order to minimize the effects of differences in output based on irregularities in the prompts given to Chat-GPT.Nevertheless, LLMs' (and ChatGPT's specifically) sensitivity to prompt formulation has not been explored in the scope of our work.Another limitation of our study is the arguably limited room for personalization, given that study participants only disclosed six preferences (three liked and three disliked movies).While, in traditional recommendation models, even a modest number of preferences suffice to generate reasonable recommendations [14], we have no insight into the finer details of how our ChatGPT-based recommender creates recommendations.It should, however, be noted that our study focuses on comparing personalized (user-based) and generic explanations.With that in mind, even a relatively modest level of personalization should suffice.In line with the limited insight into the finer details of ChatGPT, another limitation is that of explanation versus justification.In this context, we refer to an explanation as the ability to disclose the algorithmic reason why a certain item is recommended, i.e., an interpretation of the model, whereas by justification, we refer to a motivation as to why a certain item is recommended [5].Given the recommendation model  used in our work, the explanations generated by our system are rather justifications, i.e., human interpretable snippets informing the study participants why a certain item was recommended, than explanations.In the context of our study, the semantic differences between the terms explanation and justification should not have an effect on the outcomes given the study framing and motivations presented to study participants upon taking part in the study.
Lastly, and perhaps most importantly, the number of participants in a study like this may have an effect on the obtained results.However, given that we tested most of our effects within-subject, the data obtained from our sample of 94 participants should have sufficient power as each participant provided four evaluations for the recommendations and two for the disrecommendations.Moreover, also related to data, the crowd worker platform used in our study, Prolific, has been proven to generate higher quality data compared to other crowdworking platforms, e.g., Mechanical Turk, Crowdflower [4].To further ensure the quality of data, we reached out to several of our study participants after having finalized the study to ask about their experiences and reflections on participating in the study.Overall, the participants we contacted were responsive (100% response rate) and content with the experience.

CONCLUSION
This paper investigated how users experience and evaluate personalized explanations generated by ChatGPT.Our findings revealed that personalized recommendations from ChatGPT yielded higher user satisfaction than random (but popular) recommendations.This finding expands works based on LLMs-driven RSs that only evaluate the accuracy of recommendations through offline experiments such as Gao et al. [6], Harrison et al. [7].Interestingly, user-based explanations directly referring to the participant's movie preferences were not perceived as significantly more personalized than generic explanations unless the recommendations were randomly generated.User-based explanations were also not perceived as more effective and persuasive than generic ones, regardless of the recommendation type.This is in line with the findings of Balog and Radlinski [1], who found no significant differences between personalized and non-personalized explanations regarding different explanation goals.We observed that the features of the personalized recommendations may be leaked to the generic explanations to make them feel personalized, even when they do not explicitly mention the participant's movie preferences.We also noticed that ChatGPT has a bias toward producing persuasive explanations.These observations may explain to some extent our findings, although further analysis is needed.
Furthermore, user-based explanations were perceived as somewhat more effective, personalized, and persuasive for unfamiliar movies, perhaps because prior knowledge about movies had less influence on decision-making, leaving more room for explanations to influence users' choices.This finding contradicts the results of Zhou and Joachims [22], who found higher effects for familiar movies.Although our study is not directly comparable to theirs, this difference motivates future works on the impact of movie familiarity on users' perception of what makes a good recommendation/explanation.Regarding disrecommendations, explanations did not seem to benefit from user-based explanations.
We also conducted a path modeling analysis to shed light on explanation types and goal interdependencies.Our analysis revealed that explanation effectiveness is strongly predicted by users' satisfaction, persuasiveness, and personalization perceptions, with persuasiveness exerting the most significant influence.As for personalization, it was influenced by various factors, including the type of explanation, recommendation, and their interaction, as well as persuasiveness.In summary, our path model elucidates the conditions that lead to effective explanations and how to achieve them, providing more in-depth insights into the correlations between explanation goals as observed before by Balog and Radlinski [1].
In future work, we plan to conduct a more comprehensive evaluation, including a larger sample and additional explanation goals and properties to better understand the factors that users perceive most effective in natural language LLM-generated explanations.

Figure 1 :
Figure1: Overview of user preferences elicitation.The user was asked three movies they liked (Name three of your favorite movies.)and three movies they disliked (Name three movies that you really disliked (or hated).).Here, the user answers the question about the liked movies and searches for pirates in the search bar of the disliked movies.

Figure 2 :
Figure 2: Prompts used for generating recommendations and explanation from the OpenAI GPT3.5-Turbo model (a) Evaluation page presented to the user (b) Questionnaire

Figure 3 :
Figure 3: Overview of the evaluation step.
Question: This explanation helps me to determine how well I will like this movie.Question: This explanation is convincing.

Figure 4 :
Figure 4: Questionnaire results for the (four) recommendations based on estimated means from the random intercept multilevel regressions.Responses were given on a 5-point, disagree-agree scale, for which 3 means neutral.Error bars are one standard error of the mean.

Figure 6 :
Figure 6: Path model showing how persuasiveness and personalization of the explanations are affected by the conditions and how they subsequently predict satisfaction and effectiveness.OSA=Objective System Aspect, SSA=Subjective System Aspect, EXP=Experience.The thickness of the line represents the strength of the coefficient.Standard errors in brackets, significance: * p<.05, **, p<.01, ***, p<.001

Figure 7 :
Figure 7: Questionnaire results for the two disrecommendations, based on estimated means from the random intercept multilevel regressions.Responses were given on a 5-point, disagree-agree scale, for which 3 means neutral.Error bars are one standard error of the mean.

Figure 8 :
Figure 8: User-based vs. Generic Explanations for three recommendations obtained from real users.