Large Language Models as Zero-Shot Conversational Recommenders

In this paper, we present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting, with three primary contributions. (1) Data: To gain insights into model behavior in "in-the-wild" conversational recommendation scenarios, we construct a new dataset of recommendation-related conversations by scraping a popular discussion website. This is the largest public real-world conversational recommendation dataset to date. (2) Evaluation: On the new dataset and two existing conversational recommendation datasets, we observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models. (3) Analysis: We propose various probing tasks to investigate the mechanisms behind the remarkable performance of large language models in conversational recommendation. We analyze both the large language models' behaviors and the characteristics of the datasets, providing a holistic understanding of the models' effectiveness and limitations, and suggesting directions for the design of future conversational recommenders.


INTRODUCTION
Conversational recommender systems (CRS) aim to elicit user preferences and offer personalized recommendations by engaging in interactive conversations. In contrast to traditional recommenders that primarily rely on users' actions like clicks or purchases, CRS possess the potential to: (1) understand not only users' historical actions but also users' (multi-turn) natural-language inputs; (2) provide not only recommended items but also human-like responses for multiple purposes such as preference refinement, knowledgeable discussion, or recommendation justification. Towards this objective, a typical conversational recommender contains two components [10,41,64,74]: a generator to generate natural-language responses and a recommender to rank items to meet users' needs.
Recently, significant advancements have shown the remarkable potential of large language models (LLMs), such as ChatGPT [30], in various tasks [4,6,51,71]. This has captured the attention of the recommender systems community, which has begun to explore the possibility of leveraging LLMs in recommendation or more general personalization tasks [3,27,34,48,56]. Yet, current efforts generally concentrate on evaluating LLMs in traditional recommendation settings, where only users' past actions like clicks serve as inputs [3,27,34,48]. The conversational recommendation scenario, though involving more natural language interactions, is still in its infancy [16,63].

Figure 1: Large Language Models (LLMs) as Zero-Shot Conversational Recommenders (CRS). We introduce a simple prompting strategy to define the task description T, format requirement R, and conversation context C for an LLM, denoted as F; we then post-process the generated results into ranked item lists with a processor Φ. The prompt template used is:

"Pretend you are a movie recommender system. I will give you a conversation between a user and you (a recommender system). Based on the conversation, you reply me with 20 recommendations without extra sentences. Here is the conversation: {}"
In this work, we propose to use large language models as zero-shot conversational recommenders and then empirically study the recommendation abilities of LLMs [11,30,51,68]. Our detailed contributions cover three key aspects: data, evaluation, and analysis.
Data. We construct Reddit-Movie, a large-scale conversational recommendation dataset with over 634k naturally occurring recommendation-seeking dialogs from Reddit (https://www.reddit.com), a popular discussion forum. Different from existing crowd-sourced conversational recommendation datasets, such as ReDIAL [41] and INSPIRED [22], where workers role-play users and recommenders, the Reddit-Movie dataset offers a complementary perspective, with conversations where users seek and offer item recommendations in the real world. To the best of our knowledge, this is the largest public conversational recommendation dataset, with 50 times more conversations than ReDIAL.
Evaluation. By evaluating the recommendation performance of LLMs on multiple CRS datasets, we first notice a repeated-item shortcut in current CRS evaluation protocols. Specifically, "repeated items" appear as ground-truth items in previous evaluation testing samples, which allows a trivial baseline (e.g., copying the mentioned items from the current conversation history) to outperform most existing models, leading to spurious conclusions regarding current CRS recommendation abilities. After removing the "repeated items" in training and testing data, we re-evaluate multiple representative conversational recommendation models [10,41,64,74] on ReDIAL, INSPIRED and our Reddit dataset. With this experimental setup, we empirically show that LLMs can outperform existing fine-tuned conversational recommendation models even without fine-tuning.
Analysis. In light of the impressive performance of LLMs as zero-shot CRS, a fundamental question arises: what accounts for their remarkable performance? Similar to the approach taken in [53], we posit that LLMs leverage both content/context knowledge (e.g., "genre", "actors" and "mood") and collaborative knowledge (e.g., "users who like A typically also like B") to make conversational recommendations. We design several probing tasks to uncover the model's workings and the characteristics of the CRS data. Additionally, we present empirical findings that highlight certain limitations of LLMs as zero-shot CRS, despite their effectiveness.
We summarize the key findings of this paper as follows:
• CRS recommendation abilities should be reassessed by eliminating repeated items as ground truth.
• LLMs, as zero-shot conversational recommenders, demonstrate improved performance over fine-tuned CRS models on established and new datasets.
• LLMs primarily use their superior content/context knowledge, rather than their collaborative knowledge, to make recommendations.
• CRS datasets inherently contain a high level of content/context information, making CRS tasks better-suited for LLMs than traditional recommendation tasks.
• LLMs suffer from limitations such as popularity bias and sensitivity to geographical regions.
These findings reveal the unique importance of the superior content/context knowledge in LLMs for CRS tasks, suggesting great potential for LLMs as an effective approach in CRS; meanwhile, future CRS designs with LLMs must recognize the challenges in evaluation, datasets, and potential problems (e.g., debiasing).

LLMS AS ZERO-SHOT CRS

Task Formulation
Given a user set U, an item set I, and a vocabulary V, a conversation can be denoted as C = {(s_j, u_j, I_j)}_{j=1}^{n}. That is, during the j-th turn of the conversation, a speaker s_j ∈ U generates an utterance u_j = (w_1, ..., w_m), a sequence of words w_i ∈ V. This utterance u_j also contains a set of mentioned items I_j ⊂ I (I_j can be an empty set if no items are mentioned). Typically, there are two users in the conversation C, playing the roles of seeker and recommender respectively. Take the 2nd conversation turn in Figure 1 as an example: here j = 2, s_2 is [System], u_2 is "You would love Terminator!" and I_2 is a set containing the movie Terminator. Following many CRS papers [10,41,64,74], the recommender component of a CRS is specifically designed to optimize the following objective: during the j-th turn of a conversation, where s_j is the recommender, the recommender takes the conversational context {(s_k, u_k, I_k)}_{k=1}^{j-1} as its input, and generates a ranked list of items Î_j that best matches the ground-truth items in I_j.
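To make the notation concrete, the conversation structure above can be sketched in code (a minimal illustration; the type and function names are ours, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    speaker: str                 # s_j, e.g. "[User]" or "[System]"
    utterance: str               # u_j, the natural-language text of the turn
    items: set = field(default_factory=set)   # I_j, items mentioned in u_j

def recommender_context(conversation, j):
    """Input to the recommender at turn j: all turns before the j-th."""
    return conversation[: j - 1]

# The 2nd turn of the Figure 1 example: j = 2, s_2 = [System],
# u_2 = "You would love Terminator!", I_2 = {Terminator}.
conv = [
    Turn("[User]", "I need a sci-fi movie.", set()),
    Turn("[System]", "You would love Terminator!", {"Terminator"}),
]
context = recommender_context(conv, 2)   # only the user's first turn
```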

Framework
Prompting. Our goal is to utilize LLMs as zero-shot conversational recommenders. Specifically, without any fine-tuning, we prompt an LLM, denoted as F, using a task description template T, a format requirement R, and the conversational context C before the j-th turn. This process can be formally represented as Î_j = Φ(F(T, R, C)), where Φ is a post-processor that converts the generated text into a ranked item list (see Processing below). To better understand this zero-shot recommender, we present an example in Figure 1 with the prompt setup used in our experiments.
Models. We consider several popular LLMs F that exhibit zero-shot prompting abilities, in two groups. To try to ensure deterministic results, we set the decoding temperature to 0 for all models.
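A minimal sketch of this prompting step, using the template from Figure 1; the helper names and the generic `llm` callable are our illustration, not the paper's implementation (any chat-API wrapper with temperature set to 0 would fit):

```python
PROMPT_TEMPLATE = (
    "Pretend you are a movie recommender system. "
    "I will give you a conversation between a user and you (a recommender "
    "system). Based on the conversation, you reply me with 20 recommendations "
    "without extra sentences. Here is the conversation: {}"
)

def format_context(turns):
    """Render the conversation context C as plain text."""
    return " ".join(f"{speaker}: {utterance}" for speaker, utterance in turns)

def zero_shot_recommend(llm, turns):
    """One query of the framework: prompt the LLM F, then split its reply
    into an ordered list of title strings (the simplest processing)."""
    prompt = PROMPT_TEMPLATE.format(format_context(turns))
    raw = llm(prompt)   # llm: any callable str -> str, decoded at temperature 0
    return [line.split(". ", 1)[-1].strip()
            for line in raw.splitlines() if line.strip()]
```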
Processing. We do not access model weights or output logits of the LLMs. Therefore, we apply a post-processor Φ (e.g., fuzzy matching) to convert a recommendation list in natural language into a ranked item list Î. This approach of generating item titles instead of ranking item IDs is referred to as the generative retrieval [7,60] paradigm.
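One plausible instantiation of Φ is fuzzy string matching against the item catalog; this sketch uses Python's difflib and is our illustration, not the paper's exact matcher:

```python
import difflib

def fuzzy_post_process(generated_titles, catalog, cutoff=0.8):
    """Post-processor Phi: map free-text titles onto the item set I.

    Titles without a sufficiently close catalog entry are dropped, which
    also filters out-of-dataset or hallucinated items."""
    lowered = [c.lower() for c in catalog]
    ranked = []
    for title in generated_titles:
        hits = difflib.get_close_matches(title.lower(), lowered,
                                         n=1, cutoff=cutoff)
        if hits:
            match = catalog[lowered.index(hits[0])]
            if match not in ranked:       # keep the output a ranking, no dups
                ranked.append(match)
    return ranked
```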

DATASET
Ideally, a large-scale dataset with diverse interactions and real-world conversations is needed to evaluate models' ability in conversational recommendation. Existing conversational recommendation datasets are usually crowd-sourced [22,32,41,75] and thus only partially capture realistic conversation dynamics. For example, a crowd worker responded with "Whatever Whatever I'm open to any suggestion." when asked about movie preferences in ReDIAL; this happens since crowd workers often do not have a particular preference at the time of completing a task. In contrast, a real user could have a very particular need, as shown in Figure 2.

Figure 2: Example conversations from an existing CRS dataset (ReDIAL [41]) and our Reddit-Movie dataset. The Reddit-Movie dataset contains more information in its textual content compared to existing datasets, and its users often explicitly specify their preferences. See Section 5.2 for quantitative analysis.
To complement crowd-sourced CRS datasets, we present the Reddit-Movie dataset, the largest-scale conversational movie recommendation dataset to date, with naturally occurring movie recommendation conversations that can be used alongside existing crowd-sourced datasets to provide richer perspectives for training and evaluating CRS models. In this work, we conduct our model evaluation and analysis on two commonly used crowd-sourced datasets, ReDIAL [41] and INSPIRED [22], as well as our newly collected Reddit dataset. We show qualitative examples from the Reddit dataset in Figure 2 and quantitative analysis in Section 5.2.

Dataset Construction
To construct a CRS dataset from Reddit, we process all Reddit posts from January 2012 to December 2022 from pushshift.io. We consider movie recommendation scenarios and extract related posts from five related subreddits: r/movies, r/bestofnetflix, r/moviesuggestions, r/netflixbestof and r/truefilm. We process the raw data with a pipeline of conversational recommendation identification, movie mention recognition, and movie entity linking. In the following evaluation, we use the most recent 9k conversations in Reddit-Movie-base, from December 2022, as the testing set, since these samples occur after GPT-3.5-t's release; GPT-4 [51] also reports a pre-training data cutoff of September 2021. For the other compared models, we use the remaining 76k conversations in the Reddit-Movie-base dataset for training and validation. As shown in Figure 2, we find that Reddit-Movie conversations tend to include more complex and detailed user preferences than ReDIAL, as they originate from real-world conversations on Reddit, enriching conversational recommendation datasets with a diverse range of discussions.

EVALUATION
In this section, we evaluate the proposed LLM-based framework on ReDIAL [41], INSPIRED [22] and our Reddit dataset. We first explain the evaluation setup and a repeated-item shortcut of the previous evaluation in Sections 4.1 and 4.2. Then, we re-train models and discuss LLM performance in Section 4.3.

Evaluation Setup
Repeated vs. New Items. Given a conversation C = {(s_j, u_j, I_j)}_{j=1}^{n}, it is challenging to identify the ground-truth recommended items, i.e., whether the mentioned items I_j at the j-th (j ≤ n) turn are used for recommendation purposes. A common evaluation setup assumes that when s_j is the recommender, all items i ∈ I_j serve as ground-truth recommended items.
In this work, we further split the items i ∈ I_j into two categories: repeated items and new items. Repeated items are items that have appeared in previous conversation turns, i.e., {i | ∃k ∈ [1, j), i ∈ I_k}; new items are items not mentioned in previous conversation turns. We explain the details of this categorization in Section 4.2.
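The categorization can be expressed directly over the per-turn item sets I_1, ..., I_n (a small sketch in our own notation):

```python
def split_repeated_new(item_sets, j):
    """Split the items mentioned at turn j (1-indexed) into repeated items
    (already mentioned in turns 1..j-1) and new items.

    item_sets: list of per-turn mention sets [I_1, ..., I_n]."""
    seen = set()
    for previous in item_sets[: j - 1]:
        seen |= previous
    current = item_sets[j - 1]
    return current & seen, current - seen   # (repeated, new)
```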
Evaluation Protocol. On these three datasets, we evaluate several representative CRS models and several LLMs on their recommendation abilities. For baselines, after re-running the training code provided by the authors, we report the prediction performance using Recall@K [10,41,64,74] (i.e., HIT@K). We consider the means and the standard errors of the metric with K ∈ {1, 5}; we show standard errors as error bars in our figures and as gray numbers in our tables.
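For reference, Recall@K and the reported mean and standard error can be computed as follows (a straightforward sketch; with a single ground-truth item per turn, Recall@K coincides with HIT@K):

```python
from math import sqrt

def recall_at_k(ranked, ground_truth, k):
    """Recall@K for one test sample: fraction of ground-truth items that
    appear in the top-k positions of the ranked list."""
    if not ground_truth:
        return 0.0
    return len(set(ranked[:k]) & set(ground_truth)) / len(set(ground_truth))

def mean_and_stderr(scores):
    """Mean and standard error of per-sample metric values."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1) if n > 1 else 0.0
    return mean, sqrt(var / n)
```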
Compared CRS Models. We consider several representative CRS models. For baselines that rely on structured knowledge, we use the entity linking results of the ReDIAL and INSPIRED datasets provided by UniCRS [64]. Note that we do not include further works [43,50,54] because UniCRS [64] is representative, with similar results.
• ReDIAL [41]: This model is released along with the ReDIAL dataset and uses an autoencoder [58]-based recommender.
• KBRD [10]: This model proposes to use DBpedia [1] to enhance the semantic knowledge of items and entities.
• KGSF [74]: This model incorporates two knowledge graphs to enhance the representations of words and entities, and uses Mutual Information Maximization to align the semantic spaces of the two knowledge graphs.
• UniCRS [64]: This model uses a pre-trained language model, DialoGPT [69], with prompt tuning to conduct the recommendation and conversation generation tasks respectively.

Repeated Items Can Be Shortcuts
Current evaluation for conversational recommender systems does not differentiate between repeated and new items in a conversation. We observe that this evaluation scheme favors systems that optimize for mentioning repeated items. As shown in Figure 3, a trivial baseline that always copies seen items from the conversation history performs better than most previous models under the standard evaluation scheme. This phenomenon highlights the risk of shortcut learning [18], where a decision rule performs well against certain benchmarks and evaluations but fails to capture the true intent of the system designer. Indeed, HIT@1 for the tested models dropped by more than 60% on average when we focus on new item recommendation only, which is not apparent from the overall recommendation performance. After manual inspection, we observe a typical pattern of repeated items, shown in the example conversation in Figure 1. In this conversation, Terminator at the 6th turn is used as the ground-truth item. The system repeats Terminator because it quoted this movie for a content-based discussion during the conversation, rather than to make a recommendation. Given the nature of recommendation conversations between two users, it is more probable that items repeated during a conversation are intended for discussion rather than recommendation.
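The trivial baseline above can be sketched as follows (our illustration of "copying seen items"; the paper does not specify an ordering, so we rank the most recently mentioned items first):

```python
def copy_seen_baseline(item_sets, j, k=5):
    """Trivial baseline: recommend items already mentioned before turn j,
    most recent mention first, truncated to the top k.

    item_sets: list of per-turn mention sets [I_1, ..., I_n]."""
    ranked = []
    for turn_items in reversed(item_sets[: j - 1]):
        for item in sorted(turn_items):   # sort for a deterministic order
            if item not in ranked:
                ranked.append(item)
    return ranked[:k]
```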

LLMs Performance
Finding 1 -LLMs outperform fine-tuned CRS models in a zero-shot setting. For a fair comparison of models' abilities to recommend new items to the user in conversation, we re-train existing CRS models on all datasets for new item recommendation only.
The evaluation results are shown in Figure 4. Large language models, although not fine-tuned, have the best performance on all datasets. Meanwhile, the performance of all models is uniformly lower on Reddit than on the other datasets, potentially due to the larger number of items and fewer conversation turns, making recommendation more challenging.
Finding 2 -GPT-based models achieve superior performance to open-sourced LLMs. As shown in Figure 4, large language models consistently outperform other models across all three datasets, and GPT-4 is generally better than GPT-3.5-t. We hypothesize that GPT-4's larger parameter size enables it to retain more of the correlations between movie names and user preferences that naturally occur in the pre-training data. Vicuna and BAIZE, while comparable to prior models on most datasets, perform significantly worse than their teacher, GPT-3.5-t. This is consistent with previous findings that smaller models distilled via imitation learning cannot fully inherit larger models' abilities on downstream tasks [20].
Finding 3 -LLMs may generate out-of-dataset item titles, but few hallucinated recommendations. We note that language models trained on open-domain data naturally produce items outside the allowed item set during generation. In practice, removing these items improves the models' recommendation performance. Large language models outperform other models (with GPT-4 being the best) consistently, regardless of whether these unknown items are removed, as shown in Table 2. Meanwhile, Table 3 shows that around 95% of generated recommendations from GPT-based models (around 81% from BAIZE and 87% from Vicuna) can be found in IMDb by string matching. These matching rates are lower bounds, indicating that there are only a few hallucinated item titles in the LLM recommendations in the movie domain.

DETAILED ANALYSIS
Observing LLMs' remarkable zero-shot conversational recommendation performance, we are interested in what accounts for their effectiveness and what their limitations are. We aim to answer these questions from both a model and a data perspective.

Knowledge in LLMs
Experiment Setup. Motivated by the probing work of [53], we posit that two types of knowledge in LLMs can be used in CRS:
• Collaborative knowledge, which requires the model to match items with similar ones, according to community interactions like "users who like A typically also like B". In our experiments, we define the collaborative knowledge in LLMs as the ability to make accurate recommendations using item mentions in conversational contexts.
• Content/context knowledge, which requires the model to match recommended items with their content or context information. In our experiments, we define the content/context knowledge in LLMs as the ability to make accurate recommendations based on all other conversation inputs rather than item mentions, such as contextual descriptions, mentioned genres, and director names.
To understand how LLMs use these two types of knowledge, given the original conversation context C (example in Figure 1), we perturb C with three different strategies as follows and subsequently re-query the LLMs:
• S0 (Original): we use the original conversation context.
• S1 (ItemOnly): we keep mentioned items and remove all natural language descriptions in the conversation context.
• S2 (ItemRemoved): we remove mentioned items and keep other content in the conversation context.
• S3 (ItemRandom): we replace the mentioned items in the conversation context with items uniformly sampled from the item set I of the dataset, to eliminate the potential influence of S2 on the sentence grammar structure.

Table 4: To understand the content/context knowledge in LLMs and existing CRS models, we re-train the existing CRS models using the same perturbed conversation context ItemRemoved (S2). We include the results of the representative CRS model UniCRS (denoted as CRS*) as well as a representative text encoder, BERT-small [15] (denoted as TextEnc*).
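The four perturbation strategies can be sketched as simple text operations (our illustration; item mentions are matched here by exact title strings, whereas the paper relies on entity links):

```python
import random
import re

def perturb_context(utterances, mentions, item_set, strategy, seed=0):
    """Apply one of the strategies S0-S3 to a list of turn strings.

    mentions: item titles appearing in the text; item_set: the catalog I."""
    if strategy == "Original" or not mentions:          # S0
        return list(utterances)
    if strategy == "ItemOnly":                          # S1: keep only items
        return [" ".join(m for m in mentions if m in u) for u in utterances]
    pattern = "|".join(re.escape(m) for m in mentions)
    if strategy == "ItemRemoved":                       # S2: strip items
        return [re.sub(pattern, "", u) for u in utterances]
    if strategy == "ItemRandom":                        # S3: random swap-in
        rng = random.Random(seed)
        return [re.sub(pattern, lambda _: rng.choice(item_set), u)
                for u in utterances]
    raise ValueError(f"unknown strategy: {strategy}")
```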
Finding 4 -LLMs mainly rely on content/context knowledge to make recommendations. Figure 5 shows a drop in performance for most models across various datasets when replacing the original conversation text Original (S0) with other texts, indicating that LLMs leverage both content/context knowledge and collaborative knowledge in recommendation tasks. However, the importance of these knowledge types differs. Our analysis reveals that content/context knowledge is the primary knowledge utilized by LLMs in CRS. When using ItemOnly (S1) as a replacement for Original, there is an average performance drop of more than 60% in terms of Recall@5. On the other hand, GPT-based models experience only a minor performance drop of less than 10% on average when using ItemRemoved (S2) or ItemRandom (S3) instead of Original. Although the smaller model Vicuna shows a larger performance drop, it is still considerably milder than with ItemOnly.
To accurately reflect the recommendation abilities of LLMs with ItemRemoved and ItemRandom, we introduce a new post-processor, denoted as Φ2 (described in the caption of Figure 5). By employing Φ2, the performance gaps between Original and ItemRemoved (or ItemRandom) are further reduced. Furthermore, Figure 6 demonstrates the consistently close performance gap between Original and ItemRemoved (or ItemRandom) across testing samples that vary in size and in the number of item mentions in Original.

Table 5: To understand the collaborative knowledge in LLMs and existing CRS models, we re-train the existing CRS models using the same perturbed conversation context ItemOnly (S1). We include the results of the representative CRS model UniCRS (denoted as CRS*) as well as a representative item-based collaborative model, FISM [31] (denoted as ItemCF*).
Finding 5 -GPT-based LLMs possess better content/context knowledge than existing CRS. From Table 4, we observe the superior recommendation performance of GPT-based LLMs against representative conversational recommendation or text-only models on all datasets, showing remarkable zero-shot abilities in understanding user preferences from textual inputs and generating correct item titles. We conclude that GPT-based LLMs can provide more accurate recommendations than existing trained CRS models in an ItemRemoved (S2) setting, demonstrating better content/context knowledge.
Finding 6 -LLMs generally possess weaker collaborative knowledge than existing CRS. In Table 5, the results on INSPIRED and ReDIAL indicate that LLMs underperform existing representative CRS or ItemCF models by 30% when using only the item-based conversation context ItemOnly (S1). This indicates that LLMs, trained on a general corpus, typically lack the collaborative knowledge exhibited by representative models trained on the target dataset. There are several possible reasons for this weak collaborative knowledge in LLMs. First, the training corpus may not contain sufficient information for LLMs to learn the underlying item similarities. Second, although LLMs may possess some collaborative knowledge, it might not align with the interactions in the target datasets, possibly because the underlying item similarities can be highly dataset- or platform-dependent.
However, in the case of the Reddit dataset, LLMs outperform baselines in both Recall@1 and Recall@5, as shown in Table 5.This outcome could be attributed to the dataset's large number of rarely interacted items, resulting in limited collaborative information.The Reddit dataset contains 12,982 items with no more than 3 mentions as responses.This poses a challenge in correctly ranking these items within the Top-5 or even Top-1 positions.LLMs, which possess at least some understanding of the semantics in item titles, have the chance to outperform baselines trained on datasets containing a large number of cold-start items.
Recent research on LLMs in traditional recommendation systems [27,34,48] also observes the challenge of effectively leveraging collaborative information without knowing the target interaction data distribution.Additionally, another study [3] on traditional recommendation systems suggests that LLMs are beneficial in a setting with many cold-start items.Our experimental results support these findings within the context of conversational recommendations.

Information from CRS Data
Experimental Setup for Finding 7. To understand LLMs in CRS tasks from the data perspective, we first measure the content/context information in CRS datasets. Content/context information refers to the amount of information contained in conversations, excluding item titles, which reasonably challenges existing CRS and favors LLMs according to the findings in Section 5.1. Specifically, we conduct an entropy-based evaluation for each CRS dataset and compare the conversational datasets with several popular conversation and question-answering datasets, namely DailyDialog (chit-chat) [45], MsMarco (conversational search) [2], and HotpotQA (question answering). We use ItemRemoved (S2) conversation texts as in Section 5.1, and adopt the geometric mean of the entropies of the 1-, 2-, and 3-gram distributions as a surrogate for the amount of information contained in the datasets, following previous work on evaluating information content in text [29]. However, entropy naturally grows with the size of a corpus, and each CRS dataset has a different distribution of words per sentence, sentences per dialog, and corpus size. Thus, it would be unfair to compare entropy between corpora on a per-dialog, per-turn, or per-dataset basis. To ensure a fair comparison, we repeatedly draw increasingly large subsets of texts from each dataset, compute the entropy of these subsets, and report the trend of entropy growth with respect to the size of the subsampled text for each CRS dataset.
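The entropy surrogate can be sketched as follows (a simplified illustration; the paper's exact tokenization and subsampling schedule are not reproduced here):

```python
import math
from collections import Counter

def ngram_entropy(tokens, n):
    """Shannon entropy (bits) of the n-gram distribution of a token list."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_score(tokens):
    """Geometric mean of the 1-, 2-, and 3-gram entropies, used as a
    surrogate for the content/context information in a text sample."""
    entropies = [ngram_entropy(tokens, n) for n in (1, 2, 3)]
    return math.prod(entropies) ** (1 / 3)
```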
Finding 7 -Reddit provides more content/context information than the other two CRS datasets. Based on the results in Figure 7a, we observe that the Reddit dataset has the most content/context information among the three conversational recommendation datasets. These observations also align with the results in Figure 5 and Table 4, where LLMs, which possess better content/context knowledge than baselines, achieve higher relative improvements on Reddit than on the other two datasets. Meanwhile, the content/context information in Reddit is close to that of question answering and conversational search, and higher than that of existing conversational recommendation and chit-chat datasets.
Finding 8 -Collaborative information is insufficient for satisfactory recommendations, given the current models. Quantifying the collaborative information in datasets is challenging. Instead of proposing methods to measure collaborative information, we make new observations based on the general performance results presented in Figure 4 and the recommendation results using only collaborative information in Table 5. Comparing the performance of the best models in Table 5 under an ItemOnly (S1) setting with the performance of the best models in Figure 4 under an Original (S0) setting reveals a significant disparity. For instance, on ReDIAL, the Recall@1 performance is 0.029 for ItemCF* compared to 0.046 for GPT-4, a 36.96% decrease. Similarly, on Reddit, the Recall@1 performance is 0.007 compared to 0.023 for GPT-4, which is 69.57% lower. We also experimented with other recommender systems, such as transformer-based models [33,59], to encode the item-only inputs, and found similar results. Given the current performance gap, we find that existing models relying solely on collaborative information are insufficient to provide satisfactory recommendations. We speculate that either (1) more advanced models or training methods are required to better comprehend the collaborative information in CRS datasets, or (2) the collaborative information in CRS datasets is too limited to support satisfactory recommendations.

Experimental Setup for Finding 9. To understand whether the collaborative information from CRS datasets aligns with pure interaction datasets, we conduct an experiment on the Reddit dataset.
In this experiment, we first process the dataset to link its items to a popular interaction dataset, ML-25M [21]. We then experiment with two representative encoders for item-based collaborative filtering, based on FISM [31] and the Transformer [59] (TRM), respectively. We report testing results on Reddit with fine-tuning on Reddit (FT), pre-training on ML-25M (PT), and pre-training on ML-25M followed by fine-tuning on Reddit (PT+FT). Note that since this is a linked dataset with additional processing, the results are not comparable with the aforementioned results on Reddit.
Finding 9 -Collaborative information can be dataset- or platform-dependent. Figure 7b shows that the models solely pre-trained on ML-25M (PT) outperform a random baseline, indicating that the data in CRS may share item similarities with pure interaction data from another platform to some extent. However, Figure 7b also shows a notable performance gap between PT and fine-tuning on Reddit (FT). Additionally, we do not observe further performance improvement when pre-training on ML-25M and then fine-tuning on Reddit (PT+FT). These observations indicate that the collaborative information and underlying item similarities, even over the same items, can be largely influenced by the specific dataset or platform. This finding may also partially explain the inferior zero-shot recommendation performance of LLMs in Table 5, and suggests the necessity of checking the alignment of collaborative knowledge in LLMs with the target datasets.

Limitations of LLMs as Zero-shot CRS
Finding 10 -LLM recommendations suffer from popularity bias in CRS. Popularity bias refers to the phenomenon that popular items are recommended even more frequently than their popularity would warrant [8]. Figure 8 shows the popularity bias in LLM recommendations, though it is not necessarily biased toward the popular items of the target datasets. On ReDIAL, the most popular movies, such as Avengers: Infinity War, appear around 2% of the time over all ground-truth items; on Reddit, the most popular movies, such as Everything Everywhere All at Once, appear less than 0.3% of the time over ground-truth items. But in the generated recommendations from GPT-4 (other LLMs share a similar trend), the most popular items, such as The Shawshank Redemption, appear around 5% of the time on ReDIAL and around 1.5% of the time on Reddit. Compared to the target datasets, LLM recommendations are more concentrated on popular items, which may cause further issues like the bias amplification loop [8]. Moreover, the recommended popular items are similar across different datasets, which may reflect item popularity in the pre-training corpus of LLMs.
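The concentration comparison above can be probed with a simple frequency share (our own diagnostic, not the paper's exact metric): compute how large a fraction of all recommendations, or of all ground-truth mentions, the most frequent items account for, and compare the two values.

```python
from collections import Counter

def top_item_share(recommendation_lists, top=1):
    """Fraction of all recommendations taken up by the `top` most frequent
    items. A much larger value for model outputs than for ground-truth
    mentions indicates concentration on popular items."""
    counts = Counter(item for recs in recommendation_lists for item in recs)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(top)) / total
```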
Finding 11 -Recommendation performance of LLMs is sensitive to geographical regions. Despite their effectiveness in general, it is unclear whether LLMs can be good recommenders across various cultures and regions. Pre-trained language models' strong open-domain ability can be attributed to pre-training on massive data [5], but this also makes LLMs sensitive to the data distribution. To investigate LLMs' recommendation abilities for various regions, we take test instances from the Reddit dataset, obtain the production region of 7,476 movies from a publicly available movie dataset by exact title matching, and then report Recall@1 for the linked movies grouped by region. We only report regions with more than 300 data points to ensure enough data to support the result. As shown in Figure 9, the current best model, GPT-4, performs better on movies produced in English-speaking regions. This could be due to bias in the training data: the left panel of Figure 9 shows that items on Reddit forums are dominated by movies from English-speaking regions. This result highlights that large language models' recommendation performance varies by region and culture, and demonstrates the importance of cross-regional analysis and evaluation for language-model-based conversational recommendation models.

RELATED WORK
Conversational Recommendation. Conversational recommender systems (CRS) aim to understand user preferences and provide personalized recommendations through conversations. Typical traditional CRS setups include template-based CRS [13,26,37,38,70] and critiquing-based CRS [9,42,67]. More recently, as natural language processing has advanced, the community developed "deep" CRS [10,41,64] that support interactions in natural language. Aside from collaborative filtering signals, prior work shows that CRS models benefit from various additional information. Examples include knowledge-enhanced models [10,74] that make use of external knowledge bases [1,47], review-aware models [49], and session/sequence-based models [43,76]. Presently, UniCRS [64], a model built on DialoGPT [69] with prompt tuning [4], stands as the state-of-the-art approach on CRS datasets such as ReDIAL [41] and INSPIRED [22]. More recently, leveraging LLMs, [16] proposes a new CRS pipeline but does not provide quantitative results, and [63] proposes better user simulators to improve evaluation strategies for LLM-based CRS. Unlike those papers, we uncover a repeated-item shortcut in the previous evaluation protocol, and propose a framework where LLMs serve as zero-shot CRS, with detailed analyses supporting our findings from both model and data perspectives.
Large Language Models. Existing work reveals that language models' performance and sample efficiency on downstream tasks can be improved simply by scaling up their parameter counts [35]. Meanwhile, language models can further generalize to a wide range of unseen tasks via instruction tuning, i.e., learning to follow task instructions in natural language [52,57]. Following these advances, many works successfully deploy large language models on a wide range of downstream tasks such as question answering, numerical reasoning, code generation, and commonsense reasoning without any gradient updates [5,35,44,72]. Recently, the recommendation community has made various attempts to leverage large language models for recommendation, including both adapting architectures used by large language models [14,19] and repurposing existing LLMs for recommendation [39,48,62]. However, to the best of our knowledge, we are the first to provide a systematic quantitative analysis of LLMs' abilities in conversational recommendation.

CONCLUSION AND DISCUSSION
We investigate Large Language Models (LLMs) as zero-shot Conversational Recommendation Systems (CRS). Through our empirical investigation, we first address a repetition shortcut in previous standard CRS evaluations, which can lead to unreliable conclusions regarding model design. We then demonstrate that LLMs as zero-shot CRS surpass all fine-tuned CRS models in our experiments. Motivated by their effectiveness, we conduct a comprehensive analysis from both the model and data perspectives to gain insights into the working mechanisms of LLMs, the characteristics of typical CRS tasks, and the limitations of using LLMs as CRS directly. Our experimental evaluations encompass two publicly available datasets, supplemented by our newly created movie-recommendation dataset collected by scraping a popular discussion website. This dataset is the largest public CRS dataset to date and provides more diverse and realistic conversations for CRS research. Below, we discuss future directions based on our findings.
On LLMs. Given their remarkable performance even without fine-tuning, LLMs hold great promise as an effective approach for CRS tasks by offering superior content/contextual knowledge. The encouraging performance of open-sourced LLMs [11,68] also opens up opportunities to further improve CRS performance via efficient tuning [3,28] and ensembling with collaborative filtering [36]. Meanwhile, many conventional tasks, such as debiasing [8] and trustworthiness [17], need to be revisited in the context of LLMs.
On CRS. Our findings call for a systematic re-benchmarking of more CRS models to comprehensively understand their recommendation abilities and the characteristics of CRS tasks. Gaining a deeper understanding of CRS tasks also requires new datasets from diverse sources, e.g., crowd-sourcing platforms [22,41], discussion forums, and realistic CRS applications spanning various domains, languages, and cultures. Meanwhile, our analysis of information types uncovers the unique importance of the superior content/context knowledge in LLMs for CRS tasks; this distinction also sets CRS tasks apart from traditional recommendation settings and urges us to explore the interconnections between CRS tasks and traditional recommendation [21] or conversational search [2] tasks.
[User]: I love Back to the Future, any recommendations? [System]: You would love Terminator! [User]: Who is in it? [System]: Arnold Schwarzenegger! [User]: Did they make a new Terminator?

Figure 2: Typical model inputs from a traditional recommendation dataset (MovieLens [21]), an existing CRS dataset (ReDIAL [41]), and our Reddit-Movie dataset. The Reddit-Movie dataset contains more information in its textual content compared to existing datasets, where users often explicitly specify their preferences. See Section 5.2 for quantitative analysis.

Figure 3: To show the repeated-item shortcut, we count CRS recommendation hits using Top-K ranked lists with K ∈ {1, 5}. We group the ground-truth hits into repeated items (shaded bars) and new items (unshaded bars). The trivial baseline copies existing items from the current conversation history in reverse chronological order, starting from the most recent, and does not recommend new items.

Figure 6: GPT-3.5-turbo Recall@5 results grouped by the number of occurrences of the ground-truth item in the conversation context, along with the corresponding conversation counts per dataset.

Figure 7: The left subfigure shows the entropy of the frequency distributions of 1-, 2-, and 3-grams with respect to the number of words drawn from each dataset (item names excluded), measuring the content/context information across datasets. The right subfigure shows results on the processed Reddit collaborative-filtering dataset aligned to ML-25M [21]. RAND denotes a random baseline, FT denotes fine-tuning on Reddit, PT denotes pre-training on ML-25M, and PT+FT denotes FT after PT.

Figure 8: Scatter plots of the frequencies of LLM-generated (GPT-4) recommendations and of ground-truth items.

Figure 9: Ground-truth item counts in Reddit by country (in log scale) and the corresponding Recall@1 by country.

Table 1: Dataset statistics. We denote the subset of Reddit-Movie from 2022 as base, and the entire ten-year dataset as large.

From Table 1, we observe: (1) The Reddit-Movie dataset stands out as the largest conversational recommendation dataset, encompassing 634,392 conversations and covering 51,203 movies. (2) In comparison to ReDIAL [41] and INSPIRED [22], Reddit-Movie contains fewer multi-turn conversations, mainly due to the inherent characteristics of Reddit posts. (3) As the representative examples depicted in Figure 2 show, Reddit-Movie conversations carry richer textual content than those in existing datasets.

Table 3: Fraction of Top-K (K = 20 in our prompt setup) recommendations (#rec) that can be string-matched in the IMDB movie database (%imdb) for the different models, which gives a lower bound on the fraction of non-hallucinated movie titles.
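The string-matching lower bound in Table 3 can be sketched as below. This is an illustrative version: the function name and the normalization (case-folding and whitespace stripping) are our assumptions, and stricter or looser matching would shift the bound.

```python
def matched_fraction(recommended, imdb_titles):
    """Lower-bound estimate of non-hallucinated titles: the fraction of
    recommended titles that exactly match (after normalization) an entry
    in a reference movie database. Titles that fail to match may still be
    real movies with variant spellings, hence a lower bound."""
    norm = lambda s: s.strip().lower()
    db = {norm(t) for t in imdb_titles}
    return sum(norm(t) in db for t in recommended) / len(recommended)
```

Because exact matching misses legitimate variant titles, the reported %imdb understates rather than overstates how many generated titles are real.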
Figure 5: Ablation studies for the research question about the primary knowledge used by LLMs for CRS. Here Φ1 is the post-processor that only considers in-dataset item titles; Φ2 is the post-processor based on Φ1 that further excludes all items already seen in the conversational context from the generated recommendation lists. For inputs like Original and ItemOnly, LLMs show similar performance with Φ1 or Φ2, so we only keep Φ1 here. We consider Φ2 because ItemRemoved and ItemRandom carry no information about already-mentioned items, which may cause under-estimated accuracy with Φ1 compared to Original.
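The two post-processors can be sketched as simple list filters. This is a minimal illustration under the caption's definitions; the function names `phi1`/`phi2` and the set-based signatures are our assumptions, not the paper's code.

```python
def phi1(generated, in_dataset_titles):
    """Post-processor Φ1: keep only generated titles that exist in the
    dataset's item set, preserving the generated ranking order."""
    return [t for t in generated if t in in_dataset_titles]

def phi2(generated, in_dataset_titles, seen_in_context):
    """Post-processor Φ2: apply Φ1, then additionally drop items already
    mentioned in the conversational context."""
    return [t for t in phi1(generated, in_dataset_titles)
            if t not in seen_in_context]
```

Applying Φ2 to the item-ablated inputs avoids penalizing the model for re-recommending items it was never shown, which is the under-estimation the caption warns about.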