PrefRec: Recommender Systems with Human Preferences for Reinforcing Long-term User Engagement

Current advances in recommender systems have been remarkably successful in optimizing immediate engagement. However, long-term user engagement, a more desirable performance metric, remains difficult to improve. Meanwhile, recent reinforcement learning (RL) algorithms have shown their effectiveness in a variety of long-term goal optimization tasks. For this reason, RL is widely considered a promising framework for optimizing long-term user engagement in recommendation. Though promising, the application of RL heavily relies on well-designed rewards, and designing rewards related to long-term user engagement is quite difficult. To mitigate the problem, we propose a novel paradigm, recommender systems with human preferences (or Preference-based Recommender systems, PrefRec), which allows RL recommender systems to learn from preferences about users' historical behaviors rather than explicitly defined rewards. Such preferences are easily accessible through techniques such as crowdsourcing, as they do not require any expert knowledge. With PrefRec, we can fully exploit the advantages of RL in optimizing long-term goals while avoiding complex reward engineering. PrefRec uses the preferences to automatically train a reward function in an end-to-end manner. The reward function is then used to generate learning signals to train the recommendation policy. Furthermore, we design an effective optimization method for PrefRec, which uses an additional value function, expectile regression and reward model pre-training to improve the performance. We conduct experiments on a variety of long-term user engagement optimization tasks. The results show that PrefRec significantly outperforms previous state-of-the-art methods in all the tasks.


INTRODUCTION
Recent recommendation systems have achieved great success in optimizing immediate engagement such as click-through rates [19,35]. However, in real-life applications, long-term user engagement is more desirable than immediate engagement because it directly affects some important operational metrics, e.g., daily active users (DAUs) and dwell time [44]. Despite its great importance, how to effectively optimize long-term user engagement remains a significant challenge for existing recommendation algorithms. The difficulties mainly lie in: i) the evolution of long-term user engagement lasts for a long period; ii) factors that affect long-term user engagement are usually non-quantitative, e.g., users' satisfaction; and iii) the learning signals used to update recommendation strategies are sparse, delayed, and stochastic. When trying to optimize long-term user engagement, it is very hard to relate changes in long-term user engagement to a single recommendation [41]. Moreover, the sparsity in observing the evolution of long-term user engagement makes the problem even more difficult.
Reinforcement learning (RL) has demonstrated its effectiveness in a wide range of long-term goal optimization tasks, such as board games [33,34], video games [11,39], robotics [25] and algorithmic discovery [13]. Conventionally, when trying to solve a real-world problem with reinforcement learning, we need to first formulate it as a Markov Decision Process (MDP) and then learn an optimal policy that maximizes cumulative rewards or some other user-defined reinforcement signals in the defined MDP [7,36]. Because RL seeks to optimize cumulative rewards, it is well suited to optimizing long-term signals, such as user stickiness, in recommendation [41,53]. As in Figure 1, a recommender system can be modeled as an agent that interacts with users, who serve as the environment. Each time the agent completes a recommendation request, we can record the feedback and status changes of users to compute a reward as well as the new state for the agent. Applying RL thus leads to a recommendation policy that optimizes user engagement from a long-term perspective.
Although RL is an emerging and promising framework to optimize long-term engagement in recommendation, it heavily relies on a delicately designed reward function to incentivize recommender systems to behave properly. However, designing an appropriate reward function is very difficult, especially in large-scale complex tasks like recommendation [5,6,10,41]. On one hand, the reward function should be aligned with our ultimate goal as much as possible. On the other hand, rewards should be sufficiently dense and instructive to provide step-by-step guidance to the agent. For immediate engagement, we can simply use metrics such as click-through rates to generate rewards [48,51]. Whereas for long-term engagement, the problem becomes rather difficult because attributing contributions to long-term engagement to each step is really tough. If we only assign rewards when there is a significant change in long-term engagement, the learning signals could be too sparse for the agent to learn a policy. Existing RL recommender systems typically define the reward function empirically [10,44,53] or use short-term signals as surrogates [41], which severely violates the aforementioned requirements of consistency and instructivity.
To mitigate the problem, we propose a new training paradigm, recommender systems with human preferences (or Preference-based Recommender systems, PrefRec), which allows RL recommender systems to learn from human feedback (preferences) on users' historical behaviors rather than explicitly defined rewards. We demonstrate that RL from human preferences (or preference-based RL), a framework that has led to successful applications such as ChatGPT [29], is also applicable to recommender systems. Specifically, in PrefRec, there is a (virtual) teacher giving feedback about his/her preferences on pairs of users' behaviors. We use the feedback (stored in a preference buffer) to automatically train a reward model which generates learning signals for recommender systems. Such preferences are easy to obtain because no expert knowledge is required, and we can use technologies such as crowdsourcing to easily gather a large amount of labeled data. Furthermore, to overcome the problem that the reward model may not work well for some unseen actions, we introduce a separate value function, trained by expectile regression, to assist the training of the critic in PrefRec (PrefRec adopts the actor-critic framework in reinforcement learning). Our main contributions are threefold:

PRELIMINARIES

2.1 Long-term User Engagement in Recommendation
Long-term user engagement is an important metric in recommendation and is typically reflected in the stickiness of users to a product. In general, given a product, we expect users to spend more time on it and/or use it as frequently as possible. In this work, we assume that users interact with the recommender system on a session basis: when a user accesses a product, such as an App, a session begins, and it ends when the user leaves. During each session, users can launch an arbitrary number of recommendation requests as they want. Such session-based recommender systems have been widely deployed in real-life applications such as short-form video recommendation and news recommendation [40,45,50,51]. We are particularly interested in increasing i) the number of recommended items that users consume during each visit; and ii) the frequency with which users visit the product.
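As a concrete sketch of this session-request structure, a user's interaction history can be represented as below. The field names are illustrative, not the paper's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Request:
    """A single recommendation request within a session."""
    state: List[float]   # user state features at request time
    action: List[float]  # representation of the recommended item

@dataclass
class Session:
    """One visit to the product: starts when the user arrives, ends when they leave."""
    start_ts: float
    end_ts: float
    requests: List[Request] = field(default_factory=list)

    @property
    def depth(self) -> int:
        # Session depth = number of recommendation requests in this visit.
        return len(self.requests)

def revisit_time(prev: Session, nxt: Session) -> float:
    """Gap between leaving one session and starting the next (shorter = more frequent visits)."""
    return nxt.start_ts - prev.end_ts
```

A user's history is then simply a list of `Session` objects, from which both target indicators (session depth and visiting frequency) can be read off.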
Optimizing these two indicators is nontrivial because it is difficult to relate them to a single recommendation. For example, if a user increases their visiting frequency, we cannot know exactly which recommendation led to the increase. To this end, we propose to use reinforcement learning to take into account the potential impact on the future when making decisions.

Recommendation as a Markov Decision Process (MDP)
Applying RL to recommender systems requires defining recommendation as a Markov Decision Process (MDP). Recommender systems can be described as an agent that interacts with users, who act as the environment. Formally, we formulate recommendation as an MDP ⟨S, A, P, R, γ⟩:
• S is the continuous state space. A state s ∈ S describes a user and contains static information, such as gender and age, and dynamic information, such as the rate of likes and retweets. A state is what the recommender system relies on to make decisions.
• A is the continuous action space, where a ∈ A is an action which has the same dimension as the representation of recommendation items. We determine the item to recommend by comparing the similarity between an action and item representations [44,48].
We seek to optimize the policy π(a|s) so that the return obtained by the recommendation agent is maximized:

    max_π J(π) = E_{s∼d_t^π(·), a∼π(·|s)} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ],    (1)

where d_t^π(·) denotes the state visitation frequency at step t under the policy π. By optimizing the above objective, the agent can achieve the largest cumulative return in the defined MDP.
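The cumulative return being maximized can be illustrated with a minimal sketch. This is the standard discounted-return computation, not code from the paper:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted return sum_t gamma^t * r_t for one trajectory.

    Computed backwards so each step's return is r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

The RL agent is trained so that, in expectation over its own trajectories, this quantity is as large as possible.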

CHALLENGES OF DESIGNING THE REWARD FUNCTION
Although RL is a promising approach to optimize long-term user engagement, its application heavily relies on a well-designed reward function. The reward function must be able to reflect the changes in long-term engagement at each time step. In the meantime, it should be able to provide instructive guidance to the agent for optimizing the policy. Practically, quantifying rewards properly is very challenging because it is really difficult to relate changes in long-term engagement to a single recommendation [41]. For example, when recommending a video to a user, we have no way of knowing how many videos the user will continue to watch on the platform before exiting the current session, and obviously, how this amount will be affected by the recommendation is even harder to know. For this reason, it is challenging to give a reward regarding the impact of recommended videos on the average video consumption of users. Similarly, a single recommendation cannot be used to predict when the user will revisit the platform after exiting the current session, so designing rewards related to the visiting frequency of users is also difficult. As a compromise, one could assign rewards only at the beginning or the end of a session. However, such reinforcement signals will be too sparse for the recommender system to learn a reasonable policy, especially when the session length is large [44]. Existing methods either design the reward function highly empirically [10,44,53] or use immediate engagement signals as surrogates [41], which causes deviation between the optimization objective and the real long-term engagement. For this reason, it is urgent to propose a framework that addresses the difficulties in reward design when using RL to optimize long-term user engagement.

RECOMMENDER SYSTEMS WITH HUMAN PREFERENCES
To resolve the difficulties in designing the reward function, we propose a novel paradigm, Preference-based Recommender systems (PrefRec), which allows RL recommender systems to learn from preferences on users' historical behaviors rather than explicitly defined rewards. In this way, we can overcome the problems in designing the reward function when optimizing long-term user engagement. In this section, we first introduce how to utilize preferences to generate reinforcement signals for learning a recommender system. Then, we discuss how to optimize the performance of PrefRec by using expectile regression to better estimate the value function. Next, we propose to pre-train the reward function to stabilize the learning process. Last, we summarize the algorithm.

Reinforcing from Preferences
While the reward for a recommendation request is hard to obtain, preferences between two trajectories of users' behaviors are easy to determine. For example, if one trajectory shows a transition from low-active to high-active and the other shows an opposite trend or an insignificant change, we can easily indicate the preference between them. Labeling preferences does not require any expert knowledge, so we can easily use techniques such as crowdsourcing to obtain a large amount of preference feedback. In PrefRec, we assume that there is a teacher providing preferences between users' behaviors, and the recommender system uses this feedback to perform the task. There are two main advantages to using preferences: i) labeling the preference between a pair of trajectories is quite simple compared to designing rewards for every step; and ii) the recommender system is incentivized to learn the preferred behavior directly because reinforcement signals come from preferences. We provide the framework of recommender systems with human preferences in Figure 2. As shown, a teacher provides preferences between a pair of users' behavioral trajectories. The teacher could be a human or even a program with labeling criteria. For trajectory 1, the user increases their visiting frequency gradually and consumes more and more items in each session, which indicates that they are satisfied with the current recommendation policy. On the contrary, trajectory 2 shows the user becoming less and less active, suggesting that the current recommender system should be improved to better serve the user. The teacher will obviously prefer trajectory 1 to trajectory 2. After generating such preference data, we use them to automatically train a reward function r̂(s, a; ψ), parameterized by ψ, in an end-to-end manner. The preference-based recommender system uses the predicted reward r̂(s, a; ψ) rather than hand-crafted rewards to update its policy.
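A common way to train such a reward function from pairwise preferences, widely used in preference-based RL, is a Bradley-Terry-style cross-entropy loss over segment returns. The sketch below assumes that form; the paper's exact loss (Eq. 4) may differ in details:

```python
import numpy as np

def preference_loss(r_hat_0, r_hat_1, y):
    """Cross-entropy loss under an assumed Bradley-Terry preference model.

    r_hat_0, r_hat_1: arrays of predicted per-step rewards for the two
    trajectory segments; y = (y(0), y(1)) is the preference label,
    e.g. (0, 1) if segment 1 is preferred, (0.5, 0.5) for a tie.
    """
    s0, s1 = np.sum(r_hat_0), np.sum(r_hat_1)  # predicted segment returns
    # Probability that segment 1 is preferred: softmax over the two returns.
    p1 = np.exp(s1) / (np.exp(s0) + np.exp(s1))
    p0 = 1.0 - p1
    eps = 1e-12  # numerical floor for the logarithm
    return -(y[0] * np.log(p0 + eps) + y[1] * np.log(p1 + eps))
```

Minimizing this loss over the preference buffer pushes the reward model to assign higher cumulative reward to the preferred trajectory, which is exactly the signal the policy later optimizes.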

Optimizing the Recommendation Policy
When learning the recommendation policy, there is a replay buffer D_R storing training data. Unlike conventional RL recommender systems, we store (s, a, s′) rather than (s, a, r, s′) in the buffer D_R because we cannot obtain explicit rewards from the environment (with a slight abuse of notation, we use s′ to denote the next state). We utilize the learned reward model r̂(s, a; ψ) to label the reward for a tuple (s, a). After labelling, an intuitive approach is to learn the Q-function by minimizing the following temporal difference
(TD) error:

    L_TD(θ_Q) = E_{(s,a,s′)∼D_R} [ ( r̂(s,a;ψ) + γ Q(s′, π(s′); θ̄_Q) − Q(s,a;θ_Q) )² ],    (5)

where D_R is the replay buffer, Q(s, a; θ_Q) is a parameterized Q-function, Q(s, a; θ̄_Q) is a target network (e.g., with soft updating of parameters defined via Polyak averaging), and π(·|s) is the recommendation policy. However, in practice, we find that this method does not perform well. It may be because the recommendation policy π(·|s) will choose significantly different actions from the stored data, making the reward function and the Q-function unable to predict the corresponding values. To resolve this, we introduce a separate value function (V-function) that predicts how good or bad a state is. By doing so, we can eliminate the uncertainty that comes with the recommendation policy. Instead of minimizing the TD error in Eq. 5, we turn to minimize the following loss to learn the Q-function:

    L_Q(θ_Q) = E_{(s,a,s′)∼D_R} [ ( r̂(s,a;ψ) + γ V(s′; φ) − Q(s,a;θ_Q) )² ],    (6)

where V(s′; φ) is the V-function with parameters φ. Given the replay buffer D_R, the relationship between the Q-function and the V-function is

    V(s; φ) = E_{a∼D_R} [ Q(s, a; θ̄_Q) ].    (7)

Conventionally, we can optimize the parameters of the V-function by minimizing the following Mean Squared Error (MSE) loss:

    L_V(φ) = E_{(s,a)∼D_R} [ ( Q(s, a; θ̄_Q) − V(s; φ) )² ].    (8)

However, such a V-function corresponds to the behavior policy which collects the replay buffer D_R, and we want to achieve improvement upon the behavior policy. Inspired by expectile regression [23,24], we let the V-function regress the τ expectile (τ ≥ 0.5) of Q(s, a) rather than the mean statistics as in Eq. 7. Then the loss for the V-function becomes:

    L_V(φ) = E_{(s,a)∼D_R} [ L₂^τ( Q(s, a; θ̄_Q) − V(s; φ) ) ],    (9)

where L₂^τ(u) = |τ − 𝟙(u < 0)| u² and 𝟙(·) is the indicator function. In particular, if τ = 0.5, Eq. 8 and Eq. 9 are identical. For τ > 0.5, this asymmetric loss (Eq. 9) downweights the contributions of Q(s, a; θ̄_Q) smaller than V(s; φ) while giving more weight to larger values (as in Fig. 3, left). Fig. 3 (right) illustrates expectile regression on a two-dimensional distribution: increasing τ leads to more data points below the regression curve. Back to the learning of the V-function, the purpose is to let V(s; φ) regress an above-average value. The Q-function is jointly trained with the V-function, while it also serves as the critic to guide the update of the recommendation policy, i.e., the actor. The recommendation policy is optimized by minimizing the loss:

    L_π(θ_π) = −E_{s∼D_R} [ Q( s, π(s; θ_π); θ_Q ) ],    (10)

where π(s; θ_π) is the recommendation policy with parameters θ_π.
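The asymmetric expectile loss, and the fact that τ > 0.5 yields an above-mean statistic, can be checked with a small numerical sketch (NumPy, illustrative only; the paper's networks are trained by gradient descent on the same loss):

```python
import numpy as np

def expectile_loss(u, tau=0.7):
    """Asymmetric L2 loss: |tau - 1(u < 0)| * u**2.

    For tau > 0.5, positive residuals u = Q(s,a) - V(s) are upweighted,
    so V is pulled toward an above-mean expectile of the Q values.
    """
    weight = np.where(u < 0.0, 1.0 - tau, tau)
    return weight * u ** 2

def expectile(values, tau=0.7, iters=200, lr=0.1):
    """Scalar tau-expectile of a sample, found by gradient descent on the
    loss above. tau = 0.5 recovers the ordinary mean."""
    v = float(np.mean(values))
    for _ in range(iters):
        u = values - v
        # Gradient of E[weight * u^2] with respect to v is -2 * E[weight * u].
        grad = -2.0 * np.mean(np.where(u < 0.0, 1.0 - tau, tau) * u)
        v -= lr * grad
    return v
```

For the sample {0, 1, 2, 3, 4}, the 0.5-expectile is the mean (2.0), while the 0.9-expectile lands above it (≈3.23), mirroring how the V-function is pushed toward an above-average value of Q.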

Pre-training the Reward Function
The reward function is used to automatically generate reinforcement signals to train the recommendation policy. However, a reward model that is not well-trained will cause the collapse of the recommendation policy. Thus, to stabilize the training, we propose to pre-train the reward function before starting the updating of the recommendation policy. Specifically, we first prepare a preference buffer that stores a pool of preference feedback between interaction histories. Then we initialize a deep neural network with a multi-layer perceptron structure to learn from the preference buffer. The neural network is updated by minimizing the loss in Eq. 4. After training for several episodes, we start the training of the recommendation policy. At this phase, the reward model is updated simultaneously with the recommender system. By doing so, the reward model can handle potential shifts in human preferences and can therefore generate more accurate learning signals.

Overall Algorithm
We provide the overall algorithm of PrefRec in Algorithm 1. A preference buffer D_P and a replay buffer D_R are required for training PrefRec. In lines 1 to 2, we initialize the deep neural networks and set hyper-parameters such as the soft-update rate.
In lines 3 to 5, we pre-train the reward model to ensure that it can provide reasonable learning signals when updating the recommendation policy. Lines 7 to 18 describe the training process of the recommendation policy. We first sample a mini-batch of transitions from the replay buffer D_R. Then we label the sampled data with the reward model. Following that, we update the parameters of the V-function, the Q-function and the recommendation policy accordingly. Finally, if the reward model is being fine-tuned, we train it to keep it up-to-date.

Discussions
PrefRec differs from both inverse reinforcement learning (IRL) [28] and model-based RL [54]. A key difference between PrefRec and IRL is that IRL requires costly expert demonstrations, while PrefRec only requires simple label-like feedback. The reward function in PrefRec is learned by aligning with human preferences, whereas in IRL it is inferred from expert demonstrations. On the other hand, model-based RL relies on environmental rewards and aims to simplify learning by approximating the reward and transition functions. In contrast, PrefRec does not require environmental rewards and is designed to learn without them.

EXPERIMENTAL RESULTS
We conduct extensive experiments to evaluate our algorithm. In particular, we answer the following research questions (RQs):
• RQ1: Can the framework of PrefRec lead to improvement in long-term user engagement?
• RQ2: Can PrefRec outperform existing state-of-the-art methods?
• RQ3: Can the learned reward signals reflect the true underlying rewards?
• RQ4: How do the components in PrefRec contribute to the performance?

Preparing the dataset
Since PrefRec is a new framework in recommendation, there is no available dataset for evaluation. To prepare the dataset, we track the complete interaction histories of around 100,000 users from a leading short-form video platform for months, during which over 25 million recommendation services are provided. For each user, we record the interaction history in a session-request form: the interaction history consists of several sessions, with each session containing a number of recommendation requests (see Sec. 2.1).
The recommender system provides service at each recommendation request, where we record the state of the user and the action of the recommender system. Each state is a 245-dimensional vector containing the user's state information, such as gender, age and historical like rate. We applied scaling, normalization and clipping to each dimension of the states in order to stabilize the input of models. An action is an 8-dimensional vector which is the representation of the recommended item at a request. We record the timestamp each time a user starts and exits a session. With the timestamps, we can calculate the duration before a user revisits the platform and thus can infer the visiting frequency. We measure changes in long-term user engagement at the end of each session for each user. Specifically, we first calculate the average session depth d̄_u and the average revisiting time h̄_u for each user u during the time span. For each session, d_{u,i} denotes the number of requests in the i-th session of user u. If d_{u,i} is larger than the average session depth d̄_u, we consider that there is an improvement in session depth. Similarly, we can calculate the revisiting time h_{u,i} for session i, and if it is less than the average revisiting time h̄_u, we consider that there is an improvement in visiting frequency. We quantify the changes in session depth and visiting frequency into six levels (level 0 to level 5; the higher the level, the more positive the change), where the levels are determined by Eq. 11. We provide the proportion of the calculated levels of all sessions in Fig. 4: for session depth, very few of them are located in levels 4 and 5, while the visiting frequency demonstrates a more even distribution.
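The per-session improvement checks against a user's own averages can be sketched as follows. The six-level thresholds of Eq. 11 are not reproduced here, so only the binary improvement test is shown:

```python
import numpy as np

def engagement_improvements(session_depths, revisit_times):
    """Flag per-session improvements against the user's own averages.

    d_bar / h_bar are the user's average session depth and revisiting time.
    A session improves depth if d_{u,i} > d_bar, and improves visiting
    frequency if h_{u,i} < h_bar. (The six-level quantization of Eq. 11
    applies thresholds on top of these comparisons, omitted here.)
    """
    d_bar = np.mean(session_depths)
    h_bar = np.mean(revisit_times)
    depth_up = [bool(d > d_bar) for d in session_depths]
    freq_up = [bool(h < h_bar) for h in revisit_times]
    return depth_up, freq_up
```

Comparing each session to the user's own averages normalizes away inter-user differences, so the levels capture relative engagement change rather than absolute activity.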
After processing the data, we can prepare the two buffers, i.e., the preference buffer D_P and the replay buffer D_R, which are used in PrefRec. For the preference buffer D_P, we uniformly sample 20,000 pairs of users who have launched more than 200 recommendation requests on the platform. We set the trajectory segment length to 100 and randomly sample a segment of this length from the interaction histories of the selected users. To generate the preferences, we implement a scripted teacher that provides feedback by comparing the cumulative levels of changes on the trajectory segments. For the replay buffer D_R, we randomly sample 80% of users as the training set and split their interaction histories into transitions (s, a, s′) to fill up the replay buffer D_R.
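A plausible sketch of such a scripted teacher, comparing the cumulative levels of two trajectory segments (the paper's exact labeling rule may differ):

```python
def scripted_teacher(levels_0, levels_1):
    """Preference label from cumulative engagement-change levels.

    levels_0 / levels_1: per-session levels (0-5) along the two trajectory
    segments. Returns (y0, y1): (0, 1) if segment 1 is preferred,
    (1, 0) if segment 0 is preferred, and (0.5, 0.5) on a tie.
    """
    c0, c1 = sum(levels_0), sum(levels_1)
    if c1 > c0:
        return (0.0, 1.0)
    if c0 > c1:
        return (1.0, 0.0)
    return (0.5, 0.5)
```

Because the teacher is a deterministic script over already-computed levels, arbitrarily many preference labels can be produced without human annotation, which is what makes the 20,000-pair preference buffer cheap to build.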

Baselines
We compare PrefRec with a variety of baselines, including reinforcement learning methods for off-policy continuous control (DDPG, TD3, SAC), offline reinforcement learning algorithms (TD3_BC, BCQ, IQL), and imitation learning: • DDPG [26]: An off-policy reinforcement learning algorithm which concurrently learns a Q-function and a deterministic policy. The update of the policy is guided by the Q-function. • TD3 [15]: A reinforcement learning algorithm built upon DDPG. It applies techniques such as clipped double-Q learning, delayed policy updates, and target policy smoothing.
• SAC [18]: An off-policy reinforcement learning algorithm trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy. It also incorporates tricks such as clipped double-Q learning. • TD3_BC [14]: An offline reinforcement learning algorithm designed based on TD3. It applies a behavior cloning (BC) term to regularize the updating of the policy. • BCQ [16]: An offline algorithm which restricts the action space in order to force the agent to behave similarly to the behavior policy. • IQL [24]: An offline reinforcement learning method which uses expectile regression to estimate the value of the best action in a state.
• IL: Imitation learning treats the behaviors in the replay buffer as expert knowledge and learns a mapping from observations to actions by using that knowledge as supervisory signals. Since our work focuses on addressing complex reward engineering when reinforcing long-term engagement and on how to convey human intentions to RL-based recommender systems, we mainly make comparisons with classical RL algorithms. Works like FeedRec [53] emphasize designing DNN architectures and are orthogonal to PrefRec, which focuses on the policy optimization process.

Evaluation Metric
Among the 100,000 users, we sample 80% of them as the training set and the remaining 20% of users constitute the test set. For the test users, we store their complete interaction histories separately, in the session-request format. We adopt Normalised Capped Importance Sampling (NCIS) [37], a widely used standard offline evaluation method [12,17], to evaluate the performance. Formally, the score of a policy π is calculated by

    J_N(π) = ( Σ_{u∈U} Σ_{i∈T_u} ρ_π(i, T_u) · L_{u,i} ) / ( Σ_{u∈U} Σ_{i∈T_u} ρ_π(i, T_u) ),    (12)

where U is the set of test users, T_u is the set of sessions of user u, ρ_π(i, T_u) is the (capped) probability that the policy π follows the request trajectory of the i-th session in T_u, and L_{u,i} is the level of change for the i-th session (as defined in Eq. 11). Intuitively, J_N(π) awards a policy a high score if the policy has a large probability of following a good trajectory.
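The NCIS estimate over a set of sessions can be sketched as a capped, self-normalised weighted average. The capping threshold and the flat aggregation below are assumptions, not the paper's exact configuration:

```python
import numpy as np

def ncis_score(weights, levels, cap=10.0):
    """Normalised Capped Importance Sampling over sessions.

    weights: importance ratios rho_pi(i, T_u) of the evaluated policy on
    each session's request trajectory; levels: the session's engagement-
    change level L_{u,i}. Ratios are capped to bound variance, then
    normalised so the estimate is a weighted average of the levels.
    """
    w = np.minimum(np.asarray(weights, dtype=float), cap)
    return float(np.sum(w * np.asarray(levels, dtype=float)) / np.sum(w))
```

Capping keeps a few huge importance ratios from dominating the estimate, and normalising by the summed weights makes the score a proper weighted mean of session levels, which is what makes NCIS a low-variance (if slightly biased) offline estimator.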

Implementation Details
To ensure fair comparison across all methods and experiments, a consistent network architecture is utilized. This architecture consists of a 3-layer Multi-Layer Perceptron (MLP) with 256 neurons in each hidden layer. The hyper-parameters for the PrefRec method are listed in Table 1. All methods were implemented using the PyTorch framework.
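Reading the "3-layer MLP" as three linear layers (two 256-unit hidden layers with ReLU), a minimal NumPy sketch of the shared architecture might look like this; the initialization scheme and activation are assumptions, not details from the paper:

```python
import numpy as np

def init_mlp(in_dim, out_dim, hidden=256, n_hidden=2, seed=0):
    """Weights/biases for a 3-layer MLP: two hidden layers of `hidden` units."""
    rng = np.random.default_rng(seed)
    dims = [in_dim] + [hidden] * n_hidden + [out_dim]
    # He-style scaling (an assumption) for each (in, out) weight matrix.
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def mlp_forward(params, x):
    """Forward pass: linear layers with ReLU on all but the output layer."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x
```

For the actor, `in_dim` would be the 245-dimensional state and `out_dim` the 8-dimensional action; the critic and V-function reuse the same backbone with different input/output widths.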

Overall Results
We conduct experiments to verify whether the framework of PrefRec can improve long-term user engagement in terms of i) session depth; ii) visiting frequency; and iii) a mixture of both. We randomly sample 500 users from the test set and use them to plot the learning curves of our method and the baselines. As in Fig. 5, PrefRec achieves a significant and consistent increase in cumulative long-term engagement changes in all the tasks, although it does not receive any explicit reinforcement signal. A similar phenomenon can also be observed in some of the generic reinforcement learning algorithms, such as DDPG. They demonstrate growth in specific tasks, though not as stably. The learning curves indicate that the learned reward function is able to provide reasonable reinforcement signals, and that the framework of PrefRec provides an effective training paradigm for reward-free recommendation policy learning. Next, we save the models with the best performance in training and test them on the whole test set. As in Table 2, the generic reinforcement learning algorithms perform poorly without explicit rewards. PrefRec outperforms all the baselines by a wide margin in all three tasks, showing the effectiveness of the proposed optimization methods. Despite utilizing only a single dataset, optimizing session depth and visiting frequency are distinct challenges. The results demonstrate the generalization ability of PrefRec, as it consistently delivers improvements across both tasks. The optimization of session depth and visiting frequency are not necessarily interdependent: an algorithm that produces deep sessions may not result in high visiting frequency, and similarly, high visiting frequency does not guarantee deep sessions (as seen in Fig. 5).

Training Process of the Reward Function
To ensure the validity of the learning signals generated by the reward function, it must accurately predict the preferences that are utilized in the training phase.To evaluate the performance of the reward function, we have plotted its prediction accuracy in Fig. 6.
The results show a noticeable improvement in accuracy as the training process advances. Additionally, we plot the learning loss of the reward function in Fig. 7. The results indicate a substantial decrease in the loss, reducing it to a much lower level compared to its initial value. The improvement in prediction accuracy and the reduction in loss demonstrate that the reward function becomes increasingly effective at predicting the preferences used in the training phase. This is crucial for ensuring that the learning signals generated by the reward function are reliable, enabling the model to learn from the preferences effectively.

Ablations
To understand how the components in PrefRec affect the performance, we perform ablation studies on the expectile regression factor τ and the pre-training/fine-tuning of the reward function. As can be found in Fig. 8, when τ = 0.5, where expectile regression becomes identical to regression to the mean, there is an obvious drop in performance. The phenomenon indicates that expectile regression with a proper τ contributes to the improvement in performance. Moreover, compared with the results of DDPG in Table 2, we can find that introducing a separate V-function also benefits the performance, since DDPG can be considered an algorithm with τ = 0.5 and no separate V-function. Next, we study whether the pre-training and fine-tuning of the reward function are useful. As in Fig. 9, if we directly update the recommendation policy without pre-training the reward function, the performance decreases. Similarly, training the recommendation policy with a frozen reward function also degrades the performance. The results suggest that the pre-training and fine-tuning of the reward function are important to the performance.

RELATED WORK

6.1 Optimizing Long-term User Engagement in Recommendation
Long-term user engagement has recently attracted increasing attention due to its close connection to user stickiness. One of the major difficulties in promoting long-term user engagement is the lack of consistent and informative reward signals. Reward signals can be sparse and change over time, making it difficult to accurately predict and respond to user needs. Zhang et al. [47] augment the sparse rewards by re-weighting the original feedback with counterfactual importance sampling. Wu et al. [42] consider user clicks as immediate reward and a user's return time as a hint on long-term engagement; they also assume a stationary distribution of candidates for recommendation. In contrast, Zhao et al. [49] empirically identify the problem of non-stationary user taste distributions and propose distribution-aware recommendation methods. Other studies focus on maximizing the diversity of recommendations as an indirect approach to optimizing long-term engagement. For example, Adomavicius et al. [1] propose a graph-based approach to maximize diversity based on maximum flow. Ashton et al. [2] propose that high recommendation diversity is related to long-term user engagement. Apart from diversity, Cai et al. [5] conduct large-scale empirical studies and propose several surrogate criteria for optimizing long-term user engagement, including high-quality consumption, repeated consumption, etc. In this paper, we avoid directly learning from the sparse and non-stationary signals inferred from the environment. Instead, we propose to learn a parameterized reward function to optimize long-term user engagement.

Reinforcement Learning for Recommender Systems
Reinforcement learning allows an autonomous agent to interact with the environment and optimize long-term goals from experience, which is particularly suitable for tasks in recommender systems [3,9,27,31,46,53]. Shani et al. [31] first proposed to employ a Markov Decision Process (MDP) to model recommendation behavior.
Subsequent research focused on standard RL problems, e.g., the exploration-exploitation trade-off in the context of recommendation systems [8,38], together with model-based or simulator-based approaches to improve sample efficiency [3,32]. Recently, there has been work on designing proper reward functions for efficient training of recommender systems. For example, Zou et al. [53] use a Q-network with a hierarchical LSTM to model both the instant and long-term reward. Ji et al. [21] model the recommendation process as a Partially Observable MDP and estimate the lifetime values of the recommended items. Zheng et al. [51] explicitly model the future returns by introducing a user activeness score. Ie et al. [20] decompose the Q-function for tractable and more efficient optimization. There is also research focusing on behavior diversity [41,52] as a surrogate for the reward function. Instead of relying on the aforementioned handcrafted reward signals, in this paper we propose to automatically train a reward function based on preferences between users' behavioral trajectories, which avoids the difficulties in reward engineering.

CONCLUSIONS
In this paper, we propose PrefRec, a novel paradigm of recommender systems, to improve long-term user engagement. PrefRec allows RL recommender systems to learn from preferences between users' historical behaviors rather than explicitly defined rewards. In this way, we can fully exploit the advantages of RL in optimizing long-term goals while avoiding complex reward engineering. PrefRec uses the preferences to automatically learn a reward function. Then the reward function is applied to generate reinforcement signals for training the recommendation policy. We design an effective optimization method for PrefRec, which utilizes an additional value function, expectile regression and reward function pre-training to enhance the performance. Experiments demonstrate that PrefRec significantly and consistently outperforms the current state-of-the-art on a variety of long-term user engagement optimization tasks.

Figure 2 :
Figure 2: The framework of recommender systems with human preferences. A teacher provides feedback about his/her preferences between users' behavioral trajectories. Trajectory 1 demonstrates a trend from low-active to high-active and trajectory 2 shows an opposite tendency. Therefore, the teacher will prefer trajectory 1 to trajectory 2. With the preference feedback, we can automatically train a reward function in an end-to-end manner. The preference-based recommender system is then optimized by using rewards predicted by the learned reward function.

Figure 3 :
Figure 3: Left: Illustration of the expectile loss function L₂^τ(u). τ = 0.5 corresponds to the MSE loss, and τ > 0.5 upweights positive differences u. Center: Expectiles of a Gaussian distribution. Right: An example of expectile regression on a two-dimensional distribution.
where y(0) and y(1) are the first and second elements of the preference label y, respectively. After learning the reward model r̂(s, a; ψ), we can use it to generate learning signals for training the recommendation policy.

Figure 4 :
Figure 4: The proportion of the levels in session depth and visiting frequency.

Figure 5 :
Figure 5: Learning curves of RL recommender systems under the framework of PrefRec, averaged over 5 runs.

Figure 9 :
Figure 9: Ablations on reward function pre-training and reward function fine-tuning.

• P : S × A × S → [0, 1] is the transition function, where P(s_{t+1} | s_t, a_t) defines the probability that the next state is s_{t+1} after recommending an item a_t at the current state s_t.
• R : S × A → ℝ is the reward function; r(s_t, a_t) determines how much the agent will be rewarded after recommending a_t at state s_t.
• γ ∈ [0, 1) is the discount factor.

Table 2 :
Overall performance comparisons on various long-term user engagement optimization tasks. The "±" indicates 95% confidence intervals.