User Response Modeling in Reinforcement Learning for Ads Allocation

User response modeling can enhance the learning of user representations and further improve the reinforcement learning (RL) recommender agent. However, since users' behaviors are influenced by both their long-term preferences and short-term stochastic factors (e.g., weather, mood, or fashion trends), capturing them remains challenging for previous works that rely on recurrent neural network-based user response modeling. Meanwhile, because users' interests are dynamic, it is often unrealistic to assume that the dynamics of users are stationary. Drawing inspiration from opponent modeling, we propose a novel network structure, Deep User Q-Network (DUQN), which incorporates a user response probabilistic model into the Q-learning ads allocation strategy to capture the effect of the non-stationary user policy on Q-values. Moreover, we utilize the Recurrent State-Space Model (RSSM) to develop the user response model, which includes deterministic and stochastic components, enabling us to fully consider users' long-term preferences and short-term stochastic factors. In particular, we design a RetNet version of RSSM (R-RSSM) to support parallel computation. The R-RSSM model can further be used for multi-step predictions, enabling bootstrapping over multiple steps simultaneously. Finally, we conduct extensive experiments on a large-scale offline dataset from the Meituan food delivery platform and a public benchmark. Experimental results show that our method outperforms state-of-the-art (SOTA) baselines. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on the industrial Meituan platform, serving more than 500 million customers.


INTRODUCTION
Displaying a mixed feed of organic and advertised items to users is popular for most Internet recommendation platforms [8,20]. Typically, users show more interest in organic items in the feed, as they are displayed based on ranking technologies and are more likely to meet users' needs. An excess of advertisements (ads) may harm the user experience [38]. Hence, how to allocate the limited slots between organic and advertised items to maximize the platform's overall revenue has become a fundamental problem for recommender systems. The related recommendation algorithms for ads allocation have evolved drastically over the years, starting from fixed slot insertion strategies [18,26] to classic dynamic slot algorithms (e.g., Bellman-Ford, unified rank score) [17,19].

Figure 1: (a) Architectures of an RL agent with user representation [21,40], an RL agent with an auxiliary user response prediction task [3,33], and our proposed DUQN (please see the detailed structure in Fig. 2); (b) Structure of an ads allocation system.
Recently, RL methods have gained much attention and have demonstrated significant advancements for long-term engagement in recommender systems [2,14,42]. Due to expensive online interactions, these methods typically rely on logged data for training [36]. For policy gradient RL, Chen et al. [2] utilize a REINFORCE agent to handle a massive state-action space and apply off-policy correction to address the batch learning effect. In [4,22,41], the Actor-Critic (A-C) structure is adapted to learn the recommender policy. For value-based RL, DRN [43] utilizes the Deep Q-Network (DQN) recommendation framework to achieve a long-term return. CrossDQN [21] proposes a novel DQN framework to dynamically adjust the number and the slots of ads in the feed. DPIN [20] models the page-level user preference and improves the DQN-based agent.
To serve an extremely large user base, improving RL algorithms to fully utilize the logged user responses is critical when building recommender agents with a large state space (huge item corpus). Different user response modeling approaches have been proposed for the RL recommender agent. The most common methods utilize a recurrent neural network (RNN) or Transformer to capture user interest representations in the form of hidden states from historical sequential responses [21,40]. To speed up representation and policy learning, [3,33] add auxiliary tasks that predict users' responses toward recommendations, such as click-through rate (CTR) prediction, trained on supervised and RL signals. However, previous works still face two challenges. On the one hand, users' behaviors are influenced not only by their long-term preferences but also by short-term stochastic factors (e.g., weather, mood, or fashion trends), which are typically neglected by the deterministic hidden states in RNN-based user response models [3,40]. On the other hand, the user is an active agent with dynamic interests whose responses affect the state transition and environmental rewards during the user-system interaction process, and it is often unrealistic to assume the dynamics of users are stationary [20,21,33]. In contrast to previous work, we argue that an RL agent should be able to predict user responses and capture the effect of these dynamic responses on the value estimator, just as a self-driving car must avoid accidents by predicting where surrounding vehicles and pedestrians are going. Inspired by opponent modeling in the multi-agent setting [12], we treat the user response model as an opponent model and integrate it into the RL recommender agent to improve recommendation performance.
To overcome the above challenges, we propose a novel network structure, Deep User Q-Network (DUQN), which incorporates a user response probabilistic model into the Q-learning ads allocation policy. The differences from previous architectures are shown in Fig. 1(a). To adequately account for the influence of both long-term preferences and short-term stochastic factors, we utilize the Recurrent State-Space Model (RSSM) [9] to develop the user response probabilistic model. In addition, we design a RetNet [27] version of RSSM (R-RSSM) to support parallel computation. Based on the R-RSSM, multi-step predictions can further be utilized to alleviate the issue of sparse positive responses from users in real logged datasets and to facilitate rapid bootstrapping of reward information over multiple steps, potentially mitigating the overestimation of Q-values [24]. Our contributions can be summarized as follows:
• A novel network structure for ads allocation. We introduce DUQN for ads allocation, which integrates a user response probabilistic model into a DQN-based agent to capture the effect of the non-stationary user policy on Q-values.
• User response modeling. We utilize the RSSM for user response modeling to consider users' long-term preferences and short-term stochastic factors, and design a RetNet version of RSSM to optimize model latency. Multi-step predictions based on the R-RSSM model further improve the RL ads allocation policy.
• Offline and live industrial experiments. Our model outperforms SOTA baselines on both Meituan offline experiments and the public RL4RS dataset [31]. Furthermore, we deploy the proposed method in live experiments on the Meituan platform, serving millions of users, to showcase its benefit in enhancing overall income.

PROBLEM FORMULATION
Here we consider the ads allocation session in sequential recommendation. As shown in Fig. 1(b), when a user starts a session, the ads allocation server fills $K$ slots on one screen (i.e., the target page) from the candidate item pool, which includes the ad and organic item (oi) sequences. As the key component of the environment, the user provides a response to the displayed screen, such as ordering a recommended item or leaving the current session. Then, the environment returns the next state and a reward to the agent. An ideal recommender agent handles the allocation for each screen in the feed sequentially so as to maximize the platform's overall revenue. In general, ads allocation is formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, whose elements are defined as follows:
• State space $\mathcal{S}$. A state $s_t \in \mathcal{S}$ consists of item states and user states, where item states include the candidate items (i.e., the ads and organic item sequences), and user states include user profile features (e.g., age, gender), context features (e.g., order time, order location), and the user's page-level historical response sequence $o_{1:t} = \{o_1, \ldots, o_t\}$ (i.e., order, click, pulldown, and leave).
• Action space $\mathcal{A}$. An action $a_t \in \mathcal{A}$ is the decision of whether to display an ad on each slot of the current screen, which can be written as $a_t = (a^1_t, \ldots, a^K_t)$ with $a^k_t \in \{0, 1\}$ indicating whether slot $k$ displays an ad. For each action, the corresponding target page is displayed to the user. Note that we do not change the order of the items within the ads sequence and the organic item sequence.
• Reward $R$. In the reward $r_t$ defined in Eq. (2), $\eta$ is a hyperparameter, and the response-dependent component of the reward is determined by the user response $o_t$ on the target page, taking the value 2 for a click followed by a purchase, 1 for a click without a purchase, and 0 for neither a click nor a purchase.
• Transition probability $P$. As shown in Fig. 1(b), state $s_t$ transitions to the next state $s_{t+1}$ upon the agent taking an action $a_t$ and the user giving a response $o_t$, so the transition probability can be written as $P(s_{t+1} \mid s_t, a_t, o_t)$. As in [20,21,33], if the user's dynamics are assumed to follow a stationary policy $\pi^u(o_t \mid s_t, a_t)$, the transitions and reward can be redefined by marginalizing over the user response:

$$P(s_{t+1} \mid s_t, a_t) = \sum_{o_t} \pi^u(o_t \mid s_t, a_t)\, P(s_{t+1} \mid s_t, a_t, o_t), \qquad R(s_t, a_t) = \sum_{o_t} \pi^u(o_t \mid s_t, a_t)\, R(s_t, a_t, o_t).$$

Then, based on the standard MDP above, one approach to obtaining an RL policy is to optimize the Q-function $Q^\pi(s_t, a_t) \equiv \mathbb{E}\big[\sum_{k \ge 0} \gamma^k r_{t+k} \mid s_0 = s_t, a_0 = a_t, \pi\big]$ by solving the Bellman equation with the greedy policy:

$$Q^*(s_t, a_t) = \mathbb{E}\big[\, r_t + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \,\big].$$

A standard Q-learning agent [20,21] aims to ascertain an ads allocation policy by finding the optimal Q-values without knowledge of $P$. Given observed transitions $(s_t, a_t, r_t, s_{t+1})$, the Q-values are updated iteratively:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big( r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \big),$$

where $\alpha$ is the step-size parameter.
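For reference, here is a minimal numpy sketch of this tabular update rule; the state and action counts, indices, and rewards are purely illustrative placeholders.

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9              # step size and discount factor

def q_update(Q, s, a, r, s_next):
    """One iteration of the standard Q-learning update on an observed transition."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])

q_update(Q, s=3, a=1, r=1.0, s_next=4)
```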

DEEP USER Q-NETWORK
As users' responses are influenced by their long-term preferences and short-term stochastic factors (e.g., weather, mood, or fashion trends), it is unrealistic to assume the dynamics of users are stationary. In this section, we first analyze the effect of user response modeling on Q-values from the perspective of opponent modeling; then we present RSSM and R-RSSM for user response modeling.

User Response Modeling in Q-learning
For opponent modeling in RL [12], the opponent policy changes over time, and the goal of RL is to learn an optimal policy for the agent given interactions with the opponent's policy in the environment. Motivated by this, we aim to find an optimal policy conditioned on the non-stationary user policy. The optimal Q-function relative to the user policy, $Q_{\cdot \mid \pi^u}(s_t, a_t)$, is defined as $Q^*_{\cdot \mid \pi^u} = \max_\pi Q^\pi_{\cdot \mid \pi^u}(s_t, a_t)$ for all $s_t \in \mathcal{S}$ and all $a_t \in \mathcal{A}$. The following recurrent relation between Q-values holds:

$$Q_{\cdot \mid \pi^u}(s_t, a_t) = \mathbb{E}_{o_t \sim \pi^u(\cdot \mid s_t, a_t)} \Big[ R(s_t, a_t, o_t) + \gamma\, \mathbb{E}_{s_{t+1}} \big[ \max_{a_{t+1}} Q_{\cdot \mid \pi^u}(s_{t+1}, a_{t+1}) \big] \Big]. \qquad (6)$$

The left part of Eq. (6) can be written as $\sum_{o_t} \pi^u(o_t \mid s_t, a_t)\, Q(s_t, a_t, o_t)$, an expectation over different user responses. Following Q-learning [34], given observed transitions $(s_t, a_t, o_t, r_t, s_{t+1})$, we can update the Q-values recursively without knowledge of $P$:

$$Q(s_t, a_t, o_t) \leftarrow Q(s_t, a_t, o_t) + \alpha \Big( r_t + \gamma \max_{a_{t+1}} \sum_{o_{t+1}} \pi^u(o_{t+1} \mid s_{t+1}, a_{t+1})\, Q(s_{t+1}, a_{t+1}, o_{t+1}) - Q(s_t, a_t, o_t) \Big). \qquad (7)$$

According to Eq. (7), the current user response is treated as a joint action, and the user response probabilistic model $\pi^u(o_t \mid s_t, a_t)$ is incorporated into Q-learning to capture the effect of the non-stationary user policy on response-specific Q-values. Note that we cannot use a greedy policy for the user response when calculating the Q-value, i.e., $o_t = \arg\max_{o_t} Q(s_t, a_t, o_t)$, since users' behaviors cannot be controlled.
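To make Eq. (7) concrete, here is a small numpy sketch of the update on a response-conditioned Q-table, with a hand-specified distribution standing in for the learned user response model $\pi^u$; all sizes, indices, and probabilities are illustrative assumptions.

```python
import numpy as np

n_states, n_actions, n_responses = 10, 4, 4   # responses: order, click, pulldown, leave
Q = np.zeros((n_states, n_actions, n_responses))
alpha, gamma = 0.1, 0.9

def user_policy(s, a):
    """Stand-in for the learned user response model pi^u(o | s, a)."""
    return np.array([0.05, 0.25, 0.30, 0.40])  # illustrative probabilities

def duqn_update(Q, s, a, o, r, s_next):
    """Eq. (7): bootstrap on the response-averaged Q-value of the next state."""
    # Expected Q over user responses for every next action, then greedy over actions.
    expected_q = np.array([user_policy(s_next, a_next) @ Q[s_next, a_next]
                           for a_next in range(n_actions)])
    td_target = r + gamma * expected_q.max()
    Q[s, a, o] += alpha * (td_target - Q[s, a, o])

duqn_update(Q, s=3, a=1, o=2, r=1.0, s_next=4)
```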
As the user's response policy in Eq. (7) is unknown, we use a user response model to represent the user policy. We propose the deep user Q-network to model $Q_{\cdot \mid \pi^u}$ and $\pi^u$ jointly; the detailed DUQN structure for the ads allocation policy is shown in Fig. 2. The representation module first generates the page-level representation $e^p_t$, the context feature embedding $e^c_t$, and the user profile feature embedding $e^u_t$, which are concatenated into $x_t = (e^p_t \,\|\, e^c_t \,\|\, e^u_t)$ and fed into the R-RSSM and the Concatenation Layer simultaneously, where $\|$ means concatenation. Lastly, the Concatenation Layer's output, combined with each user response (i.e., order, click, pulldown, leave, or $o$, $c$, $p$, $l$ for short), is used to predict the response-specific Q-values $Q_o$, $Q_c$, $Q_p$, $Q_l$ (i.e., $Q(s_t, a_t, o_t)$ for $o_t \in \{o, c, p, l\}$), which are summed according to the user response model to obtain the final Q-value relative to the user policy, $Q_{\cdot \mid \pi^u}$.
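As a minimal illustration of this Q-head, the TensorFlow sketch below predicts four response-specific Q-values and combines them under a given response distribution; the trunk sizes and the random stand-ins for the concatenated features and for $\pi^u$ are assumptions made only for the example, not the deployed architecture.

```python
import tensorflow as tf

n_responses = 4   # order, click, pulldown, leave

class DUQNHead(tf.keras.Model):
    """Response-specific Q-values combined by the predicted user-response distribution."""
    def __init__(self, hidden=(128, 64, 32)):
        super().__init__()
        self.trunk = tf.keras.Sequential(
            [tf.keras.layers.Dense(h, activation="relu") for h in hidden])
        # One scalar Q-value per user response.
        self.q_heads = [tf.keras.layers.Dense(1) for _ in range(n_responses)]

    def call(self, state_action_repr, response_probs):
        h = self.trunk(state_action_repr)                                         # [B, 32]
        q_per_response = tf.concat([head(h) for head in self.q_heads], axis=-1)   # [B, 4]
        # Final Q relative to the user policy: expectation over user responses.
        return tf.reduce_sum(response_probs * q_per_response, axis=-1)            # [B]

head = DUQNHead()
repr_batch = tf.random.normal([8, 256])                              # stand-in features
probs = tf.nn.softmax(tf.random.normal([8, n_responses]), axis=-1)   # stand-in for pi^u
q_values = head(repr_batch, probs)
```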

RSSM for User Response Modeling
Different from the previous CTR-prediction auxiliary task based on an RNN model [3,33] or opponent modeling based on a Mixture-of-Experts network [12], we employ the RSSM [9] as the user response model to capture the user's long-term preferences and short-term stochastic factors. The components are as follows:

Encoder. The page-level representation $e^p_t$ is combined with the context features $e^c_t$ and the user profile features $e^u_t$ to form $x_t$. This $x_t$, together with the deterministic latent state $h_t$, is then input into the Encoder to generate the posterior of the stochastic latent state $z_t$:

$$[\mu_t, \sigma_t] = \mathrm{MLP}_{enc}(h_t, x_t), \qquad z_t \sim \mathcal{N}(\mu_t, \sigma_t),$$

where MLP denotes a Multi-Layer Perceptron, $\mu_t$ refers to the mean of $z_t$, $\sigma_t$ represents the standard deviation of $z_t$, $\mathcal{N}(\cdot, \cdot)$ refers to the normal distribution, and $\sim$ denotes sampling.
Deterministic state model. The deterministic hidden state is handled using a Gated Recurrent Unit (GRU) [6] to maintain long-term memory of key information. Hence, $h_t$ can be computed from $h_{t-1}$ and $x_{t-1}$ as follows:

$$h_t = \mathrm{GRU}(h_{t-1}, x_{t-1}). \qquad (10)$$

Note that, unlike the RSSM in [9-11], we do not employ $h_t = \mathrm{GRU}(h_{t-1}, z_{t-1}, a_{t-1})$ here. This is because our pages are generated by sorting candidate items based on the corresponding action; thus $x_{t-1}$ already includes the cross-information between $a_{t-1}$ and the relevant ads and organic items. The initial states $h_0$ and $z_0$ are zero vectors with matching dimensions.
Stochastic state model. By predicting the prior of $z_t$ using only $h_t$, the agent learns the stochastic dynamics of the user response:

$$[\hat{\mu}_t, \hat{\sigma}_t] = \mathrm{MLP}_{prior}(h_t), \qquad \hat{z}_t \sim \mathcal{N}(\hat{\mu}_t, \hat{\sigma}_t),$$

where $\hat{\mu}_t$ refers to the mean of $\hat{z}_t$ and $\hat{\sigma}_t$ represents the standard deviation of $\hat{z}_t$. Once the prior and posterior distributions of $z_t$ are acquired, we employ a Kullback-Leibler (KL) divergence loss to draw the two distributions closer iteratively. This trains the prior and regulates the amount of information the posterior assimilates from $x_t$. This regularization strategy enhances the model's adaptability to new inputs $x_t$, and it also fosters the utilization of $h_t$-derived information for both user response modeling and representation reconstruction, thereby effectively capturing long-term dependencies [10]:

$$\mathcal{L}_{KL} = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \mathrm{KL}\big[\, q(z_t \mid h_t, x_t) \,\|\, p(\hat{z}_t \mid h_t) \,\big],$$

where $\mathcal{B}$ denotes a batch of transitions.

User response predictor. Each page in the user's page-level response sequence is labeled with the corresponding user response. We then use both the stochastic and deterministic latent states to predict the current user response as a purely supervised learning task:

$$\pi^u(o_t \mid s_t, a_t) = \mathrm{softmax}\big(\mathrm{MLP}_{resp}(h_t, z_t)\big).$$

Note that we can use $\hat{z}_t$ instead of $z_t$ for multi-step imagination, as in Dreamer [9]. Here $\pi^u(o_t \mid s_t, a_t)$ represents the predicted probability distribution over user responses: order, click, pull-down, or leave. The loss function for this prediction module is defined as follows:

$$\mathcal{L}_{resp} = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \mathrm{CE}\big(\pi^u(o_t \mid s_t, a_t),\, o^{truth}_t\big),$$

where $o^{truth}_t$ represents the ground truth of the user response and CE denotes Cross Entropy.
Reconstruction predictor. Historical pages, context features, and user profiles hold crucial information such as delivery fee, delivery time, price, ordering time, and basic user details, all of which are vital for extracting user interests. We incorporate a Reconstruction Predictor that reconstructs these essential features $x_t$ to prevent information loss and preserve critical information in our model's representations. The corresponding reconstruction loss function is:

$$\mathcal{L}_{rec} = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \mathrm{MSE}\big(\hat{x}_t, x_t\big), \qquad \hat{x}_t = \mathrm{MLP}_{rec}(h_t, z_t),$$

where MSE denotes Mean Squared Error.
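To make the pipeline above concrete, the following TensorFlow sketch wires the components together for a single step: posterior encoder, GRU-based deterministic state, prior, user response predictor, and reconstruction predictor. All layer sizes, the softplus parameterization of the standard deviations, and the feature dimensions are placeholder assumptions rather than the deployed configuration.

```python
import tensorflow as tf

class UserResponseRSSM(tf.keras.Model):
    """One-step RSSM for user response modeling (illustrative sizes only)."""
    def __init__(self, det_dim=64, sto_dim=32, feat_dim=128, n_responses=4):
        super().__init__()
        self.gru = tf.keras.layers.GRUCell(det_dim)          # deterministic path
        self.post_net = tf.keras.layers.Dense(2 * sto_dim)   # posterior q(z_t | h_t, x_t)
        self.prior_net = tf.keras.layers.Dense(2 * sto_dim)  # prior p(z_t | h_t)
        self.resp_net = tf.keras.layers.Dense(n_responses)   # user response predictor
        self.rec_net = tf.keras.layers.Dense(feat_dim)       # reconstruction predictor

    @staticmethod
    def _sample(stats):
        mean, std = tf.split(stats, 2, axis=-1)
        std = tf.nn.softplus(std) + 1e-4
        return mean, std, mean + std * tf.random.normal(tf.shape(mean))

    def step(self, h_prev, x_prev, x_t):
        h_t, _ = self.gru(x_prev, [h_prev])                          # h_t = GRU(h_{t-1}, x_{t-1})
        post_m, post_s, z_t = self._sample(self.post_net(tf.concat([h_t, x_t], -1)))
        prior_m, prior_s, _ = self._sample(self.prior_net(h_t))
        resp_logits = self.resp_net(tf.concat([h_t, z_t], -1))       # logits of pi^u(o_t | s_t, a_t)
        x_rec = self.rec_net(tf.concat([h_t, z_t], -1))              # reconstruction of x_t
        # Diagonal-Gaussian KL(posterior || prior), summed over the latent dimension.
        kl = tf.reduce_sum(tf.math.log(prior_s / post_s)
                           + (post_s**2 + (post_m - prior_m)**2) / (2.0 * prior_s**2) - 0.5, -1)
        return h_t, z_t, resp_logits, x_rec, kl

model = UserResponseRSSM()
B, F = 8, 128
h0 = tf.zeros([B, 64])
x_prev, x_t = tf.random.normal([B, F]), tf.random.normal([B, F])
h1, z1, logits, x_rec, kl = model.step(h0, x_prev, x_t)
```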

RetNet-based RSSM for Latency Optimizing
Although the RSSM has significant advantages in modeling user responses, the recursive nature of the GRU hinders effective parallelization of computations, a substantial drawback for internet service platforms requiring immediate responses. RetNet, which allows for parallel computation, has a lower inference cost compared to the Transformer or GRU [28]. As shown in Fig. 2, we further design a RetNet-based RSSM to reduce the overall latency of the model; for simplicity, we denote it as R-RSSM. To the best of our knowledge, this is the first time RetNet has been applied to RL agents in recommender systems. Compared to the GRU-based model in Eq. (10), the main difference lies in the deterministic state model:

$$h_t = \mathrm{RetNet}(x_{1:t}).$$

Here, $h_t$ already contains information from $x_t$, hence our encoder and stochastic state model become:

$$[\mu_t, \sigma_t] = \mathrm{MLP}_{enc}(h_t), \quad z_t \sim \mathcal{N}(\mu_t, \sigma_t), \qquad [\hat{\mu}_t, \hat{\sigma}_t] = \mathrm{MLP}_{prior}(h_{t-1}), \quad \hat{z}_t \sim \mathcal{N}(\hat{\mu}_t, \hat{\sigma}_t).$$
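The sketch below illustrates, with a single-head simplification of retention, why a RetNet-style deterministic path supports both a parallel (training-time) form and an O(1)-per-step recurrent (serving-time) form; the dimensions and decay value are assumptions, and the multi-scale, multi-head retention of RetNet [27,28] is omitted.

```python
import numpy as np

def retention_parallel(X, Wq, Wk, Wv, decay=0.9):
    """Parallel form: all time steps at once (used during offline training)."""
    T = X.shape[0]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Decay mask D[t, s] = decay^(t-s) for s <= t, 0 otherwise (causal retention).
    idx = np.arange(T)
    D = np.where(idx[:, None] >= idx[None, :], decay ** (idx[:, None] - idx[None, :]), 0.0)
    return ((Q @ K.T) * D) @ V                       # [T, d]

def retention_recurrent(x_t, S_prev, Wq, Wk, Wv, decay=0.9):
    """Recurrent form: O(1) per step (used for low-latency online serving)."""
    q, k, v = x_t @ Wq, x_t @ Wk, x_t @ Wv
    S_t = decay * S_prev + np.outer(k, v)            # running key-value state
    return q @ S_t, S_t                              # output h_t and updated state

rng = np.random.default_rng(0)
d, T = 16, 5
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
X = rng.normal(size=(T, d))

H_parallel = retention_parallel(X, Wq, Wk, Wv)
S, H_recurrent = np.zeros((d, d)), []
for t in range(T):
    h_t, S = retention_recurrent(X[t], S, Wq, Wk, Wv)
    H_recurrent.append(h_t)
print(np.allclose(H_parallel, np.stack(H_recurrent)))  # True: both forms agree
```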

Offline Training
We train the RL agent using an offline dataset $\mathcal{D}$ collected by an online exploratory policy. During each iteration, a batch of transitions $\mathcal{B}$ is sampled from the offline dataset $\mathcal{D}$. The model is updated via gradient back-propagation according to the following loss:

$$\mathcal{L} = \mathcal{L}_Q + \lambda_1 \mathcal{L}_{KL} + \lambda_2 \mathcal{L}_{resp} + \lambda_3 \mathcal{L}_{rec}, \qquad (19)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are coefficients, and $\mathcal{L}_Q$ denotes the loss of the Q function with experience replay:

$$\mathcal{L}_Q = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \Big( r_t + \gamma \max_{a_{t+1}} \sum_{o_{t+1}} \pi^u(o_{t+1} \mid s_{t+1}, a_{t+1})\, Q(s_{t+1}, a_{t+1}, o_{t+1}) - Q(s_t, a_t, o_t) \Big)^2.$$

To alleviate the issue of sparse positive responses from users in real logged datasets [5,7,30], we further utilize the predicted probabilities of user responses to calculate the reward $r_t$ using Eq. (2), and apply multi-step Q-learning [29] to enable bootstrapping over multiple steps simultaneously. With the epsilon-greedy behavior policy $\pi_b$ and the target policy $\pi$, importance sampling can correct the bias in Q-value estimation caused by the difference between $\pi_b$ and $\pi$. The multi-step version of the Q-function loss is then given by:

$$\mathcal{L}^{(n)}_Q = \frac{1}{|\mathcal{B}|} \sum_{t \in \mathcal{B}} \rho_{t:t+n-1} \Big( \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a_{t+n}} \sum_{o_{t+n}} \pi^u(o_{t+n} \mid s_{t+n}, a_{t+n})\, Q(s_{t+n}, a_{t+n}, o_{t+n}) - Q(s_t, a_t, o_t) \Big)^2, \qquad (21)$$

where $\rho_{t:t+n-1} = \prod_{k=1}^{n-1} \pi(a_{t+k} \mid s_{t+k}) / \pi_b(a_{t+k} \mid s_{t+k})$ is the importance-sampling correction. Based on the conclusions in [13], multi-step DQN without off-policy correction may not negatively affect model performance; hence we dismiss $\rho_{t:t+n-1}$ in Eq. (21) for training efficiency in our experiments.
Correspondingly, we can obtain the multi-step version of DUQN by replacing $\mathcal{L}_Q$ with $\mathcal{L}^{(n)}_Q$ in Eq. (19).
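As a toy illustration of the n-step target used in Eq. (21) with the importance-sampling ratio dropped (as in our experiments), the numpy sketch below bootstraps on the response-averaged Q-value of the state n steps ahead; all rewards, Q-values, and probabilities are random placeholders.

```python
import numpy as np

gamma, n = 0.9, 3
rng = np.random.default_rng(0)

rewards = rng.random(n)                                             # r_t, ..., r_{t+n-1}
n_actions, n_responses = 4, 4
q_next = rng.random((n_actions, n_responses))                       # Q(s_{t+n}, a, o)
resp_probs = rng.dirichlet(np.ones(n_responses), size=n_actions)    # pi^u(o | s_{t+n}, a)

def n_step_target(rewards, q_next, resp_probs, gamma):
    """n-step return bootstrapped on the response-averaged Q of state s_{t+n}."""
    bootstrap = max(resp_probs[a] @ q_next[a] for a in range(len(q_next)))
    return sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**len(rewards) * bootstrap

q_pred = 0.5                                   # Q(s_t, a_t, o_t) under the current network
loss = (n_step_target(rewards, q_next, resp_probs, gamma) - q_pred) ** 2
```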

EXPERIMENTS
4.1 Experimental Setup on Meituan Platform
4.1.1 Datasets. We collected a large-scale dataset including 12 million requests from the Meituan food delivery platform using a random exploratory policy.

4.1.2 Offline Evaluation Metrics. Similar to previous works [20,21], we evaluate the model using crucial revenue and experience indicators. Revenue indicators include ads revenue and service fees. The ads revenue is calculated from advertiser payments using the Generalized Second Price model and is charged per click. The service fee is a fixed percentage charged on merchant orders. Experience indicators, which evaluate how well user needs are fulfilled, include the average conversion rate and the user experience score. The conversion rate is the order-to-request ratio, i.e., the number of orders divided by the number of requests. The user experience score, an aggregate experience score normalized by the number of requests, measures user demand satisfaction.
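As a toy illustration of the experience indicators, the snippet below computes a conversion rate and an average experience score from a handful of logged requests; the field names and values are hypothetical.

```python
# Toy computation of the experience indicators from logged requests.
# Field names and values are hypothetical, for illustration only.
requests = [
    {"ordered": True,  "experience_score": 2},
    {"ordered": False, "experience_score": 0},
    {"ordered": True,  "experience_score": 1},
    {"ordered": False, "experience_score": 1},
]

conversion_rate = sum(r["ordered"] for r in requests) / len(requests)
user_experience_score = sum(r["experience_score"] for r in requests) / len(requests)
print(conversion_rate, user_experience_score)   # 0.5 1.0
```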
4.1.3 Hyperparameters. Our method is implemented using TensorFlow, and hyperparameters are determined via grid search. Our model's MLPs have hidden layer sizes of (128, 64, 32). We use a learning rate of $10^{-3}$ with the Adam optimizer [16] and set the batch size to 8192. In terms of specific hyperparameters, we use $\lambda_1 = 0.05$, $\lambda_2 = 0.05$, $\lambda_3 = 0.01$, the multi-step length $n = 2$, and $\eta = 0.1$ to facilitate effective learning. For more dataset details and hyperparameter analysis, please refer to the relevant content in Appendix A.

4.2 Offline Experiment on Meituan Dataset
In this subsection, we adopt DPIN [20] as the base representation module. Our proposed model DUQN is trained using offline data, and its performance is assessed through an offline estimator. Our primary inquiries focus on how our method fares compared to established baselines and on the influence of our different design choices.

4.2.1 Baselines. We compare our method with 7 representative ads allocation strategies:
• GEA [37]. A non-RL dynamic ad slot strategy, considering ad intervals and jointly ranking ads and organic items via a rank score.
• CTLRL [32]. Uses a dual-layer RL architecture for ads allocation, translating platform-level constraints into hour-level and request-level constraints.
• HRL-Rec [35]. An RL-based approach dividing the recommender system into two tasks and addressing them through hierarchical RL.
• DEAR [39]. An RL-based approach using a DQN for dynamic ads allocation across three interconnected tasks.
• CrossDQN [21]. An advanced RL-based method using crossed state-action pairs as inputs, focusing on slot allocation within a single screen at a time.
• DPIN-A. The current SOTA DPIN baseline, which incorporates a variety of auxiliary tasks [33] including user response prediction and reconstruction of key information.

4.2.2 Performance Comparison. Keeping ads exposure consistent across methods for a fair comparison, we analyze the offline experimental results in Tab. 1. Our method DUQN outperforms the top baselines on both revenue and experience indicators. Specifically, the improvements over the best baseline DPIN-A on the revenue and experience indicators are 4.14%, 3.74%, 2.20%, and 2.03%, respectively. The offline results indicate that the DUQN framework with user responses modeled through R-RSSM surpasses the DPIN-A method, which employs multiple auxiliary tasks to accelerate learning.

We test 6 ablated variants of our approach to assess the different components:
• -STO: Excludes stochastic factors in user response modeling, i.e., no stochastic state model in the R-RSSM.
• -KL: Does not use the KL divergence loss to align the prior and posterior distributions of the stochastic latent state.
• -REC: No reconstruction predictor in the R-RSSM.
• -MULTI: Does not use the multi-step Q-function loss.
Fig. 3 displays the performance differences between the + and − variants. From these differences, we infer the following.

User response prediction. Fig. 3(a) shows the impact of the different ablations on the accuracy of user response prediction. First, the page-level user response prediction accuracy of R-RSSM is 79.83%, higher than the 71.47% of a conventional RNN-based response model, affirming the efficacy of the proposed framework. Compared with the original GRU-based RSSM, the results indicate that using RetNet in place of the GRU does not hurt the predictive accuracy of RSSM; on the contrary, thanks to the enhanced representational capabilities of RetNet, R-RSSM achieves more accurate predictions. We then further analyze the impact of the different modules in R-RSSM. The gap between +STO and −STO underscores the benefit of accounting for stochastic factors in user responses, which enables the model to better capture the dynamics of user responses. The difference between +REC and −REC demonstrates that preserving key information yields a more precise characterization of user response patterns. The distinction between +KL and −KL confirms that the constraint of the KL divergence loss leads the model to imagine a distribution of $\hat{z}_t$ closer to $z_t$, aiding multi-step predictions.
Average cumulative reward. Fig. 3(b) illustrates the impact of the different settings on the average cumulative reward. The difference between +MULTI and −MULTI reveals that the multi-step approach is better suited to recommender tasks where rewards are relatively sparse. Meanwhile, the distinctions between the other + and − variants demonstrate that our DUQN framework achieves superior performance when it obtains more accurate predictions of user responses.
Training speed. Fig. 3(c) displays the difference in training speed between the RetNet-based R-RSSM and the GRU-based RSSM. It is evident that, owing to RetNet's parallel computation capabilities, there is a significant increase in training speed, showing that using RetNet to accelerate computation is effective. We also investigate the impact of ±MULTI on the training speed and observe that, owing to the simplification of $\rho_{t:t+n-1}$ in Eq. (21), there is no significant difference between the two.
The performance gaps in Tab. 1 between all + and − configurations validate the framework's design, suggesting that each component (stochasticity, KL loss, reconstruction, multi-step learning, and RetNet) contributes to the overall performance.

4.3.3 Online Results. Fig. 4 illustrates the ΔCTR, ΔCPM, ΔRPM, and ΔGMV over a consecutive period of 21 days on the Meituan Food Delivery Platform during the online A/B testing period. We find that our method effectively enhances platform revenue and user experience, as evidenced by increases of 1.04%, 1.01%, 0.25%, and 1.05% in ΔCTR, ΔCPM, ΔRPM, and ΔGMV, respectively. The elevation in CTR indicates heightened user engagement with the recommendations provided by our model, which in turn increases overall advertising revenue and platform transaction volume. We also observed both models continuously for 21 days and calculated their average time consumption. To compare the impact of RetNet on reducing online latency, we additionally collected online latency data for the RSSM model. In Tab. 2, we compare the online inference speed of our R-RSSM model with that of DPIN-A and find no significant difference. The difference in online inference speed between the R-RSSM and RSSM models demonstrates the effectiveness of our latency optimization strategy. Considering the substantial benefits introduced by the new model, the slight increase in latency is deemed acceptable.

4.4.2 Baselines. We use HAC [23] as our backbone, and Fig. 5 shows how our method is utilized within the A-C framework.
• DDPG-RA. An advanced Deep Deterministic Policy Gradient (DDPG) framework incorporating action representation learning [1].
• HAC. The current SOTA RL baseline on the RL4RS dataset: a comprehensive Hyper-Actor-Critic framework that bridges the hyper-action and effective-action spaces using a scoring function, a critic network, and an inverse mapping module, ensuring optimal recommendation lists from a vast item pool [23].

To maintain control over variables and facilitate comparison, we adopt the methodology in [23] to train a user response model that simulates online user interaction. The reward for the current recommendation results is then computed based on the predicted user response.

Evaluation. We split all datasets 75%/25% for training and evaluation based on timestamps. The online simulation environment is pretrained on the training set and on the entire dataset for subsequent evaluation: recommendation policies are trained in the first environment and assessed in the second. We employ a reward discount of $\gamma = 0.9$, cap the interaction depth at 20, and observe that all RL methods stabilize within 50,000 iterations. We measure Total Reward and Depth: the former represents platform revenue, while the latter reflects user experience. Higher values of Total Reward and Depth signify better performance. Results are averaged across user sessions.
4.4.5 Result. For every task, the recommender system aims to optimize long-term satisfaction, measured by the cumulative reward and mean depth. From Tab. 3, it is clear that our method surpasses the best baseline, registering significant improvements in both overall reward and user experience, with gains of 15.53% and 11.66%, respectively. The online testing results in Fig. 6 further show that, compared to HAC, our method exhibits stronger stability in online testing: the mean of the variance of the average cumulative rewards is 1.80, less than HAC's 3.18. This demonstrates that our method is also applicable to recommender tasks with a continuous action space.

CONCLUSIONS
In this paper, we propose DUQN, which integrates user response modeling into a DQN-based recommender agent. The RSSM is employed to predict user responses and capture users' long-term preferences and short-term stochastic factors. Furthermore, a RetNet version of RSSM is designed for latency optimization, and multi-step predictions based on R-RSSM further improve the RL policy. Offline and online experiments demonstrate our solution's superior performance and efficiency. We also share practical lessons and industrial experience in user modeling, computational considerations, and related aspects.

Figure 2: Structure of DUQN. After generating the corresponding feature vector in the representation module, the R-RSSM predicts the user response probability distribution on the target page. Then DUQN outputs the response-specific Q-values $Q_o$, $Q_c$, $Q_p$, $Q_l$, which are summed according to the probabilities of the user response model $\pi^u$ to obtain the final Q-value $Q_{\cdot \mid \pi^u}$.

Figure 3: Experimental results of the ablation study.

4.3 Online Deployment on Meituan Platform
To further assess real-world performance, we conducted an online A/B test by randomly assigning 20% of users to our DUQN model and another 50% of users to the current SOTA model DPIN-A. This test took place from September 5, 2023, to September 25, 2023, spanning a 21-day period. Our model consistently outperformed DPIN-A during the 21 days, and it has now been deployed for the full user base (more than 500 million customers).

4.3.1 Online Serving. In the appendix, we outline the online serving process of our method. The agent determines the action that offers the maximum Q-value based on the current state and the R-RSSM model, then converts this action into a set of ad slots for display. As the user scrolls down, the model captures the state of the next screen, enabling informed decision-making based on the information shown there. As real user responses are unknown at serving time, we use the predicted $\pi^u(o_t \mid s_t, a_t)$ instead.
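A schematic numpy sketch of this serving-time action selection is shown below; `response_model`, `q_network`, and the enumerated candidate actions are placeholders standing in for the deployed R-RSSM, the Q-network, and the feasible slot configurations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_responses = 6, 4

def response_model(state, action):
    """Placeholder for the R-RSSM's predicted pi^u(o | s, a)."""
    return rng.dirichlet(np.ones(n_responses))

def q_network(state, action):
    """Placeholder for the response-specific Q-values Q(s, a, o)."""
    return rng.random(n_responses)

def select_action(state, candidate_actions):
    """Serving: pick the action with the highest response-averaged Q-value."""
    scores = [response_model(state, a) @ q_network(state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

state = "current_screen_state"                 # placeholder state
candidate_actions = list(range(n_candidates))  # enumerated slot configurations
best = select_action(state, candidate_actions)
```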

Figure 5: DUQN within the A-C framework.

Figure 6: Online testing logs for DUQN and HAC.
State $s_t$ transitions to the next state $s_{t+1}$ upon the agent taking an action $a_t$ and the user giving a response $o_t$. When the user scrolls to the first item on the subsequent page, state $s_t$ transitions to the next state $s_{t+1}$; if the user leaves, the transition terminates. Therefore, the transition probability can be described as $P(s_{t+1} \mid s_t, a_t, o_t)$.
• Discount factor $\gamma$. The discount factor $\gamma \in [0, 1]$ maintains the balance between short-term and long-term rewards.
Note that the user's response $o_t$ affects the state transition $P(s_{t+1} \mid s_t, a_t, o_t)$ and the environmental reward function $R(s_t, a_t, o_t)$, assuming the user's dynamics follow a stationary policy $\pi^u(o_t \mid s_t, a_t)$.
The agent uses varied state-action inputs (i.e., context features, user profile features, page-level historical response sequences, and a target page originating from the candidate items and the action) for the representation module. The representation module first generates the page-level representation $e^p_t$, the context feature embedding $e^c_t$, and the user profile feature embedding $e^u_t$. Then the RSSM model is used to obtain the user response probabilistic model on the target page, $\pi^u(o_t \mid s_t, a_t)$, and to reconstruct the representation features $\hat{x}_t$ from $x_t = (e^p_t \,\|\, e^c_t \,\|\, e^u_t)$.

Table 1: Performance of various models on Revenue Indicators and Experience Indicators. Results are displayed as mean (± standard deviation). Improvement denotes the enhancement of our method compared to the top-performing baseline.