Towards Psychologically-Grounded Dynamic Preference Models

Designing recommendation systems that serve content aligned with time-varying preferences requires properly accounting for the feedback effects of recommendations on human behavior and psychological state. We argue that modeling the influence of recommendations on people's preferences must be grounded in psychologically plausible models. We contribute a methodology for developing grounded dynamic preference models. We demonstrate this method with models that capture three classic effects from the psychology literature: Mere-Exposure, Operant Conditioning, and Hedonic Adaptation. We conduct simulation-based studies to show that the psychological models manifest distinct behaviors that can inform system design. Our study has two direct implications for dynamic user modeling in recommendation systems. First, the methodology we outline is broadly applicable for psychologically grounding dynamic preference models. It allows us to critique recent contributions based on their limited discussion of psychological foundations and their implausible predictions. Second, we discuss implications of dynamic preference models for recommendation system evaluation and design. In an example, we show that engagement and diversity metrics may be unable to capture desirable recommendation system performance.


INTRODUCTION
Recommendation systems mediate interactions between people and content in the digital world. They attempt to surface content that will be consumed and that aligns with individuals' preferences.
In much of recommendation systems research, preferences are implicitly or explicitly assumed to be static throughout time and not altered by recommendation. While useful in many contexts, this assumption has drawbacks. For example, static preferences in user models may account for the poor generalization performance of offline metrics in predicting deployed recommendation performance [22,31,38,53]. Furthermore, dynamic preferences have been central to arguments about negative impacts of recommendation, such as polarization, filter bubbles, and extremism [6,23,33,54].
Fig. 1. The methodology derived in this article. First, formulate a psychological theory that should ground a particular effect. Then, formalize it, and discipline it with predictions derived from simulations and with data from deployed recommendation systems. This article exemplifies this methodology for three effects: the Mere-Exposure Effect, which we introduce using Goetzinger's classical experiment, cited in [60]; Operant Conditioning, which we introduce using Skinner's experiments [55]; and Hedonic Adaptation, which we introduce using Brickman's study of happiness after extreme life events [4]. The classical experiments are depicted in vignettes on the right.
This study considers the design and validation of dynamic user preference models. We propose that behavioral models should be designed with at least two desiderata.
First, they should be grounded in experimental psychological evidence. A wealth of research in experimental psychology identifies several such behavioral patterns, or psychological effects. We contribute a methodology to leverage these results and design psychologically plausible dynamic preference models. We demonstrate this method by designing models to capture three established findings about preference formation and shift in humans: Mere Exposure [11,16,25,49], where simply experiencing a stimulus (e.g., a piece of content) tends to make humans view it more positively; Operant Conditioning [15,20,26], where preferences shift towards actions that are associated with "positive feedback" (e.g., consuming "better than expected" content); and Hedonic Adaptation [9,46,59], where satisfaction levels (e.g., with content) return to a baseline after a period of time.
Second, models should be verified in simulation for plausibility. For any given effect, there are many ways it could be formalized mathematically. Simulations of user-recommender dynamics can help with these modeling choices.
Preference representations should produce plausible predictions across a range of recommender system designs and initial conditions. For example, a model where behavior shifts to consuming a single type of content with probability close to one is unlikely to match real observations. Combining predictions from simulations with background from psychology feeds back into improved user modeling, and hence into recommendation system evaluation and design. This may help address the on-/offline gap in evaluation as well as increase the credibility of the theoretical study of societal impacts of recommendation.
The contributions of this article are threefold: (1) We consider three case studies of behavioral effects due to consumption established in psychological research.
We review the relevant literature on these effects with a focus on their applicability to recommendation systems.
We propose a modeling framework to mathematically formalize these effects. Finally, using simulations, we characterize qualitative properties of our proposed models and derive testable predictions that can be used to test the validity of a particular mathematical formalization independently of the applicability of a behavioral effect.¹ (2) We synthesize the case studies into a broadly applicable methodology for psychologically grounding dynamic preference models. We propose a multi-step procedure which first calls for explicitly stating, and providing evidence in support of, a psychological effect. Following this, one iterates by proposing a mathematical formalization of the behavioral model, deriving testable predictions, and comparing them against data relevant to the recommendation context. The proposed methodology is depicted in Figure 1.
(3) Finally, we discuss two implications of this study for recommendation system evaluation and design. First, we critique recent contributions based on their limited discussion of psychological foundations and their implausible predictions. Second, we demonstrate in an example that recommendation system metrics such as engagement and diversity may be unable to capture desirable recommendation system performance in the presence of dynamic user preferences.

Outline
The rest of this article is structured as follows. In section 2, we position our work in the related literature. We then derive and exemplify the methodology proposed in this article. In section 3, we consider three psychological effects, and review the psychological evidence most relevant to recommendation systems contexts. We mathematically formalize these three psychological effects in section 4. We present experimental results in section 5. Section 6 contains discussion. In subsection 6.1, we formalize the methodology used in this study and discuss how its application may help to refine existing models in the recent literature. In subsection 6.2, we illustrate with an example the brittleness of commonly used metrics in the design and evaluation of recommendation systems. We collect avenues for future work in section 7.
Proofs, additional simulations, and code to reproduce our simulations can be found in the supplementary materials.

RELATED WORK
Evaluation. First, we relate our work to the literature on recommendation system evaluation. Preference changes in response to recommendation may affect the evaluation of recommendation systems. The main design paradigm of recommendation systems is to use offline train-test splits to design and select recommendations. Several papers argue that offline metrics are poor indicators of online performance [1,22,31,56]. Often, the lack of external validity of current evaluation methodology is attributed to unmodeled feedback dynamics between users and recommenders [38,53], which further motivates the need to study dynamic preference models.
Societal Impacts. Well-founded dynamic preference models can help resolve apparent contradictions among findings on the societal impacts of recommendation systems. On one side, several studies claim that the dynamic interaction of recommendation systems with users may lead to polarization [12,54], filter bubbles [23,50], homogenization [6], echo chambers [14,32,48], and extremism [33,45,52]. Empirical audits, however, find that algorithmically recommended content is more diverse than natural consumption [47], and that real systems do not exhibit the strong extremism or polarization effects implied by theoretical models [28,39,40,43,57]. The study [5] points out that recommendation systems might lead to undesired changes in preferences, and proposes to design for safe preference shifts, that is, preference trajectories that are deemed "desirable".

Dynamic Preference Models. Some existing dynamic preference models assume that preferences change independently of the recommendations. Examples are Dynamic Poisson Factorization (DPF) [7,27], which assumes that users and content items have latent representations which evolve as Gaussian random walks, and [37], which models behavioral changes as external concept drifts that affect item quality and average user rating levels. Other works consider the feedback loop between recommendation systems and users directly: [54] models 1-dimensional opinion dynamics on a sphere, while [33] and [35] propose models of behavior shift due to content exposure and content consumption, respectively. Each of these theoretical contributions predicts polarization and extremism of user preferences.

¹ While the models we present are classical in psychology, we do not claim that these models will fit empirical data in recommendation systems well. Our mathematical formalization of the theories is particularly parsimonious, formulating preference shifts as changes to a vector of user factors. This parsimony, however, simplifies away several potentially psychologically relevant factors, which we do not claim to exhaustively model. First, our models do not differentiate between exposure to and consumption of content. Furthermore, they assume that preferences are purely evolving rather than formed, i.e., users have preferences even for unseen content. In addition, the models are formulated abstractly, as opposed to for a concrete application domain, suppressing features that might make different psychological effects particularly salient in different domains. Our models are exemplary for the application of a methodology, and will need empirical validation before being used to model interactions in real systems. The selection of good models, we claim, nevertheless benefits from the inclusion of psychological grounding and extensive testing.
Psychology-Informed Recommendation. Recent surveys [30,41] cover ways in which recommenders incorporate and account for psychological effects. The work by Lex et al. [41] focuses mainly on ways in which affective, cognitive, and personality factors impact user engagement. In contrast to our work, it does not cover the feedback loop between a recommendation system and user preferences. [30] reviews psychologically-informed recommenders for behavioral priming and digital nudging; it points out a lack of research into the psychological foundations of dynamic preference models.

PSYCHOLOGICAL EFFECTS
This section introduces three psychological effects: Mere-Exposure, Operant Conditioning, and Hedonic Adaptation.
For each effect, we first introduce a classical example of the effect, give general references, and then describe the experiments most relevant to recommendation systems. While we find strong connections to recommendation for the Mere-Exposure and Hedonic Adaptation effects, the experimental evaluation of the third effect, Operant Conditioning, allows for a less straightforward connection to recommendation systems. As a running example in this section, we consider Alice, a user who initially does not like sports content but is exposed to it by a recommendation system.

Mere-Exposure
Mere-Exposure, or the familiarity effect, says that humans tend to like more what they are exposed to more often. A classical experiment, cited in [60], is the Black Bag Experiment. In 1968 at Oregon State University, C. Goetzinger let a person fully covered with a black bag participate in a course. The other students in the course were hostile at first, but later became friendly towards the person covered with the bag. In a recommendation system context, Mere-Exposure may mean that reactions to content become more favorable after exposure to and/or consumption of similar content. In the case of Alice, whose initial preferences do not favor sports, a Mere-Exposure effect would predict that with repeated exposure or consumption she becomes more familiar with sports content and starts to appreciate it more.
Mere-Exposure effects are well-established in psychology. [3] conducts a meta-study of 200 studies of research in the first three decades following the effect's introduction in [60]. Much of the research on Mere-Exposure effects uses experimental setups in which subjects are exposed to non-meaningful stimuli, such as Japanese characters for non-Japanese speakers [13]. While this allows controlling for prior exposure to the content, such research is not directly meaningful for understanding the Mere-Exposure effect in recommendation systems.
Most relevant to recommendation systems is Mere-Exposure research on advertisement and audiovisual content.
An illustrative example of this kind is [25, Experiment 1]. In it, the experimenter repeatedly shows users images of fictitious, but plausible, products (e.g., a smoke filter) with different aspect ratios. Users' reported attractiveness ratings of products' aspect ratios increased significantly with the number of times they had been shown to the user. On average, each additional exposure led to a 0.2-point increase on a 7-point scale. (The experiments in [16,49] on exposure to banner ads and [11] on exposure to pictures of fashion designs made similar findings.) The study also found that "conspicuous" exposure, i.e., exposure that draws attention to the unfamiliarity of the product, leads to a smaller Mere-Exposure effect (a 0.03-point increase per additional exposure).

Operant Conditioning
Operant Conditioning is the effect that beings tend to engage more in activities that are associated with positive stimuli (positive reinforcement) and avoid activities associated with negative stimuli (negative reinforcement). In a classical experiment from [55], B.F. Skinner put a hungry rat into a box containing a food dispenser and a button. Food is released whenever the button is pressed; if the rat does not press the button, it gets a small electric shock. Over the course of the experiment, the rat presses the button more and more frequently. In recommendation systems, Operant Conditioning predicts higher engagement with content that was "surprisingly good" and lower engagement with content that was "surprisingly bad". If the initially sports-averse Alice reacts according to Operant Conditioning, then, depending on her baseline level, she might be underwhelmed by the sports content and dislike it more after being exposed to it; or, if her baseline is very low, she might be positively surprised and start liking it.
Operant Conditioning is well-documented in behavioral psychology, both in animals and in humans. Skinner [17,55] conducted several studies in animals which find evidence for Operant Conditioning. To our knowledge, the first study of Operant Conditioning in humans is [21], which conditioned a young man with a developmental disorder to raise his arm. We refer the reader to the monograph [10] for a treatment of Operant Conditioning and the behaviorist movement in psychology that followed.
While much of the work on Operant Conditioning is conducted in animal experiments, the work that most closely resembles recommendation systems is in consumer psychology. [18] (see the surveys [19,20] for follow-up work) introduced the Behavioral Perspectives Model, which classifies consumer choices according to several "reinforcers", which might depend on the product-specific, social, or monetary (shopping online is often costly) consequences of making consumption decisions. As an exemplary experiment, [15] considers the effect of externally provided reinforcers (e.g., shipping cost, shipping duration, or price) on consumer choice between two different online shops. [15] reports that a significant fraction of the subjects followed the (positive) reinforcements set by the experimenter. We note that interpreting Operant Conditioning for dynamic preference models is challenging. First, behaviorism treats the mind as a black box and studies behavior directly, which is why the effect is framed around behavior, not preferences. Our translation to preferences requires assuming that Operant Conditioning may lead to changes in preferences. Second, the definition of baseline we will consider in the quantitative model below depends on past preferences, which is closer to the literature on adaptation, e.g., the quantitative model of [26], than to that on Operant Conditioning.

Hedonic Adaptation
Hedonic Adaptation is the effect that, after some time, any change in happiness fades and humans return to a baseline level. In a classical study [4], P. Brickman asked lottery winners and paraplegics about their happiness and how they expected their happiness level to be in a year, finding that major life events had a negligible effect on their happiness.
In a recommendation context, Hedonic Adaptation means that engagement with content returns to a baseline level after some time. If Alice's "baseline self" does not like sports content, she will return to this baseline irrespective of the content recommended (as she also would if her "baseline self" liked sports). [34, ch. 16] reviews many earlier findings on human adaptation to repeated exposure to noise (inconclusive evidence) as well as to incarceration and increased income (supporting evidence).
The Hedonic Adaptation literature most relevant to recommendation systems studies consumption scenarios. [8]'s experiment lets students choose a sticker and attach it to an everyday object. Eliciting self-reported happiness with the sticker, the study finds a 4.5-point decline, on a 100-point scale, in reported happiness with the sticker over a 3-day interval.
[59] finds a decrease of between 1.75 points ("low sentimental value") and 7.75 points ("high sentimental value") on a 100-point scale for Google Image search results shown to users over 6 short intervals of 10 seconds. [46] plays songs and asks for reports of happiness with the song at different points in time. Again on a 100-point scale, the reported preference is reduced by 15 points from 10 seconds into the song to one minute into the song. Finally, [9] showed paintings to subjects in three 15-second exposure intervals. The exposure led to a reduction in happiness with the painting of about 12 points on a 100-point scale.
All of these studies have relatively short exposure times. However, the content types are comparable to those in many contemporary recommendation systems, and the demonstrated effects on self-reported happiness are quite substantial.

FORMALIZATIONS OF PSYCHOLOGICAL EFFECTS
In this section, we propose mathematical formalizations of Mere-Exposure, Operant Conditioning, and Hedonic Adaptation.We start with our basic notation.
At each round, a recommendation system recommends one of n pieces of content, each with an associated d-dimensional item vector. Throughout, we assume that the item vectors are fixed. Since the item representations are known and fixed, it is without loss of generality to consider a single user at a time. Denote by p_t ∈ R^d the preference vector of the user at time t. The user reacts to a piece of content with item vector q ∈ R^d according to a rating function, which we assume to be linear, as is common in collaborative filtering: r(p_t, q) = ⟨p_t, q⟩ + ε_t ∈ R for independent noise ε_t, on which we will make additional distributional assumptions below. At each time step t, the recommender chooses a piece of content with associated latent representation q_t.
In contrast to static recommendation system models, we model user preferences as changing due to content exposure and consumption: p_{t+1} − p_t = f(Hist_t) for a user-item history Hist_t = (q_1, r_1, q_2, r_2, . . . , q_t, r_t) and a potentially random function f. For a static user, f(Hist_t) is constant 0. Next, we propose quantitative models for the psychological effects introduced in Section 3.
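To make the notation concrete, the interaction loop can be sketched in code. This is a minimal illustration under our assumptions: the helper names (`rating`, `simulate`) and the Gaussian rating noise are ours, and the update function (f(Hist_t) in our notation) is a placeholder to be instantiated by the models below.

```python
import numpy as np

rng = np.random.default_rng(0)

def rating(p, q, noise_std=0.1):
    """Linear rating <p_t, q> plus i.i.d. Gaussian noise."""
    return p @ q + noise_std * rng.normal()

def simulate(p0, items, recommend, update, T=100):
    """Run T rounds of the user-recommender loop.

    recommend(hist) returns the index of the item to show;
    update(p, q, r, hist) returns the preference increment p_{t+1} - p_t.
    """
    p, hist = p0.copy(), []
    for _ in range(T):
        q = items[recommend(hist)]
        r = rating(p, q)
        hist.append((q, r))
        p = p + update(p, q, r, hist)  # p_{t+1} = p_t + f(Hist_t)
    return p, hist

# A static user has f(Hist_t) = 0, so preferences never change.
static = lambda p, q, r, hist: np.zeros_like(p)
```

The per-effect models below each supply one such `update` function.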

Mere-Exposure
For Mere-Exposure, we consider the linear model p_{t+1} = p_t + γ_ME (q_t − p_t) for some γ_ME ∈ [0, 1]; compare Figure 2a. Whenever an item q_t is shown, the preference vector moves a γ_ME-fraction along the line between p_t and q_t. This captures the idea that whenever users are exposed to content, this makes them like this content more.
Our model is parameterized by a single quantity, γ_ME, that determines how much a user moves in a certain direction. The strength of the preference movement might depend, given the psychology literature reviewed above, on the conspicuousness of the content exposure.
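A minimal sketch of this update (the variable names are ours):

```python
import numpy as np

def mere_exposure_update(p, q, gamma_me=0.1):
    """f_ME: move a gamma_me-fraction along the line from p_t to q_t."""
    return gamma_me * (q - p)

# Repeated exposure pulls the preference toward the shown item, so a user
# who initially dislikes an item ends up rating it positively.
p = np.array([-1.0, 0.0])  # initial preference: dislikes q
q = np.array([1.0, 0.0])
for _ in range(50):
    p = p + mere_exposure_update(p, q)
```

After enough exposures the preference essentially coincides with the item vector, matching the convergence to the repeatedly recommended item that we observe for the constant recommender in the experiments.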

Operant Conditioning
The space of possible models that capture Operant Conditioning is large, and we consider a particular parameterization, highlighting some of the qualitative features of positive and negative reinforcement. The model we consider here captures that users adjust their reaction to content that surprises them: the surprise is a function of the difference between a baseline level of engagement and the current rating. We model the expected engagement b_t as a discounted average of historical ratings, with decay parameter δ, and use an arctan function to map the difference to a bounded range. This yields a surprise term of the form S_t = arctan(r_t − b_t), and an update p_{t+1} = p_t + γ_OC S_t q_t that moves the preference toward q_t when the surprise is positive and toward −q_t when it is negative. The choice of an exponential decay is motivated in psychology and neuroscience; compare, e.g., habituation [44].
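The surprise-based update can be sketched as follows. This is one concrete instantiation; the exact functional form (in particular the normalization of the discounted average) is our illustrative choice.

```python
import numpy as np

def discounted_baseline(past_ratings, delta=0.9):
    """Expected engagement b_t: exponentially discounted average of past ratings."""
    if not past_ratings:
        return 0.0
    w = delta ** np.arange(len(past_ratings) - 1, -1, -1)  # recent ratings weigh most
    return float(w @ np.asarray(past_ratings) / w.sum())

def operant_conditioning_update(q, r, past_ratings, gamma_oc=0.1, delta=0.9):
    """f_OC: move toward q on positive surprise, away from q on negative surprise."""
    surprise = np.arctan(r - discounted_baseline(past_ratings, delta))
    return gamma_oc * surprise * q
```

Because the baseline lags behind the current rating, this update can positively reinforce an item for a while and then abruptly turn negative, which is the source of the oscillations seen in the experiments.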

Hedonic Adaptation
For Hedonic Adaptation, we propose a fairly simple model: a linear drift towards a constant baseline preference vector b, namely p_{t+1} = p_t + γ_HA (b − p_t) for γ_HA ∈ [0, 1]; compare Figure 2d. The update moves the user towards the (fixed) baseline preference b ∈ R^d. Note that this behavioral shift is irrespective of the recommended content q_t.
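A sketch of this drift (variable names are ours):

```python
import numpy as np

def hedonic_adaptation_update(p, b, gamma_ha=0.05):
    """f_HA: linear drift toward the fixed baseline b, independent of content."""
    return gamma_ha * (b - p)

# In isolation, the preference converges to the baseline regardless of
# what is recommended.
b = np.array([0.5, -0.5])
p = np.array([3.0, 3.0])
for _ in range(500):
    p = p + hedonic_adaptation_update(p, b)
```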
A shared property of all proposed models is that they do not lead to arbitrarily large preference vectors: consider the combined update of all three effects, where γ_ME + γ_OC + γ_HA ≤ 1. Then, for any sequence of recommendations, p_t is bounded (see proof in Appendix A).
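The boundedness property can also be checked empirically by combining the three updates. The sketch below uses our own choices for the combination rule, step sizes (chosen so that the γ parameters sum to at most 1), and a stand-in surprise term.

```python
import numpy as np

rng = np.random.default_rng(1)

def combined_update(p, q, surprise, b, g_me=0.1, g_oc=0.1, g_ha=0.1):
    """Sum of the three proposed effect updates, with g_me + g_oc + g_ha <= 1."""
    return g_me * (q - p) + g_oc * surprise * q + g_ha * (b - p)

b = np.zeros(2)                     # baseline preference
items = rng.normal(size=(20, 2))    # fixed item vectors
p = rng.normal(size=2)
norms = []
for _ in range(2000):
    q = items[rng.integers(20)]
    surprise = np.arctan(rng.normal())  # stand-in for the OC surprise term
    p = p + combined_update(p, q, surprise, b)
    norms.append(float(np.linalg.norm(p)))
```

Over the whole run the preference norm stays bounded, consistent with the proposition.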

EXPERIMENTS
This section presents simulation results for the formalizations f_OC, f_ME, and f_HA of Operant Conditioning, Mere-Exposure, and Hedonic Adaptation. We use our results to compare against the psychological evidence.

Experimental Setup
User behavior. We sample n i.i.d. item vectors from a multivariate normal distribution, q_i ∼ N(0, Σ). The initial user preference p_0 is sampled from the same distribution.
The user observes the recommended item and responds with a rating r(p_t, q_t) = ⟨p_t, q_t⟩ + ε_t. As a result of exposure to and/or consumption of the content, the user preference updates to p_{t+1} = p_t + f(p_t, q_t) + ε'_t, where ε'_t ∼ N(0, Σ') is zero-mean stochastic noise applied to the preference dynamic. We add noise to avoid unstable equilibrium states.
Preference Estimation. We make recommendations based on an estimate p̂_t of the user preference. We initialize the estimate with a random multivariate normal vector, p̂_0 ∼ N(0, Σ). Given a recommended item q_t and an observed rating r_t, we update the preference estimate p̂_t according to Online Gradient Descent (OGD) [24] on the rating-prediction error. In Appendix B we repeat our experimental setup in the oracle model, in which the recommender has direct access to preferences. We find that the qualitative insights from this section hold both in the oracle and in the estimation model; thus the dynamic patterns that we observe are primarily driven by behavioral shift rather than by estimation error.
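The estimation step can be sketched as follows, assuming OGD on the squared rating-prediction error (the squared loss and the learning rate are our assumptions for illustration):

```python
import numpy as np

def ogd_step(p_hat, q, r, lr=0.02):
    """One OGD step on the squared error (r - <p_hat, q>)^2."""
    grad = -2.0 * (r - p_hat @ q) * q
    return p_hat - lr * grad

# For a static user (fixed p_true, noiseless ratings), the estimate
# recovers the true preference vector.
rng = np.random.default_rng(2)
p_true = np.array([1.0, -2.0])
p_hat = rng.normal(size=2)          # random initialization of the estimate
items = rng.normal(size=(10, 2))
for _ in range(3000):
    q = items[rng.integers(10)]
    p_hat = ogd_step(p_hat, q, p_true @ q)
```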
Item Selection. We consider three baseline recommenders and a softmax selection rule with different temperatures. (1) Baselines: uniform selection chooses an item uniformly at random from the set of items; constant selection repeatedly selects the same item for all recommendation rounds; greedy selection picks the item with the maximum predicted score, q*_t := arg max_i ⟨p̂_t, q_i⟩, breaking ties randomly. (2) Softmax selection: given predicted scores ŝ_i = ⟨p̂_t, q_i⟩ for i = 1, . . . , n, a softmax selection rule with parameter τ selects item i with probability proportional to exp(τ ŝ_i). In each of the following illustrations of preference trajectories, we will depict the long-term preference distribution using a cloud, while the first moves of the preferences are depicted by connected dots.
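A sketch of the selection rules. In the convention used here (our reading, matching the experiments below), τ acts as an inverse temperature: larger τ concentrates probability on high-scoring items, and τ = 0 recovers uniform selection.

```python
import numpy as np

def softmax_select(p_hat, items, tau, rng):
    """Pick item i with probability proportional to exp(tau * <p_hat, q_i>)."""
    z = tau * (items @ p_hat)
    z -= z.max()                      # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(items), p=probs))

def greedy_select(p_hat, items, rng):
    """Pick the item with maximal predicted score, breaking ties randomly."""
    scores = items @ p_hat
    best = np.flatnonzero(scores == scores.max())
    return int(rng.choice(best))
```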

Mere-Exposure. Under this dynamic, users move in the direction of the recommended content, irrespective of how highly they rate it. Figure 3 displays user trajectories in a 2-dimensional preference space for f_ME with γ_ME = 0.1.
In the case of uniform recommendations, the preferences converge to a ball centered at the origin. For the greedy and constant baseline recommenders, the user preference converges to the latent representation of the item that is repeatedly recommended. With the softmax selection rule, we observe that the long-term distribution of preferences traverses the item space over time. Under selection rules that favor exploration, for instance softmax(τ = 1), the preference vector moves faster around the item space, yet stays closer to the origin, compared to selection policies that favor expected engagement, e.g., softmax(τ = 3).
Figure 4 shows how the γ_ME parameter affects the engagement of the user, the magnitude of their preference, and the diversity of their content consumption, operationalized as the entropy of the distribution of recommended content.
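This diversity metric can be computed directly from the log of recommended item indices (a sketch; measuring entropy in nats is our choice of unit):

```python
import numpy as np

def recommendation_entropy(item_ids, n_items):
    """Shannon entropy (nats) of the empirical distribution of recommended items."""
    counts = np.bincount(np.asarray(item_ids), minlength=n_items)
    probs = counts / counts.sum()
    nz = probs[probs > 0]             # drop never-recommended items (0 log 0 := 0)
    return float(-(nz * np.log(nz)).sum())
```

A constant recommender attains entropy 0, while uniform recommendation over n items attains the maximum, log n.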
We first observe that engagement may increase due to Mere-Exposure (see the increase of engagement for constant τ and increasing γ_ME). We further note that a recommender with τ = 5 attains very high engagement, in particular with a strong Mere-Exposure effect. This raises the question of whether engagement is a valid metric for users with dynamic preferences. We will consider this question further in subsection 6.2.
Here we derive the testable prediction that, for softmax selection rules and Mere-Exposure dynamics, the estimates of the preference vector stay relatively constant over time, and the magnitude of the preference increases with the parameter τ.

Operant Conditioning.
When the preference shift is governed by Operant Conditioning, we observe that the norm of the preference vector oscillates. This phenomenon can be explained by the type of reinforcement that the Operant Conditioning update f_OC induces. It is illustrative to analyze the behavior under the greedy recommender. In the beginning, the user is served a recommendation to which she responds positively: the surprise term is positive since the expected engagement is 0, and thus the user preference shifts in the direction of the item. With the preference moving towards the item, its score increases and the greedy selection picks it again. The positive reinforcement of preferences from the previous round ensures that the surprise term is still positive, leading to further amplification of the preference in the direction of the item q. Eventually, given the increases in the expected engagement from previous rounds, the expected surprise goes to 0. At this point, any noise in the response can make the surprise term negative, sending the preference vector in the direction of −q. As −q is nearly diametrically opposed to the preference at this time, even a small negative surprise considerably decreases the magnitude of the preference vector. However, since the movement is confined to the direction of the original preference vector, the greedy selection keeps recommending the same item. As the preference decreases in magnitude, so does the engagement of the user. But expected engagement is a lagging metric, so the surprise term becomes even more negative, creating a downward spiral which ends with the user getting completely bored and losing their preference (p_t ≈ 0). After enough time steps, the historical expectation decreases enough that some other direction (by random chance) has a positive surprise term, commencing again the amplification of the preference in that direction.
The softmax selection rules are less extreme versions of the greedy recommender, yet many of them still show oscillatory patterns. The period and amplitude of these oscillations depend on the softmax parameter τ and on the decay parameter δ in the Operant Conditioning update: the larger the τ, the larger the amplitude of the preference swings; the lower the δ, the shorter the period. Compare Figure 6.
The oscillations seen in the simulations are testable predictions of Operant Conditioning (note that Figure 6 shows estimated scores, not unobserved user preferences). The review of the psychological literature in section 3 did not show evidence of such oscillatory patterns in consumption under Operant Conditioning, which makes testing this model with data from a deployed recommender particularly important.

Hedonic Adaptation.
Pure Hedonic Adaptation leads to convergence to the baseline point, and does so independently of the recommendation policy. In combination with other effects, Hedonic Adaptation limits the amount by which the other effects are perceived. For example, when combined with Mere-Exposure (Figure 7a), Hedonic Adaptation provides a strong drift towards the baseline preference, and thus the user preference travels "less" within item space. In joint dynamics with Operant Conditioning (Figure 7b), user preferences still oscillate, but the oscillations are limited to the direction of the baseline preference. In both cases, Hedonic Adaptation biases, but does not overwhelm, the dynamics observed for Mere-Exposure (traveling through preference space) and Operant Conditioning (oscillations).

Fig. 6. Magnitude of Scores in Higher Dimensions (d = 8, 5000 rounds); panel (a) shows the baseline recommenders (uniform, constant, and greedy selection rules). Across behavioral models, higher-engagement selection policies (high τ) correspond to more extreme oscillations. When OC effects are large (γ_OC = 0.1), the discount factor δ has a significant impact on the period of oscillations. Lower δ corresponds to a recency bias, where older ratings play a diminished role in forming baseline expectations for engagement, and thus leads to oscillatory patterns of higher frequency.
Biases towards part of the item space are testable predictions of f_HA in combination with other behavioral models.

Using Testable Hypotheses to Critique Dynamic Preference Models
The testable predictions of Mere-Exposure and Operant Conditioning are quite distinct when the models interact with recommenders: f_ME predicts that user ratings will be fairly constant even if recommended content changes over time, whereas f_OC predicts oscillatory patterns in the ratings. These predictions may, or may not, be in line with findings in psychology or with observations from deployed recommendation systems. Both types of checks allow for further refinement of user behavioral models.
First, the qualitative behaviors found may be challenged, potentially motivating other quantitative models. The oscillation pattern we observed for our quantitative model of Operant Conditioning, f_OC, is, to our knowledge, not known in psychology. This might be because softmax recommendation is a novel type of repeated stimulus unknown in psychology, or it might point to a weakness of the model we proposed. Validating such models on deployed systems can help select or reject particular formalizations.
Further refinements of the functional forms are also possible. Under f_OC, negative reinforcement is stronger than positive reinforcement, as our experiments in this section showed. User preferences in recommendation systems might not exhibit this asymmetry, which would motivate new models of Operant Conditioning, e.g., decreasing the slope of the surprise term when the surprise is negative. Similarly, the fact that the preference lingers around 0 for several time periods suggests that our formulation of users' surprise over-emphasizes early exposure. This suggests a further refinement of our model: reducing the discount factor δ.

DISCUSSION AND IMPLICATIONS
This article proposes psychologically grounded models and derives their implications. We first discuss our approach to grounding models and situate it within a broader methodology. We then further investigate our observation that user models may affect recommendation metrics and design.

A Methodology for Psychologically Grounded Dynamic Preference Models
The approach taken in the current article followed several steps; compare Figure 1.
Statement. Declare a psychological effect and review relevant psychology literature (section 3).
Formalization. Formalize the effect within a recommendation system model (section 4).
Predictions. Inspect properties of the model using theory, simulations, or a combination of the two to derive testable predictions (section 5).
Data. Test the predictions of the models against historical and/or interventional data in a deployed recommendation system (not performed for this study).
Predictions and data may be used to refine the formalization chosen for a particular effect (subsection 5.3).
In addition to applying the proposed methodology to modeling concrete effects, it may be used to reconsider models proposed in the recent literature on dynamic preference models in recommendation systems. We next give three examples of such a discussion.
Example 6.1. In [51] the authors consider and evaluate several dynamic user models. The authors write that "an exponentially weighted moving average [. . .] distribution is obtained within each time window". The paper considers an update of the form u_{t+1} = (1 − w) u_t + w v_t, where the item vectors v_t are modeled as unit vectors with respect to music genre, and the user factor u_t is a probability vector encoding the likelihood that the user will consume content from each genre. While this model is functionally equivalent to Mere-Exposure, [51] presents the model in a purely mathematical description.
Our review of Mere-Exposure might, for example, allow making the model context-adaptive: if users are detected as listening in the background (conspicuous Mere-Exposure), the Mere-Exposure strength is increased.
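A minimal sketch of this convex-combination update makes the equivalence to Mere-Exposure concrete; the mixing weight w is a hypothetical value, not one reported in [51].

```python
import numpy as np

def ewma_update(u, v, w=0.2):
    """Exponentially weighted moving average over consumed item vectors.

    Functionally the same convex-combination form as the Mere-Exposure
    update: the user factor u moves a constant fraction w toward the item
    vector v. The weight w = 0.2 is an illustrative choice.
    """
    return (1 - w) * u + w * v

# A probability vector over genres stays a probability vector under this
# update whenever the item vectors v are one-hot genre indicators.
u = np.array([0.5, 0.3, 0.2])
v = np.array([1.0, 0.0, 0.0])  # one-hot genre indicator
u = ewma_update(u, v)
print(u, u.sum())
```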
Example 6.2. The authors of [5] assume a user model based on a theory of chosen preferences [2]. The authors of [5] describe [2]'s (metacognitive) model as: "on a high-level, at each timestep users choose their next-timestep preferences to be more 'convenient' ones-ones which users expect to lead them to higher engagement value." While [2] gives examples of how their model explains several behavioral effects (conformism, closed-mindedness, and sour grapes-the psychological effect that unattainable things are less liked), among others, [5] does not discuss whether this effect, and the particular cognitive model, is relevant in a recommendation system. Explicit psychological models would have allowed identifying parameter ranges for which users following [2]'s theory resemble Mere-Exposure, or another psychological effect.

Example 6.3. Having structural models allows critiquing the precise formulation of a psychological effect. [33, 35] study preference dynamics in closed-loop feedback with recommendations and argue theoretically and via simulations that recommendation systems lead to amplification of preferences and consequently to radicalization of users. In [33] the authors model the effects that repeated recommendation of an item has on preferences. They propose a model which bears similarity to our Mere-Exposure model and conclude that recommendations lead to unbounded preferences. As unbounded interest in a certain type of content is not a plausible prediction, one can critique the proposed user drift dynamics. [35] proposes a model akin to Operant Conditioning and argues for the extremization of user preferences by proving divergence of the preference estimates. The structural model of preference update is formalized in such a way that the estimated preference vectors impact the true preferences directly, which might not capture the correct causal relationship between recommendations and user preferences.

Evaluation and Design for Dynamic Users
In this section we consider how dynamic preference models may affect recommendation system evaluation metrics and design. This is based on the observation in section 5 that softmax recommenders were able to attain high levels of both engagement and diversity. We illustrate with an example that, in dynamic settings, a recommendation algorithm that improves both engagement and diversity might have unintended consequences.
We will measure engagement as the average rating over time and diversity as the entropy of the normalized counts of the consumed items. When users have static preferences, the softmax recommender is known to make the optimal trade-off between the expected ratings and the entropy of item selection probabilities; compare, e.g., [29].

Theorem 6.4. For any finite set of items x_i, i = 1, 2, . . . , n with ratings r_i, the recommendation distribution (p_i)_{i=1,2,...,n} that maximizes entropy, − Σ_{i=1}^n p_i ln(p_i), subject to an engagement lower bound, E_{i∼p}[r_i] ≥ c, is softmax(βr) for some β ∈ R.

We show that the optimality of softmax no longer holds in dynamic settings by proposing an algorithm which deliberately limits the availability of content, yet outperforms softmax both in terms of diversity and engagement. Persistent softmax is sub-optimal for static preferences but performs strictly better in both content diversity and engagement when preferences are dynamic.
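Theorem 6.4 can be spot-checked numerically: no sampled distribution meeting the softmax policy's attained engagement level should exceed its entropy. The rating vector, temperature, and sample count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([1.0, 0.5, 0.2, -0.3])            # illustrative item ratings
beta = 1.5                                     # illustrative softmax parameter
p = np.exp(beta * r) / np.exp(beta * r).sum()  # softmax recommendation distribution

def entropy(q):
    """Shannon entropy of a probability vector (natural log)."""
    return -np.sum(q * np.log(q))

engagement = p @ r  # the engagement level c attained by the softmax policy

# Theorem 6.4 says no distribution meeting this engagement bound has higher
# entropy than softmax; we spot-check against random points on the simplex.
best_other = -np.inf
for _ in range(20000):
    q = rng.dirichlet(np.ones(4))
    if q @ r >= engagement:
        best_other = max(best_other, entropy(q))
print(entropy(p), best_other)  # softmax entropy weakly dominates the samples
```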
Figure 8 shows the application of a recommendation policy that "nudges" user preferences to shift particularly strongly. The recommendation policy used, benchmarked against the statically optimal softmax policy for different temperatures, restricts softmax selection with an indicator function (2). One might call (2) a persistent recommender, or a recommender with momentum, that only recommends content lying in the half of the item space toward which the user's preference moved over the last two steps. This recommender, by deliberately changing user preferences, allows for higher entropy of the consumed content.
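A policy consistent with this description can be sketched as a softmax whose support is restricted, via an indicator, to the half-space of recent preference movement. This is our reading of the prose rather than a reproduction of (2); the function name, temperature, and half-space test are all assumptions.

```python
import numpy as np

def persistent_softmax(scores, items, u_now, u_prev, beta=2.0):
    """Sketch of a 'persistent' (momentum) softmax recommender.

    Softmax over item scores, with an indicator restricting support to
    items aligned with the direction of recent preference movement.
    beta and the half-space test are illustrative assumptions.
    """
    direction = u_now - u_prev                       # recent preference movement
    mask = items @ direction >= 0                    # indicator: aligned half-space
    logits = np.where(mask, beta * scores, -np.inf)  # exclude the other half
    w = np.exp(logits - logits[mask].max())          # stabilized softmax weights
    return w / w.sum()

items = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
scores = items @ np.array([0.6, 0.2])  # inner-product ratings for a toy user
p = persistent_softmax(scores, items, np.array([0.3, 0.1]), np.array([0.1, 0.0]))
print(p)  # zero probability on items opposite the movement direction
```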
Recommendation System Evaluation with Dynamic Preference Models. We see that the persistent softmax recommender seems preferable to vanilla softmax on both the engagement and diversity dimensions. However, this points to a potential gap between the goal of diversity and its operationalization as consumption entropy. The recommendation system deliberately changes users' preferences to increase the diversity of consumed content, which is arguably an undesirable property. Hence, metrics for recommendation systems that assume static users, such as ratings for engagement and entropy for diversity, might not capture long-term recommendation system health. In practice, algorithms that seem to improve both engagement and diversity of consumption might be more opaque than the "manipulative" recommender considered here, which requires making explicit trade-offs in metrics.
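The two proxies discussed here are straightforward to compute from an interaction log; a minimal sketch:

```python
import numpy as np
from collections import Counter

def engagement(ratings):
    """Engagement proxy: average rating over the interaction horizon."""
    return float(np.mean(ratings))

def diversity(consumed_ids):
    """Diversity proxy: entropy of the normalized consumption counts."""
    counts = np.array(list(Counter(consumed_ids).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

# Toy interaction log (illustrative values only).
print(engagement([0.8, 0.6, 0.7]))
print(diversity([0, 0, 1, 2]))  # below ln(3): repeating item 0 lowers entropy
```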

AVENUES FOR FUTURE WORK
We highlight two avenues for future work.
Estimation. This work gives examples of plausible models describing the effect of content consumption on user preferences and studies the qualitative properties of these models. The statistically efficient and scalable estimation of user models from empirical data is an important step for future work. Here, we distinguish between estimating the strength of posited behavioral effects in benchmark datasets and designing online experimental setups. Statistical estimation from historical feedback sequences of users falls under the category of (in the case of Mere-Exposure and Hedonic Adaptation, linear) dynamical systems learning; the collaborative estimation of item factors, however, might introduce significant additional complications. In such cases, the item factor estimates might be biased due to the dynamics of the user.
Tensor completion [42] and recent work in the econometrics of state dependence [58] are promising directions for such estimation. The contextual factors affecting the size of dynamic effects, e.g., conspicuousness for Mere-Exposure [36], are another important dimension for estimation. The design of online experiments requires either access to real content recommendation systems or careful use of small-scale proxy user studies that can test given behavioral hypotheses.
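As a toy instance of this estimation problem, one can simulate a Mere-Exposure trajectory (a linear dynamical system) and recover the effect strength by least squares. The effect strength, noise scale, and horizon below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate u_{t+1} = u_t + alpha * (x_t - u_t) + noise, then recover alpha
# from the observed (preference change, item-minus-preference gap) pairs.
alpha_true = 0.15  # illustrative Mere-Exposure strength
u = np.zeros(2)
deltas, gaps = [], []
for _ in range(500):
    x = rng.normal(size=2)
    x /= np.linalg.norm(x)                              # unit item vector
    u_next = u + alpha_true * (x - u) + 0.01 * rng.normal(size=2)
    deltas.append(u_next - u)                           # observed change
    gaps.append(x - u)                                  # regressor
    u = u_next

d = np.concatenate(deltas)
g = np.concatenate(gaps)
alpha_hat = float(g @ d / (g @ g))  # one-dimensional least squares
print(alpha_hat)  # close to the true 0.15
```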
Evaluation and Design. Our discussion in subsection 6.2 presented an example where, in the presence of dynamic user models, a recommender with potentially undesirable properties led to higher engagement and diversity proxies than softmax, which provably trades off these two metrics optimally for a static user. Evaluating recommendation systems and designing for metrics hence requires including behavioral models. Further studies that propose and investigate recommendation system metrics for users with dynamic preferences are another area for future research.

Fig. 2. Updates in preference space for Mere-Exposure, Operant Conditioning, and Hedonic Adaptation. Mere-Exposure moves preference vectors a constant fraction of the distance towards the recommended content; Operant Conditioning moves them either towards or away from it, depending on the direction and magnitude of surprise. Hedonic Adaptation leads to convergence towards a baseline rating.
u_{t+1} − u_t = f_OC(Hist_t) = γ |surp(Hist_t)| (sgn(surp(Hist_t)) x_t − u_t) for γ ∈ [0, 1]; compare Figure 2b for the case of positive reinforcement, sgn(surp(Hist_t)) = 1, and Figure 2c for the case of negative reinforcement, sgn(surp(Hist_t)) = −1. The magnitude of the preference shift is scaled by the size of the surprise term, and the direction of the change is determined by the sign of the surprise. If the surprise is positive, the preference moves in the direction of the item x_t. Conversely, if the surprise is negative, the preference moves towards −x_t.
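The Operant Conditioning update described here can be sketched in code. The symbol names (u for the preference, x for the item, gamma for the effect strength) are our labels for the quantities in the formula, and the numeric values are illustrative.

```python
import numpy as np

def oc_update(u, x, surprise, gamma=0.3):
    """Operant Conditioning step: u + gamma * |surp| * (sgn(surp) * x - u).

    Positive surprise moves the preference toward the item x; negative
    surprise moves it toward -x. gamma in [0, 1] scales the effect
    (gamma = 0.3 is an illustrative choice).
    """
    return u + gamma * abs(surprise) * (np.sign(surprise) * x - u)

u = np.array([0.0, 0.0])
x = np.array([1.0, 0.0])
print(oc_update(u, x, surprise=0.5))   # moves toward x
print(oc_update(u, x, surprise=-0.5))  # moves toward -x
```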

Fig. 4. Higher Dimensions (d = 8, T = 5000): Dependence of user engagement, preference magnitude, and diversity of consumption on the softmax parameter. (a) A higher-parameter softmax leads to higher engagement, which is exaggerated by stronger Mere-Exposure. (b) A very high softmax parameter (5.0) leads to preferences of high norm. (c) The entropy of consumed content may be high even for a moderately high softmax parameter (4.0).