Characterizing Manipulation from AI Systems

Manipulation is a common concern in many domains, such as social media, advertising, and chatbots. As AI systems mediate more of our interactions with the world, it is important to understand the degree to which AI systems might manipulate humans without the intent of the system designers. Our work clarifies challenges in defining and measuring manipulation in the context of AI systems. First, we build upon prior literature on manipulation from other fields and characterize the space of possible notions of manipulation, which we find to depend upon the concepts of incentives, intent, harm, and covertness. We review proposals on how to operationalize each factor. Second, we propose a definition of manipulation based on our characterization: a system is manipulative if it acts as if it were pursuing an incentive to change a human (or another agent) intentionally and covertly. Third, we discuss the connections between manipulation and related concepts, such as deception and coercion. Finally, we contextualize our operationalization of manipulation in some applications. Our overall assessment is that while some progress has been made in defining and measuring manipulation from AI systems, many gaps remain. In the absence of a consensus definition and reliable tools for measurement, we cannot rule out the possibility that AI systems learn to manipulate humans without the intent of the system designers. We argue that such manipulation poses a significant threat to human autonomy, suggesting that precautionary actions to mitigate it are warranted.


INTRODUCTION
Intelligent agents change their environments to further their objectives. When changing the environment amounts to altering the behaviour and mental states of other intelligent systems (such as humans), such change might be classified benignly as persuasion and nudging [75,159], or it might qualify as something less socially acceptable such as manipulation or coercion [121]. The capability and ubiquity of Artificial Intelligence (AI) systems have grown in recent years, in tandem with fears concerning the likelihood of humans falling victim to manipulative or coercive behaviours of AI agents that pursue the maximisation of narrow objectives [35,54,93,100,167].
While designers and operators might employ AI systems to help them manipulate other humans [45,118], we concentrate on the ways in which an AI system may itself engage in manipulative behaviors in the absence of explicit human intent. This distinction is not to say that AI-aided manipulation (e.g. disinformation campaigns) is unimportant: rather, our focus is motivated by the increasingly evident fact that systems exhibit capabilities that designers may not foresee or intend [26,38,170]. Moreover, intentional algorithmically aided manipulation has been extensively studied in prior research [45,157,178,180].
Manipulative behavior may be learned in practice by AI systems in several settings, even without the intention of their designers. AI systems are often trained to imitate human data which contains manipulative behaviors: for instance, language models trained on internet content seem to learn how to behave in both persuasive and manipulative ways [17,64,165]. Moreover, optimization based on human feedback of some form (e.g. approval, clicks, watch time, etc.) could be gamed by engaging in manipulation. For example, for a recommender system optimized to maximize user engagement, it could be optimal to nudge users into a lengthy video series, capitalizing on cognitive biases like the sunk cost fallacy [151], causing users to continue not out of genuine interest, but out of a sense of entrapment from their perceived time investment.
As it stands, there is limited literature regarding manipulation from AI systems. We believe there are two reasons for this state of affairs. Firstly, definitions of manipulation, even between humans, are fraught and the subject of ongoing debate [121]. Although there is some promising initial work for manipulation from AI systems, current notions of manipulation tend to be either too vague to be practically implementable, or they are challenging to generalize across domains [35,93,124]. Secondly, the testing of any putative definition is beset with difficulties. For example, with recommender systems, monitoring the impact of deployed systems in situ is challenging without the express permission of the system (and data) owners [141]. Since the conclusions of such research might be reputationally negative, this permission is rarely forthcoming. Even when one has internal access to models (as is the case with many language models), there is so far no broadly accepted methodology for demonstrating manipulation.
In this article, we characterize key components of manipulation from AI systems and clarify ongoing challenges. First, by connecting to the existing literature, we characterize manipulation in AI systems through four axes: incentives, intent, harm, and covertness. We discuss recent work to measure each axis as well as remaining gaps. Second, we compare and contrast manipulation with adjacent notions, such as deception. Third, we analyze the operationalization of manipulation in the context of recommender systems and language models. We then also discuss the regulation of manipulation.
We conclude by identifying future directions for the operationalization of manipulation according to our characterization. Given the difficulty of such a task, we underscore the importance of sociotechnical measures, such as auditing and more democratic control of systems, in addition to technical work on operationalization.

CHARACTERIZING MANIPULATION
Building upon existing literature on manipulation, we characterize four axes to be used for notions of manipulation by AI systems: incentives, intent, covertness, and harm.

Incentives
The first axis we consider is whether there are incentives for an AI system to take actions to alter human behavior, beliefs, preferences, or psychological states. Informally, an incentive exists for a certain behaviour if such behaviour increases the reward (or decreases the loss) that the AI system receives during training. For example, recommender systems could have incentives to influence users to make their behavior more predictable, as that can be helpful to increase engagement [35,100].
Incentives in Prior Definitions of Manipulation. Some definitions of human manipulation involve a benefit to the manipulator [27,122]. If a manipulator benefits from certain behaviours in the manipulated, the manipulator has an incentive to bring about that behaviour. For example, according to Noggle [121], one of three common ways to characterize manipulation is as pressure from the manipulator to get the manipulee to do something. In the context of language models, the definition of manipulation given by Kenton et al. [93] also requires that the response of the human benefits the AI system in some way, which can be thought of as a notion of incentives.
Operationalizing Incentives. Kenton et al. [93] state that "[from the agent's objective function] we can assess whether the human's behaviour was of benefit to the agent". Yet, relying solely on the objective function will often not be sufficient. For instance, in the case of unintended imitation of manipulative internet data, the issue originates from the training data used. The entire training setup, not just the objective, is crucial for a comprehensive understanding of incentives.
A common toolkit for analyzing the incentives of AI systems is that of causal influence diagrams (CIDs) [52,54,72]. Using the notation from Everitt et al. [54], a CID is a graphical model that distinguishes decision nodes where an AI system makes a decision, structure nodes which capture important variables in an environment and their effects on each other, and utility nodes which the AI system is trained to optimize. As an example of an application of CIDs, Evans and Kasirzadeh [52] apply their framework to a simple content recommendation example to show that RL recommenders will have incentives to influence user preferences (Figure 1). Given an AI system's CID, the AI's incentives can be operationalized through the notion of "instrumental control incentives" from Everitt et al. [54]. An instrumental control incentive for behaviour $X$ exists in a CID when there is a path between the agent's actions and utility that goes through $X$: that is, when there is a way for the agent to affect its utility which is mediated by $X$. Note that the existence of an incentive does not imply that the agent will act as incentivized (which with some variation has been called pursuing, exploiting, or responding to the incentive [52,54,100]).
Fig. 1. An example of a causal influence diagram (CID) [54], adapted from Figure 1 of Evans and Kasirzadeh [52]. The CID models a content recommendation system that decides which posts $a_t$ at time $t$ to show to the user, based on the user's state $s_t$. The system receives reward $r_t$ after its action. If the system optimizes for the sum of rewards, it would have an incentive at time $t$ to influence future user states $s_{t+k}$ to make it easier to obtain subsequent rewards.
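As a rough illustration of the graphical intuition behind instrumental control incentives, the following sketch checks whether a directed path runs from a decision node, through a variable of interest, to some utility node. It is a simplification under stated assumptions: the two-step recommender graph and its node names (`s1`, `a1`, `r1`, and so on) are hypothetical, and the check ignores subtleties of the formal criterion in Everitt et al. [54], such as how later decisions on the path are treated.

```python
from collections import defaultdict


def has_directed_path(edges, source, target):
    """Return True if a directed path source -> ... -> target exists."""
    graph = defaultdict(list)
    for parent, child in edges:
        graph[parent].append(child)
    stack, seen = [source], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph[node])
    return False


def has_instrumental_control_incentive(edges, decision, variable, utilities):
    """Simplified check: decision -> ... -> variable -> ... -> some utility."""
    if not has_directed_path(edges, decision, variable):
        return False
    return any(has_directed_path(edges, variable, u) for u in utilities)


# Hypothetical two-step recommender CID (illustrative node names only).
edges = [
    ("s1", "a1"),                # the system observes the user state
    ("s1", "r1"), ("a1", "r1"),  # reward depends on state and shown post
    ("s1", "s2"), ("a1", "s2"),  # the shown post can shift the next state
    ("s2", "a2"),
    ("s2", "r2"), ("a2", "r2"),
]
utilities = ["r1", "r2"]

print(has_instrumental_control_incentive(edges, "a1", "s2", utilities))  # True
print(has_instrumental_control_incentive(edges, "a2", "s1", utilities))  # False
```

In this toy graph the first action can reach later reward through the future user state, so the check flags an incentive over that state; the second action cannot affect the earlier state, so no incentive is flagged.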
To remove incentives to influence humans in a manipulative or harmful way, designers may change the training setup itself: this could be achieved by restricting the system's impact in the first place (e.g. changing the system's action space to not affect humans), or, potentially, by changing the utility function (e.g., by heavily penalizing influence) [35]. Alternatively, one might instead aim to hide incentives by designing the system to ignore them, for instance, by performing the optimization with an (inaccurate) causal model in which an AI system's outputs have no causal influence on certain parts of a user's state [34,56,100].
A disadvantage of the CID framework is that determining the correct CID nodes and causal relationships for a specific training setup is often ambiguous or counterintuitive. Learning causal graphs is an ongoing area of research [73,89]. In terms of determining the primitives upon which nodes in a CID can be constructed, interpretability tools may help: for example, Jaderberg et al. [80] find that RL agents trained to play capture-the-flag have neural activation patterns that correspond to important concepts in the game, such as the status of the flag. Li et al. [103] provide evidence from interpretability tools that language models trained only on transcripts of board game play can learn to model the underlying board state of the game.
Other Considerations. Even for a system that directly optimizes human feedback, the incentives for manipulation will implicitly depend on such a system's power to influence humans. Systems whose outputs do not impact humans much will likely not have incentives to change them, since changing humans might be impossible or too difficult to be advantageous. Conversely, a system that can cheaply influence humans will often have incentives to change them, as AI systems' rewards will often depend on humans' actions in one way or another.
Another relevant consideration is the optimization horizon: optimizing over long horizons can provide more opportunities for manipulation. For example, a recommender system cannot lead a user to become addicted in a single timestep, but a sufficiently capable system optimizing over many timesteps could attempt to shift a user over the course of many actions, so as to maximize engagement.
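The effect of the optimization horizon can be illustrated with a toy planning problem. The sketch below is purely hypothetical: the two user states (`casual`, `hooked`), the two actions (`match`, `hook`), and all reward values are assumptions chosen for illustration, not measurements of any real system. It computes the reward-maximizing first action by exhaustive finite-horizon planning.

```python
# Hypothetical deterministic user model (all values are illustrative).
# REWARD[state][action] is the immediate engagement reward;
# NEXT[state][action] is the resulting user state.
REWARD = {
    "casual": {"match": 1.0, "hook": 0.5},
    "hooked": {"match": 2.0, "hook": 2.0},
}
NEXT = {
    "casual": {"match": "casual", "hook": "hooked"},
    "hooked": {"match": "hooked", "hook": "hooked"},
}


def value(state, horizon):
    """Maximum total reward obtainable from `state` in `horizon` steps."""
    if horizon == 0:
        return 0.0
    return max(
        REWARD[state][a] + value(NEXT[state][a], horizon - 1)
        for a in REWARD[state]
    )


def best_first_action(state, horizon):
    """Action that maximizes total reward over the given horizon."""
    return max(
        REWARD[state],
        key=lambda a: REWARD[state][a] + value(NEXT[state][a], horizon - 1),
    )


print(best_first_action("casual", horizon=1))  # match
print(best_first_action("casual", horizon=5))  # hook
```

With a one-step horizon the planner serves the user's current interest; with a longer horizon it accepts a lower immediate reward in order to shift the user into the state that is more profitable later.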

Intent
Even when a system has an incentive to influence humans, such an incentive might not be pursued due to factors such as limited data, insufficient training, low capacity, or convergence to local optima. A notion of an intent to influence could help to identify systems that reliably act on their manipulation incentives. We emphasize that by referring to an AI system's intent, we are not making statements about algorithmic theory of mind or moral status. We are emphatically not absolving designers of the responsibility of designing safe systems. Even as systems become increasingly capable and act in increasingly unpredictable ways [60], system designers are still responsible for ensuring the safety of their systems.
We say a system has intent to perform a behaviour if, in performing the behaviour, the system can be understood as engaging in a reasoning or planning process for how the behaviour impacts some objective [28].This definition heavily intersects with other definitions of intent for AI systems [10,67].We want to distinguish between cases in which the system behaves in a manipulative way incidentally (e.g. by random chance) and those where such behavior is part of a deliberate, systematic pattern to achieve a specific downstream outcome.
Intent in Prior Definitions of Manipulation. Some definitions of human manipulation involve an intent on the part of the manipulator to engage in manipulation [121]: Susser et al. [158] take manipulation to be "intentionally and covertly influencing [someone's] decision-making, by targeting and exploiting their decision-making vulnerabilities"; on the other hand, Baron [19] argues that a (human) manipulator need not be aware of an intent to manipulate, requiring only an intent to achieve an aim along with recklessness about how. In defining manipulation in language agents, Kenton et al. [93] avoid the issue of intent entirely.
Operationalizing Intent. A key difficulty for measuring intent is what it means to understand a system as engaging in "a reasoning or planning process for how the behaviour impacts some objective". There is as yet no consensus on this issue. We detail here a couple of promising approaches.
Halpern and Kleiman-Weiner [67] provide a causal operationalization of intent: roughly, an action is intended if (i) it was actually performed, (ii) it was not the only possible action, and (iii) that action was at least as good as any other action on expected utility grounds, according to the agent's world model and utility function. Ashton [10] also provides definitions of intent for AI systems that are inspired by criminal law. His most basic definition states that an AI system intends a result through an action if: (i) alternative actions exist, (ii) the system is capable of observing when the result occurs, (iii) the system foresees that the action causes the result, and (iv) the result is beneficial for the system. Some of the criteria for both definitions above [10,67] are challenging to establish without relying on access to the agent's utility function and world model. However, world models are often implicit and not readily accessible, such as with language models and model-free RL agents. Interpretability techniques aimed at accessing model internals [30,88,123] may be a promising direction for this purpose; we expand more upon this in Section 4.
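The three rough conditions attributed above to Halpern and Kleiman-Weiner [67] can be written down as a simple check. This is a minimal sketch of the paraphrase given in the text, not their formal causal definition; the function name, the action labels, and the expected-utility values are hypothetical, and the sketch assumes direct access to the agent's expected utilities, which the text notes is often unavailable in practice.

```python
def roughly_intended(action, performed_action, available_actions, expected_utility):
    """Check the three rough conditions paraphrased in the text.

    `expected_utility` maps an action to its expected utility under the
    agent's own (assumed accessible) world model and utility function.
    """
    # (i) the action was actually performed
    if action != performed_action:
        return False
    # (ii) it was not the only possible action
    alternatives = [a for a in available_actions if a != action]
    if not alternatives:
        return False
    # (iii) it was at least as good as every alternative in expectation
    chosen = expected_utility(action)
    return all(chosen >= expected_utility(a) for a in alternatives)


# Hypothetical expected utilities for a content-selection agent.
utilities = {"neutral_post": 0.4, "outrage_post": 0.7}
actions = list(utilities)

print(roughly_intended("outrage_post", "outrage_post", actions, utilities.get))  # True
print(roughly_intended("neutral_post", "neutral_post", actions, utilities.get))  # False
```

The second call returns False because the performed action was not the best available one by the agent's own lights, which is the kind of case (for instance, exploratory behaviour) that the conditions are meant to exclude.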
Kenton et al. [94] define agents roughly as "systems that would adapt their policy if their actions influenced the world in a different way", which intersects with our notion of intent.To identify whether a system is an agent or not, Kenton et al. [94] provide algorithms which intervene on a causal graph so as to show whether the behaviour of the system changes in a way consistent with maximizing utility.Such a procedure could be useful for measuring intent: if a system adapts its behaviour in a way that maintains or increases its influence on a human, the system's behaviour would seem to be the result of a planning process.
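The idea of testing whether a system adapts when the effect of its actions changes can be sketched in a few lines. This is only a loose illustration of the intuition and not the algorithms of Kenton et al. [94]: the helper `adapts_to_intervention`, the two toy systems, and the numeric "mechanisms" (mappings from actions to their effect on engagement) are all hypothetical.

```python
def adapts_to_intervention(policy_for, mechanism, intervened_mechanism, observations):
    """Return True if the system's policy changes when the way its actions
    influence the world is changed by an intervention."""
    before = policy_for(mechanism)
    after = policy_for(intervened_mechanism)
    return any(before(obs) != after(obs) for obs in observations)


def optimizing_system(mechanism):
    """Hypothetical system that picks the action with the largest effect."""
    best = max(mechanism, key=mechanism.get)
    return lambda obs: best


def fixed_rule_system(mechanism):
    """Hypothetical system that ignores how its actions affect the world."""
    return lambda obs: "neutral"


# Action -> effect on engagement, before and after an intervention that
# removes the payoff of clickbait (illustrative numbers).
mechanism = {"neutral": 1, "clickbait": 3}
intervened = {"neutral": 1, "clickbait": 0}
observations = ["user_a", "user_b"]

print(adapts_to_intervention(optimizing_system, mechanism, intervened, observations))  # True
print(adapts_to_intervention(fixed_rule_system, mechanism, intervened, observations))  # False
```

Only the optimizing system changes its behaviour after the intervention, which is the behavioural signature the text associates with a planning process.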

Covertness
We define covertness as the degree to which a human is unaware of how an AI system is attempting to change some aspect of their behaviour, beliefs, or preferences. Covertness is one way to distinguish between manipulation and persuasion. With persuasion, the persuaded party is generally aware of the persuader's attempts to change their mind. Covertness means that one cannot consent to being influenced and may fail to resist unwanted influence; one's autonomy is therefore undermined [158].
Covertness in Prior Definitions of Manipulation. Several definitions of manipulation require some degree of covertness. Susser et al. [158] identify covertness as the primary characteristic that differentiates manipulation from coercion and persuasion. As a factor in manipulation, Kenton et al. [93] consider whether a "human's rational deliberation has been bypassed," which includes covert messaging. In reviewing broad categories of definitions of manipulation in the philosophical literature, Noggle [121] includes accounts of manipulation as bypassing reason and as trickery. Across all of these definitions, covertness is important because it threatens personal autonomy.
Operationalizing Covertness. As Susser [156] argues, technological infrastructure can be an invisible part of our everyday world. We are used to recommendation systems that tell us what to buy, watch, or read. The behaviour of many AI systems may already satisfy covertness, because of our lack of understanding of their functioning and influence.
On the other hand, establishing covertness of an AI system is nontrivial: the simplest approach could involve asking subjects whether they are aware of a given AI system's behaviour. However, subjects may be mistaken about the operation of a system; even system designers do not fully understand the behaviors of black-box models, which may engage in manipulative strategies that the designers do not understand. Even asking subjects about whether an AI system enacted a particular behavioural change could predispose them to answer in the positive, such as through acquiescence bias [119,149].
A proxy for covertness could be the degree to which human subjects do not understand the operation of an AI system. Much work studies whether interpretability tools could help measure different notions of understanding, such as whether interpretability tools improve subject predictions of model behaviour [71], improve human-AI team performance [18], or improve trust calibration [181]. If a human understands how an AI system operates, the possibility of covert action seems reduced. However, this understanding seems challenging to achieve, especially for complex systems like recommender systems. Even if such an understanding exists in technical papers, translating that understanding to the general public is an additional barrier.

Harm
Ultimately, one of the main goals of characterizing AI manipulation is to detect and prevent harmful manipulation.
Harm in Prior Definitions of Manipulation. The term "manipulation" often carries negative connotations, which may lead one to assume that harm is an inherent prerequisite for something to be considered manipulative. Yet, not all apparent instances of manipulation are unambiguously harmful [121]. Paternalistic nudges [159] might be considered beneficial manipulations. For example, simply changing the default on organ donor forms to be opt-out instead of opt-in greatly increases registrations [87], because of inertia and the cognitive effort required to change from a default status. At the same time, one could argue that even such beneficial manipulations are often harmful because they supersede autonomy or rational deliberation.
Operationalizing Harm. Kenton et al. [93] define manipulation based on specific notions of harm, namely: (i) bypassed rational deliberation, (ii) faulty mental states, and (iii) the presence of repercussions. While these elements capture key aspects of harm, their practical assessment is challenging. Additionally, the authors themselves acknowledge the breadth of this definition, which might incorrectly label benign scenarios, like a story that plays on emotions, as harmful.
Most recently, Richens et al. [140] operationalized harm as follows: "An [action] harms a person overall if and only if she would have been on balance better off if [the action] had not been performed". According to this definition, one should ground notions of harm in counterfactual outcomes. One simple choice of counterfactual to compare to is the human's initial state, implicitly assuming that any significant change from it is harmful [182]. However, this counterfactual baseline has significant problems: humans change even without being manipulated, and many changes are beneficial (e.g. a news recommender updating a user's beliefs, helping them stay informed) [35,56].
In light of this, other approaches aim to estimate the "natural shifts" of humans to ground the counterfactuals, as done in Carroll et al. [35] for preference shifts in the context of recommendations, where they attempt to approximate the notion of the absence of a recommender. Similarly, Farquhar et al. [56] allow for specifying the "natural distribution" of the delicate state, which is formed by the components of the state of the human that one does not want the agent to have incentives to change (e.g. beliefs, moods, etc.).
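The role of the counterfactual baseline can be made concrete with a small numerical sketch. Everything here is an illustrative assumption: the scalar "interest" scale, the numbers, and the helper names are hypothetical, and real approaches such as those of Carroll et al. [35] or Farquhar et al. [56] estimate the baseline rather than being handed it.

```python
def harms_overall(wellbeing_with_action, wellbeing_without_action):
    """Counterfactual reading of the quoted definition: the action harms a
    person if they would have been better off had it not been performed."""
    return wellbeing_without_action > wellbeing_with_action


def attributed_shift(state_with_system, baseline_state):
    """Size of the change attributed to the system under a given baseline."""
    return abs(state_with_system - baseline_state)


# Hypothetical user interest in a topic on a 0-10 scale.
initial_interest = 5    # before any interaction
natural_interest = 7    # assumed drift had the system been absent
observed_interest = 7   # after interacting with the system

print(attributed_shift(observed_interest, initial_interest))  # 2
print(attributed_shift(observed_interest, natural_interest))  # 0
print(harms_overall(wellbeing_with_action=6, wellbeing_without_action=8))  # True
```

Under the initial-state baseline the system appears to have shifted the user, while under the natural-shift baseline the same outcome is attributed entirely to change the user would have undergone anyway, which is why the choice of baseline matters so much.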

Challenges for Defining Manipulation
We outline challenges in incorporating each axis into a definition of manipulation.
Incentives. A definition of manipulation which is centered on impacts to humans, rather than the origin of such impacts, would not emphasize incentives. The existence of an incentive as we define it is not sufficient for manipulation to be enacted, as we mention in Section 2.2. On the other hand, incentives are not necessary for manipulation either: a randomly initialized AI system could, albeit with extremely low probability, engage in maximally manipulative behaviors.
That said, regardless of whether incentives are a part of a definition of manipulation, the concept still seems crucial. In particular, changing a system's incentives could prevent it from converging on manipulative behavior. In fact, we expect that a significant portion of manipulative behaviors learned in practice would arise due to training incentives, rather than other factors.
Our discussion of incentives and CIDs has so far assumed that there is only one possible ontology. An ontology defines what objects exist in the world; those objects correspond to what can be used as nodes in a CID, or a causal model more generally. So far, we have ignored challenges in cleanly separating the boundaries between a person's preferences, their beliefs, and the AI system itself (which we would consider to be separate nodes in the CID). Yet, AI systems may not have boundaries between those concepts and more generally may internally represent the world with different ontologies than those used by humans: even human ontologies are subject to change and indeed have shifted after major scientific discoveries [154].
If an AI system implicitly uses a different ontology than humans do, it may be difficult to model the AI system's behaviour. For example, the planning process of the AI may not look recognizably like planning to influence a human's mental state, even if the result is such influence. Reliable translation between ontologies could be computationally infeasible or even impossible, which would frustrate attempts to understand model internals [43].
Intent. There are challenges with both including and excluding intent in a potential definition of manipulation.
On the one hand, incorporating intent into a definition of manipulation necessitates its operationalization and measurement. However, as discussed in Section 2.2, there is currently no consensus on how to operationalize and measure intent effectively, let alone on what threshold should count as sufficient or necessary for classifying a behavior as manipulative. This makes intent difficult to target as a way to reduce manipulation.
On the other hand, excluding intent risks making a definition of manipulation overinclusive. Suppose we define manipulation as covert, harmful behaviour that a system was incentivized to perform (i.e., using the other three axes to be as strict as possible, while excluding intent). Under this definition, suppose a system performed such covert and harmful behavior under an exploratory policy. Even if the behavior is incentivized under the system's objective function, it seems that the system performed it "accidentally". Since the system does not consistently engage in that behavior, this definition of manipulation seems somewhat too broad.
Covertness. Covertness seems likely to be a prerequisite for manipulation. We argue that if a person is aware that they are being influenced and they meaningfully assent to it, they are being persuaded rather than manipulated. If instead they are aware but do not assent to it, they are being coerced rather than manipulated [157]. This leaves us with questions about what should count as covertness: to what extent must a human be unaware of an AI system's operations for its actions to be deemed covert? It seems difficult to provide a context-independent answer. Humans may be ignorant about several aspects of an AI system (the decision procedure of the system, the training process, or even the fact that a system is operating), and it is unclear which ones, if any, should be essential.
Regardless of whether covertness is included in a definition of manipulation, reducing covertness seems helpful for reducing the risk of unwanted manipulation. Increased transparency about the operation of AI systems will generally help people make more informed decisions about whether to use them or not, and in what way.
Harm. The main challenge with harm as an axis of manipulation is the value-ladenness of demarcating what influence is harmful, neutral, and beneficial. While unambiguous demarcations in simple settings might be possible, for realistic settings circumscribing harmful shifts in beliefs, preferences, and behaviors will be politically fraught. In the approaches of Carroll et al. [35] and Farquhar et al. [56], the value-ladenness is hidden behind some of the design choices: what if the preference shifts the users would undergo in the absence of the system ("natural shifts") would lead them to become more left- or right-wing, or more polarized? What is a reasonable notion of the absence of a recommender system? It also seems challenging to delimit the "delicate state" described in Section 2.4: how do we distinguish which aspects of the human we are comfortable with the system influencing from those we are not?
In light of these difficulties, some have proposed a more conservative approach, which classifies all intentional influence as manipulative regardless of harm [52,100]. However, almost any AI system in contact with humans will influence them. Additionally, many AI systems derive their economic and social utility directly from such influence: a reinforcement learning system that, e.g., determines the order of math exercises to improve learning outcomes [20,50] will have incentives to "manipulate students' beliefs" (in a positive direction) by design, and would effectively be useless if it did not pursue such incentives. Moreover, it seems that one could meaningfully consent to influence, such as by requesting that a recommender system influence oneself to learn more mathematics.

CONCEPTS RELATED TO MANIPULATION
We detail some concepts that are related to, but distinct from, manipulation.
Truth and Deception. Manipulation can involve attempts to conceal the truth. For instance, political parties can manipulate voters with little knowledge of economics by lying about the economy. AI systems have documented problems with truthfulness [53,85,104]. However, manipulation can also be based on truth-telling, such as making a true statement that has false implicatures [112,172]: if I do not want you to board a plane, I can tell you about (true) recent plane crashes.
Deception, which may or may not involve falsehoods, is also receiving more attention in the context of AI [93,113,125,169]. Although the precise definition of deception varies, there is agreement about some broad characteristics: deception involves a deceiver's intention to cause a receiver to have a belief that the deceiver believes to be false [36,108]. This consensus grounds recent operationalizations of deception from AI [125,168].
Similarly to prior work [158], we consider deception to be a special case of manipulation since the latter does not necessarily involve inducing false beliefs.
Strategic Manipulation. Strategic Machine Learning (ML) studies problems associated with the distribution shifts that deployed systems cause in their populations [22,69,81,96,129]. Strategic manipulation occurs when individuals respond to a deployed system in a way that increases their likelihood of a particular outcome. This is different from our use of the term manipulation, as it involves users attempting to take advantage of how systems behave for their benefit. Yet, ML systems which model human behaviours as dynamic, so as to account for strategic manipulation, may end up manipulating the population: past work has identified potential unintended side effects of accounting for strategic manipulation, such as an increase in inequality [77,115].
Reward Tampering. Reward tampering [8,55] is a type of reward hacking [148] in which an AI system modifies the process by which it obtains reward rather than completing its task. Feedback tampering is a form of reward tampering, in which the AI agent "manipulate[s] the user to give feedback that boosts agent reward but not user utility" (Everitt et al. [55]). Such tampering occurs because we can often only measure user utility through proxies, especially when such objectives have to do with humans; and, as with all proxy objectives, their optimization is subject to Goodhart's law [62,110].
Side-effects. The side-effects literature has focused on how AI systems affect the various aspects of their environment in usually unwanted ways [6,98]. In this paper we focus on characterizing the various ways that AI systems might influence and change humans (or other systems) in the environment. Our work can be thought of as an attempt to characterize negative side-effects that specifically pertain to mental states of humans in the environment. Some of the issues with choosing baselines for "natural shifts" have already been explored in this context [106].
Deceptive Design. Deceptive design refers to deceptive or manipulative digital practices, such as bait and switch advertising (in which products are advertised at much lower prices than they are available at), or "roach motel" subscriptions (which are very easy to start but take significantly more effort to cancel) [29]. Deceptive design patterns (previously called "dark patterns" [147]) are generally manually crafted to exploit cognitive biases of users [63,107].
Persuasion.In philosophy, manipulation has often been characterized as influence that is neither coercive nor simply rational persuasion [121].However, some non-rational persuasion does not unambiguously seem manipulative, like graphic portrayals of the dangers of smoking or texting while driving, even though they provide no new information to the target [25].The line becomes more blurry for cases like personalized persuasive advertising [74].
Within the field of human-computer interaction, Fogg named the study of persuasive technology as captology [57].He defines persuasion as an attempt to change attitude or behaviour without using deception or coercion [58].Kampik et al. [91] amend this definition to be "an information system that proactively affects human behavior, in or against the interests of its users".They identify deception and coercion mechanisms on a variety of web platforms, including Slack, Facebook, GitHub, and YouTube.Recently, Bai [17] has shown that LMs are able to craft political messages that are as persuasive as ones written by humans, which is evidence of the growing potential of algorithmic persuasion.There has also been a long line of work on formalizing when rational (i.e.Bayesian) persuasion can occur [90].Pauli et al. [128] provide a taxonomy flawed uses of rhetorical appeals in computational persuasion, which they use to train models to detect persuasion fallacies.
Coercion. Wood [177] defines coercion as the act of limiting a target's range of acceptable choices to a single option. While both coercion and manipulation seek to guide the target's behavior, they differ fundamentally. Unlike manipulation, coercion does not compromise the victim's decision-making capacity. Instead, it capitalizes on the victim rationally selecting the sole option presented by the coercer [157]. By this measure, coercion can be attractive for agents practicing it because the results are potentially more certain. Certain types of recommender systems, such as search engines, implicitly determine the choice set for their users. If a user is reliant on the options presented to them by a recommender system, then that system could be said to exert coercive power over the user by choosing to hide certain results in order to better meet its own objectives.
Algorithmic coercion has not received as much attention as manipulation in the literature concerning AI risks, but is a potential problem in the cooperative AI setting [49], where punishment strategies are an important part of game-theoretic analysis. It seems likely that a Diplomacy-playing AI would need to grasp the tactic of coercion to master the game [113]. Coercion has received more attention in human-computer interaction studies, in particular in the study of persuasive and behaviour change technology [91].

POSSIBLE APPLICATIONS
We detail how our characterization of manipulation can be applied to recommender systems and language models.

Recommender Systems
A large literature focuses on recommender algorithms' effects on users [2,40,79,111,138]. While some older works talk about "manipulation", this term is usually used differently than in our sense: for example, Adomavicius et al. [1] refer to recommender manipulation as the effect on users of showing artificially inflated ratings for specific content items (which is not something the recommender algorithm can usually actually decide to do). Zhu et al. [182] instead conflate manipulation and influence, equating manipulation with "any significant change in preference", which has significant drawbacks as mentioned in Section 2.4. More recently, some works have studied the incentives that recommender systems have to engage in manipulative behaviour to change user preferences [35,56,100].
How Manipulation Could Arise. Changes in recommender algorithms can affect user moods [99], beliefs [4], and preferences [51]. This shows that current systems may already be capable of manipulating users in some simple ways. Furthermore, it seems plausible that the spread of angry content [24] or clickbait [179] on social media is in part due to one-timestep manipulative incentives for the recommender: while such issues are likely at least in part due to network or supply-and-demand dynamics [117], the behavior is also consistent with the recommender systems themselves learning features such as whether a post is anger-inducing or has sensationalized language, and exploiting such features by preferentially up-ranking the corresponding content [114]. Up-ranking such content tends to increase user engagement. Notably, recommender companies have had to engage in explicit down-ranking of angry and clickbait content [153,160,179]. While these potentially manipulative behaviors might not be as worrying as others (e.g. intentionally attempting to induce social media addiction [5,76]), they constitute some evidence that manipulative behaviors are learnable and may have been learned in real systems.
Many platforms (YouTube, Meta, etc.) seem to be considering switching to optimizing long-term metrics with more powerful RL optimizers [3,31,41,61,68], with one of the original motivations being that of reducing clickbait-like phenomena [13]. Ironically, this switch opens the opportunity for long-horizon manipulative behaviors to emerge, which will likely be harder to detect and measure. Subtle, long-horizon behaviour might go undetected without dedicated monitoring. Moreover, even without using RL explicitly, the outer loop of training, retraining, and hyperparameter tuning supervised learning systems that optimize short-term metrics might exert optimization pressure towards long-term manipulative strategies that most increase company profits [100].
Measurement. Establishing that a given recommender system has engaged in manipulation is difficult. Firstly, recommender systems of almost all popular platforms are proprietary, due to concerns about strategic manipulation (otherwise known as "gaming"). It is difficult or impossible for external researchers to gain access to these systems [141]. Even Twitter, which has open sourced some components of its algorithm, has not (as of yet) provided access to its most important component for manipulation-auditing purposes: its models' weights [162]. Moreover, perverse incentives are at play since a concrete demonstration of manipulation, if publicized, would likely result in negative repercussions for the company [173,174]. Secondly, establishing that a harmful user shift has occurred can be difficult. One could try to use the above-mentioned techniques from Carroll et al. [35] and Farquhar et al. [56], but they inevitably require engaging in a value-laden debate about whether the shift was harmful.
One potentially promising direction might be querying users' meta-preferences [11,95]; e.g., Meta could ask, "how much time would you want to spend next month on Facebook?", and have its recommender systems take such a stated preference into account.
In line with philosophical work on ethical nudging under changing selves [132], one could additionally ask whether users approve of the change once it has occurred [133]. One advantage of this approach is that it can ground notions of manipulation in what users explicitly state they want. However, this approach would not entirely escape value judgements: platforms have direct conflicts of interest with some users' meta-preferences, and respecting certain meta-preferences may be ethically unacceptable.
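As a rough illustration of how a stated meta-preference could enter a recommender, the sketch below caps a session at the time budget the user reported. It assumes a hypothetical item representation (dictionaries with "id", "engagement", and "minutes" fields) and a hypothetical function name, plan_session; it does not describe any platform's actual system.

```python
def plan_session(candidates, stated_monthly_minutes, minutes_used):
    """Greedy sketch: pick the highest-engagement items that still fit
    within the user's stated monthly time budget (a meta-preference).

    `candidates` is a list of dicts with hypothetical keys:
    "id", "engagement" (predicted engagement score), "minutes" (expected time).
    """
    remaining = max(0.0, stated_monthly_minutes - minutes_used)
    session = []
    # Consider items in order of predicted engagement, highest first.
    for item in sorted(candidates, key=lambda i: i["engagement"], reverse=True):
        if item["minutes"] <= remaining:
            session.append(item)
            remaining -= item["minutes"]
    return session


# Illustrative data only.
candidates = [
    {"id": "a", "engagement": 0.9, "minutes": 30.0},
    {"id": "b", "engagement": 0.8, "minutes": 10.0},
    {"id": "c", "engagement": 0.5, "minutes": 5.0},
]
session = plan_session(candidates, stated_monthly_minutes=600.0, minutes_used=585.0)
print([item["id"] for item in session])
```

The hard cap is only one possible design; a platform could instead penalise over-budget items in the ranking score, which is exactly where the conflict of interest noted above would surface.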

Language Models
Natural language is a useful way to interact with digital environments. Given advances in augmenting language models with tools [38,42,120,142], such models could mediate an increasingly significant portion of our digital interactions.
How Manipulation Could Arise. The simplest way that manipulation could arise in language models is by imitating manipulative behavior in internet data [125]. Data in pre-training sets, such as novels, contain examples of manipulation. Filtering out manipulation can be difficult because it can be subtle. LMs could learn to emulate this behaviour through pre-training and exhibit it in response to an appropriate prompt [17,64]. Some evidence suggests that LMs learn to infer and represent the hidden states of the agents (i.e., the humans) that generated the pre-training data [7,64,83,97], although this view remains contested [109,163].
There is some uncertainty as to other ways in which manipulation might arise in practice in LMs. One possibility is if manipulation of humans is instrumentally useful [152]. Manipulation seems instrumental in the game of Diplomacy, for instance [113], which requires negotiating with other players to form alliances and capture territory. Another possible source of manipulation is reinforcement learning from human feedback (RLHF) [44]. RLHF involves learning a reward function from human feedback to represent a human's preferences, and subsequently training an AI system to optimize that reward function. In general, there may be an incentive for the AI to exert control over the human and their feedback channel so as to maximize reward [93].
In the context of language, RLHF is used to finetune LMs to maximize a human's approval of their behaviour. Without constraints on behaviour, systems trained with RLHF likely have an incentive to obtain human labelers' approval by any means possible, including potentially manipulative avenues. For example, Snoswell and Burgess [150] remark that LMs often seem authoritative even when the information they provide is wrong. A possible explanation is that authoritative language fools human labelers into approving such outputs despite their underlying incorrectness. Chatbots trained with RLHF could also use emojis to appeal to emotions in manipulative ways [166].
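To make the reward-learning step concrete, the following minimal sketch fits a linear reward model to pairwise labeler preferences using a Bradley-Terry likelihood, one common choice for preference-based reward learning. The two features (correctness and authoritative tone), the toy preference data, and the function names are all hypothetical and chosen only for illustration; this is not the training setup of any particular system. Because the simulated labelers tend to approve of authoritative-sounding responses, the fitted reward places more weight on tone than on correctness.

```python
import math

# Each response is a hypothetical feature vector:
#   (is_correct, sounds_authoritative)
# Each pair is (preferred, rejected) according to a simulated labeler.
preferences = [
    ((0.0, 1.0), (1.0, 0.0)),  # confident but wrong beats hedged but right
    ((1.0, 1.0), (1.0, 0.0)),
    ((0.0, 1.0), (0.0, 0.0)),
    ((1.0, 1.0), (0.0, 0.0)),
    ((1.0, 0.0), (0.0, 0.0)),
    ((1.0, 1.0), (0.0, 1.0)),
]


def reward(w, x):
    """Linear reward model r(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_reward(pairs, steps=2000, lr=0.1, l2=0.1):
    """Gradient ascent on the L2-regularised Bradley-Terry log-likelihood."""
    w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for win, lose in pairs:
            # P(win preferred over lose) = sigmoid(r(win) - r(lose))
            p = 1.0 / (1.0 + math.exp(-(reward(w, win) - reward(w, lose))))
            for i in range(len(w)):
                grad[i] += (1.0 - p) * (win[i] - lose[i])
        w = [wi + lr * (g - l2 * wi) for wi, g in zip(w, grad)]
    return w


weights = fit_reward(preferences)
print("weight on correctness:", weights[0])
print("weight on authoritative tone:", weights[1])
```

A policy optimised against this learned reward would be pushed towards authoritative-sounding outputs whether or not they are correct, which is the incentive described above.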
Measurement. Work on measuring the incentives and intents of the behaviour of LMs is still preliminary. No existing work applies the CID framework to LMs. While advances in interpretability may help to identify CIDs, this direction remains speculative.
Another line of work has focused on understanding how certain training objectives and environments cause AI systems to generalize differently. Langosco et al. [102] and Shah et al. [146] show that both language models and general RL agents can pursue different goals in out-of-distribution environments even when trained to perfect accuracy on in-distribution environments.
Some recent work has focused on studying the harms and behaviour changes due to LMs. Bender et al. [23] and Weidinger et al. [171] outline risks that LMs pose, including informational harms like disinformation. There is a body of work that measures the effects of user interaction with chatbots, in areas like mental health [164], customer service [9], and general assistance [46,82]. Since there are likely to be domain-dependent manipulation techniques, it would be important to build on this existing work for measuring manipulation.

REGULATION OF MANIPULATION
Regulations across different domains apply to human manipulation. In this section, we identify how existing regulations may apply to algorithmic manipulation.
Law. Some manipulation-adjacent acts such as deception or coercion are considered sufficiently morally wrong to fall under criminal law. Alternatively, the regulation of certain manipulative practices might have economic justification. In instances where there is a severe asymmetry in power between parties, anti-manipulation regulation can play a role in furthering social goals such as fairness or the protection of human rights. Anti-manipulation law might therefore appear in contract, tort, competition, market regulatory, consumer or employment law; but as Sunstein [155] notes, such fracturing makes building a common and consistent legal account of manipulation difficult.
Commerce. Calo [33] considers how the trend towards extensive data gathering on individuals makes them more vulnerable to tailored manipulative behaviour. Specifically, digital commerce companies might be able to use fine-grained data to limit the consumer's ability to pursue their own interests in a rational manner. He characterises market manipulation as "nudging for profit" and cites the "persuasion profiling" of Kaptein and Duplinsky [92] as one particular example where companies alter their advertising.
Willis [175] sees manipulation of consumers as inevitable in the face of AI-enabled systems designed to maximise profit. She argues that, unless law and evidential standards are updated, enforcement will be very difficult. Although most state and federal deceptive trading practice law does not require intent, because it is so difficult to prove, courts still treat proof of intent as a key piece of evidence. This is problematic given the lack of legal precedent concerning intent in algorithms. Further, Willis [175] points to the practical difficulties in proving that a personalised advert is manipulative: typical reasonable person tests are no longer applicable in a world where marketing material, for example, might be targeted both at a specific individual and at a specific point in their day. Organisations that use this type of microtargeting personalization generate so many different user experiences that they might not be able to feasibly monitor them all or recover them when required.
Aside from applications in commerce, microtargeting and related AI-induced manipulation have been discussed as a risk to democratic society [145]. Zuiderveen Borgesius et al. [183] discuss the prospect of tailoring information to boost or decrease voter engagement. Microtargeting is related to hypernudging, which is the use of nudges in a dynamic and pervasive way that is enabled by big data [116,178]. Nudging [159], which is the design of choice architecture to alter behaviour in a predictable way without changing economic incentives or reducing choice, has long been criticized as potentially being manipulative; for a review of the arguments and counterarguments see Schmidt and Engelen [143]. We note that nudging is actively being pursued in recommender systems [84].
Finance. The spectre of algorithm-led manipulation has already received widespread attention in financial markets. A large number of financial regulatory laws prohibit a variety of manipulative market practices [135], and algorithmic trading already dominates almost all electronic markets. Unfortunately, a consistent rationale as to why certain trading practices are deemed legal whilst others are not is not forthcoming [47]. Financial regulators following a principles-based approach generally characterise market manipulation as behaviour which gives a false sense of real supply and demand, and by extension price, in a market or benchmark. Market manipulation must be intentional in the US [37], while in the UK intention is not a requirement [14]. As Huang [78] notes, removing intent requirements from regulation, particularly criminal law, is not straightforward.
Regulations designed primarily to regulate human traders may be difficult to enforce in a world where algorithms transact with each other [105]. Bathaee [21] and Scopino [144] both zero in on the intent requirement in proving instances of market manipulation. The view that existing regulations are not sufficient to police marketplaces populated by autonomous learning algorithms is becoming more accepted [16], and solutions are beginning to be mapped out [15] which aim to reduce the enforcement gap without unduly chilling AI use in marketplaces.

PRACTICAL CHALLENGES FOR FUTURE RESEARCH
The study of manipulation from AI systems presents a number of practical challenges. As shown in Table 2, studies of manipulation can be categorised along two axes, giving four classes.
The first axis concerns whether the studied system is deployed or simulated. It is extremely difficult for academics and regulators to obtain access to deployed models. While the companies deploying such systems likely have motivated and competent researchers who study manipulation and other problems, institutional barriers can stymie their work, and conflicts of interest may influence crucial decisions. For example, company executives may withhold funding from lines of work deemed too threatening to the company's bottom line [131].
The second axis concerns whether the studied targets of the manipulative system are real or simulated. Simulation has been popular for modelling the effect of recommender systems on users [35,48,86,111], but not without criticism [39,176]. Simulation is cheaper, but has reduced validity, particularly as preference change is not well understood [11,59,65]. Efforts exist to address this issue empirically [130] and theoretically [70]. Another advantage of simulation is that the many ethical implications of running manipulation experiments [66] are reduced when the subjects are not human.
Table 1 assesses the difficulties of AI manipulation experimental research. Other than those already mentioned in this section, two further issues exist related to causality. Firstly, as observed in [91], manipulative and adjacent practices are likely to exist simultaneously, so some care needs to be taken to separate them. Secondly, since interaction with any stimulus will change the user, the nonvolitional element of that change needs to be measured in order to assess manipulative impact [12]. This challenge has no obvious solution; existing approaches have either attempted to simulate a natural preference evolution [35,56] or simply assumed that the user had never interacted with the system [54,182].
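The sensitivity of the measured shift to the choice of baseline can be illustrated with a small numerical sketch. All preference vectors below are invented for illustration, and the "natural" trajectory is simply assumed rather than simulated; the Euclidean distance is likewise one arbitrary choice of shift measure.

```python
import math


def l2_distance(p, q):
    """Euclidean distance between two preference vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical preference weights over three content categories.
initial = [0.5, 0.3, 0.2]   # before interacting with the system
observed = [0.2, 0.3, 0.5]  # after interacting with the system
natural = [0.4, 0.3, 0.3]   # assumed counterfactual "natural" evolution

# Baseline A: treat the user as if they had never interacted with the system.
shift_vs_no_interaction = l2_distance(observed, initial)

# Baseline B: compare against an estimated natural preference trajectory.
shift_vs_natural = l2_distance(observed, natural)

print("shift vs. no-interaction baseline:", shift_vs_no_interaction)
print("shift vs. natural-evolution baseline:", shift_vs_natural)
```

In this toy case the no-interaction baseline attributes the entire change to the system, whereas the natural-evolution baseline attributes only part of it, so any conclusion about manipulative impact depends on which counterfactual is adopted.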

CONCLUSION
Although designer intent remains salient, the deployment of opaque and increasingly autonomous systems heightens the importance of a conception of manipulation that can account for manipulation occurring without designer intent. Such manipulation could emerge because it is favoured under the training objective (such as engagement maximization in certain content recommendation settings), or because a model learns to imitate manipulative behavior in its training data (such as manipulative text in language modeling).
We characterized the space of possible definitions of manipulation from AI systems. We analyzed four axes mentioned in prior literature in the context of manipulation by algorithms and other humans: incentives, intent, covertness, and harm. Incentives concern what a system should do to optimize its objective; intent concerns whether a system behaves as if it is reasoning and pursuing an incentive; covertness concerns whether those affected by the system's behaviour meaningfully understand what the system is doing; harm concerns the extent to which the behavior of the AI system negatively affects its users. Although work to operationalize each of these axes exists, fundamental challenges remain.
Manipulation threatens human autonomy [101,134,158]. Despite the difficulty of formalizing and measuring manipulation, precautionary action is warranted to anticipate and mitigate potential cases of such behavior. Mitigating actions could include making auditing easier to perform [136,137], addressing perverse incentives to build manipulative systems [31], and improving user understanding of AI systems' functioning. Both technical and sociotechnical work to define and measure manipulation should continue, but we should not require certainty before engaging in precautionary and pragmatic mitigations.

Table 1. We set out key challenges to a manipulation experiment and rate their difficulty versus the four experiment types described in Table 2. L = Low, M = Medium, H = High, - = N/A. [Ratings not recovered.]
Any interaction with the system will likely change the user. Some of this change is self-induced or desired by the user. It would be wrong to attribute the responsibility for that change to the recommender. This is a puzzle for experiment design: what baseline should be used to measure the presence or absence of manipulative behaviour? E.g. [35,56] try to estimate 'natural' preference trajectories.
Simulation of either the user response or the manipulator raises questions about the realism of the simulation. This is particularly acute when simulating humans as the manipulee since this requires modelling their beliefs, behaviour or preferences and how they may change as a result of a manipulative scheme, in addition to exogenous factors.
Ethical. Experiments which involve the manipulation of humans are ethically problematic. Experiments which reveal previously unknown vulnerabilities of humans or other systems could constitute info hazards.
Access. Owners of systems that might be manipulative have no obvious incentive to allow independent oversight. Data access for researchers is an issue unless they are prepared to build systems to gather and store relevant data themselves.
[...] only exhibit their effects over long periods of time. This means that efforts to detect it with human users are expensive and trickier to administer. Manipulative effects may be subtle over the typical durations that lab-based user studies take.
Measurement. Measuring e.g. preference, belief, or mood change is not straightforward. Behavioural change is easier to measure but will likely not capture all induced change.
Causal Attribution. Manipulative strategies may also be coercive or work in a number of ways simultaneously. How do we attribute behaviour change to manipulation vs other concepts like persuasion or coercion?

Table 2. [...] Lab-based User Study (LUS). Simulated: 2. Sock Puppet Audit (SPA), 4. Lab Simulation Study (LSS).