Capacity Constrained Influence Maximization in Social Networks

Influence maximization (IM) aims to identify a small number of influential individuals to maximize the information spread and finds applications in various fields. It was first introduced in the context of viral marketing, where a company pays a few influencers to promote the product. However, apart from the cost factor, the capacity of individuals to consume content poses challenges for implementing IM in real-world scenarios. For example, players on online gaming platforms can only interact with a limited number of friends. In addition, we observe that in these scenarios, (i) the initial adopters of promotion are likely to be the friends of influencers rather than the influencers themselves, and (ii) existing IM solutions produce sub-par results with high computational demands. Motivated by these observations, we propose a new IM variant called capacity constrained influence maximization (CIM), which aims to select a limited number of influential friends for each initial adopter such that the promotion can reach more users. To solve CIM effectively, we design two greedy algorithms, MG-Greedy and RR-Greedy, ensuring the $1/2$-approximation ratio. To improve the efficiency, we devise the scalable implementation named RR-OPIM+ with $(1/2-\epsilon)$-approximation and near-linear running time. We extensively evaluate the performance of 9 approaches on 6 real-world networks, and our solutions outperform all competitors in terms of result quality and running time. Additionally, we deploy RR-OPIM+ to online game scenarios, which improves the baseline considerably.


INTRODUCTION
Given a social network G, a diffusion model M describing how a user is influenced via social connections, and a constant k, the influence maximization (IM) problem asks for k users in G that can (directly and indirectly) influence as many users as possible under M. It finds applications in viral marketing [11], network monitoring [25], rumor control [4], and so on. Amid them, viral marketing is the scenario where IM originates from. In this scenario, a company seeks to promote a product by incentivizing a few influential individuals in hopes of creating a cascade of adoptions via the word-of-mouth effect. Apart from the cost issue, the individual's capacity for spending effort on consuming the promoted content becomes crucial when the promotion and incentive go virtual. In particular, multiple studies [34,51] show that users have limited time to spend on social media. This leads to limited adoption of promoted content, despite its widespread distribution.
However, existing IM and its variants neglect the capacity of individuals, hindering their practicality in the corresponding marketing scenario. Before explaining, we first introduce the incentive propagation events of online games, whose objective is stimulating interactions between friends. Specifically, the service provider of an online gaming platform distributes virtual incentives to a set of pre-selected users, called active participants (APs), who are more likely to participate in this promotion, and encourages them to play with friends from a recommendation list. Once the recommended friends play with APs, they also attain the incentive and can share it with their friends during daily co-playing. The crux of in-game incentive propagation events is recommending existing friends to APs. This is because each user has numerous friends but only a limited capacity to play with them. In addition, we observe that APs are more likely to be the friends of influencers rather than the influencers themselves, highlighting the importance of choosing influential friends for APs. In contrast, IM and most of its variants assume that each selected influencer unconditionally adopts the promotion from the merchant without relying on friends, which contradicts this insight. Moreover, it is rather challenging to utilize IM and the corresponding solutions [22,46,49], since independently selecting friends for each AP via an IM solver can incur immense computational overhead, and the result quality remains unclear.
To this end, we propose a new IM variant called the capacity constrained influence maximization (CIM) problem. Given a social network G, a diffusion model M, a set A of APs, and a constant b, CIM aims to find b influential friends (seeds) for each user in A, such that the number of influenced users starting from all distinct seeds is maximized under M. To solve CIM, we design a vanilla solution MG-Greedy, which employs a greedy strategy to select the best feasible seed from all neighbors of A and provides a 1/2-approximation guarantee. In addition, we propose the solution RR-Greedy to select seeds for each user of A in a round-robin manner, which improves the time complexity of MG-Greedy by a factor of |A| and still ensures at least a 1/2-approximation. To improve the efficiency, we further propose the scalable implementation RR-OPIM+, which shares the same framework with the state-of-the-art IM solution OPIM-C [46] but is redesigned carefully to ensure the correctness for CIM. Most notably, RR-OPIM+ achieves a (1/2 − ε)-approximation in near-linear running time w.r.t. the network scale.
In experiments, we first provide an empirical configuration for CIM based on incentive propagation events. Subsequently, we extensively evaluate the performance of 9 approaches on 6 real-world networks with up to 3 billion relationships. Notably, our proposals outperform all competitors by up to 39% in terms of result quality. Besides, RR-OPIM+ speeds up the greedy algorithms by at least 4 orders of magnitude. In addition, we deploy our solution RR-OPIM+ to the online gaming scenario, which improves the baseline by up to 5.39% in the corresponding evaluation metric.
To summarize, we make the following contributions in this work:
• We conduct an empirical study to verify the difference between in-game incentive propagation and IM. Motivated by these observations, we propose a new IM variant called CIM. (Section 3)
• For effectiveness, we propose two CIM solutions MG-Greedy and RR-Greedy with approximation guarantees. (Section 4)
• For efficiency, we provide a scalable implementation with rigorous theoretical analysis for these greedy algorithms. (Section 5)
• We discover the detailed settings for CIM and conduct experiments to show the superiority of our proposals. (Section 7)
• We deploy the proposal to the in-game incentive propagation event, which achieves considerable improvement. (Section 8)

PRELIMINARIES
We abstract a social network as a graph G = (V, E), where V is a set of n nodes (representing users) and E is a set of m edges (representing relationships). We assume that G is a directed graph and each edge (u, v) ∈ E indicates that v is a follower of u and can be influenced by u. We call u (resp. v) the in-neighbor (resp. out-neighbor) of v (resp. u). Furthermore, we use N(u) to denote the set of out-neighbors of u. For an undirected graph, we replace each undirected edge (u, v) with two directed ones in opposing directions, i.e., (u, v) and (v, u). In the sequel, we elaborate on the background of influence maximization (IM), followed by in-game incentive propagation scenarios.
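As a concrete illustration, the undirected-to-directed conversion above amounts to the following minimal sketch (the function name and edge-list format are ours, not from the paper):

```python
from collections import defaultdict

def to_directed_adj(undirected_edges):
    """Replace each undirected friendship {u, v} with two directed
    edges (u, v) and (v, u), returning out-neighbor adjacency lists."""
    out = defaultdict(list)
    for u, v in undirected_edges:
        out[u].append(v)   # u can influence its follower v
        out[v].append(u)   # and vice versa
    return dict(out)
```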

Diffusion Models
This work focuses on two well-accepted diffusion models, named Independent Cascade (IC) [14] and Linear Threshold (LT) [15]. Given a graph G and a set S of chosen users (called seeds), both models assume that each user v ∈ V has two possible states: inactive or active, and describe the diffusion of an item from S in a stochastic manner. Specifically, the states of the seeds are set to be active at the initial step t = 0, and at step t > 0, the newly activated users try to influence their inactive out-neighbors as follows.
• IC introduces the influence probability p(u, v) for each edge (u, v), representing the likelihood that v is successfully activated by u. At each step t > 0, each user u who was activated at step t − 1 has one chance to influence each inactive out-neighbor v with probability p(u, v).
• LT assumes that each edge (u, v) is associated with a weight w(u, v) satisfying Σ_{u ∈ N_in(v)} w(u, v) ≤ 1, where N_in(v) is the set of in-neighbors of v. At t = 0, a threshold λ_v ∈ [0, 1] is uniformly sampled for each user v. For t > 0, an inactive user v is activated if W(v, t) exceeds the threshold λ_v, where W(v, t) is the summation of w(u, v) over v's in-neighbors u that were activated before step t.
A user remains active in all subsequent steps once it is activated. The influence process continues until no more users can be activated.
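The step-by-step IC process above can be sketched as a single stochastic trial (an illustrative implementation; the `simulate_ic` name and adjacency format are our own assumptions):

```python
import random

def simulate_ic(edges, seeds, rng=random.Random(0)):
    """One trial of the Independent Cascade model.

    edges: dict mapping u -> list of (v, p_uv) out-edges.
    seeds: iterable of initially active nodes.
    Returns the set of nodes active when diffusion stops.
    """
    active = set(seeds)
    frontier = list(seeds)              # nodes activated at the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in edges.get(u, ()):
                # each newly activated u gets exactly one chance on v
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active
```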

Influence Estimation and Maximization
Given a graph G, a model M, and a constant k, the influence maximization (IM) problem [11,22,39] asks for a seed set S with cardinality |S| = k such that the influence spread σ_{G,M}(S) is maximized. Kempe et al. [22] formulate this problem and provide a (1 − 1/e)-approximate IM solution if the exact spread σ_{G,M}(S) is known for any S. Due to the #P-hardness [6] of evaluating σ_{G,M}(S), most previous solutions [3,17,18,22,25,46,47,49,50,58] employ Monte-Carlo (MC) simulation [22] or Reverse-Reachable (RR) set sampling [3] to estimate the spread, yielding a (1 − 1/e − ε)-approximate result. In what follows, we briefly review these two estimation techniques and representative solutions based thereon.
MC simulation. Given a graph G, a model M, and a node set S, the MC simulation starts from S and estimates σ_{G,M}(S) following the discrete steps of M in Section 2.1. To reduce the estimation variance, the simulation is conducted r times, and the average number of activated nodes over the r trials is recorded as the estimation of σ_{G,M}(S). Kempe et al. [22] propose a greedy solution based on MC simulations, which iteratively includes a node v into S as the i-th seed if the marginal gain σ_{G,M}(v | S) is the largest. To avoid evaluating σ_{G,M}(v | S) for O(n) nodes in each iteration, Leskovec et al. [25] exploit the submodularity and propose a practical implementation called CELF, which skips v while selecting the i-th seed if v's marginal gain was sufficiently small in an iteration prior to i.
RR set sampling. Borgs et al. [3] propose to estimate the spread by sampling random RR sets, defined as follows.
Definition 2.1 (RR Set). Given a graph G and a model M, a random RR set R is a set of nodes, generated by (i) first selecting a node v at random, (ii) then sampling a subgraph g from G in terms of M, and (iii) finally preserving the nodes that can reach v in g. We denote by R a set of random RR sets.
In Definition 2.1, the subgraph g is sampled based on the influence process of M. For IC, g is induced by removing each edge (u, v) in G with probability 1 − p(u, v). Borgs et al. [3] prove that n · Λ_R(S)/θ is an unbiased estimator of σ_{G,M}(S), where R is a set of θ random RR sets and Λ_R(S) is the coverage of S in R, i.e., the number of RR sets R ∈ R satisfying S ∩ R ≠ ∅. By this connection, Borgs et al. [3] sample a sufficient number of random RR sets as R, and employ the greedy framework in [22] to iteratively select the next node v with the largest marginal coverage. The related solutions [17,18,46,49,50] follow the greedy strategy in [3] and improve its efficiency by reducing θ while ensuring the same approximation ratio. At this front, OPIM-C [46] is the state of the art and is applied in subsequent IM solutions and variants [2,17,18].
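Under IC, the RR-set construction above can be sketched as a reverse traversal that keeps each in-edge alive with its influence probability (an illustrative sketch; the function name and edge format are our own assumptions):

```python
import random

def random_rr_set(in_edges, n, rng=random.Random(0)):
    """Sample one random RR set under the IC model.

    in_edges: dict mapping v -> list of (u, p_uv) in-edges.
    n: number of nodes, labelled 0..n-1.
    Picks a target uniformly at random, then walks reverse edges,
    keeping each edge (u, v) alive with probability p_uv; the nodes
    reached this way are those that could influence the target.
    """
    target = rng.randrange(n)
    rr, stack = {target}, [target]
    while stack:
        v = stack.pop()
        for u, p in in_edges.get(v, ()):
            if u not in rr and rng.random() < p:
                rr.add(u)
                stack.append(u)
    return rr
```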
In addition, a line of solutions [9,35] estimates the spread by sampling graph instances, and another line either leverages centrality scores [7,21] or simplifies the diffusion model [6] to generate seeds heuristically. We refer interested readers to [28] for details.
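The CELF pruning described earlier in this section (skipping a node whose stale marginal gain is already dominated) can be made concrete with a minimal lazy-forward sketch, assuming a callable spread estimator; this is our own illustrative code, not the authors' implementation:

```python
import heapq

def celf_select(nodes, k, spread):
    """Lazy-forward (CELF) greedy seed selection.

    nodes: candidate nodes; k: number of seeds to pick.
    spread: callable mapping a frozenset of seeds to an estimated spread.
    Submodularity implies a stale marginal gain upper-bounds the current
    one, so most re-evaluations can be skipped.
    """
    seeds, sigma = set(), 0.0
    # max-heap of (-gain, node, round_when_gain_was_computed)
    heap = [(-spread(frozenset([v])), v, 0) for v in nodes]
    heapq.heapify(heap)
    for i in range(1, k + 1):
        while True:
            neg_gain, v, last = heapq.heappop(heap)
            if last == i:               # gain is fresh for this round: take it
                seeds.add(v)
                sigma += -neg_gain
                break
            # otherwise recompute the marginal gain and push back
            gain = spread(frozenset(seeds | {v})) - sigma
            heapq.heappush(heap, (-gain, v, i))
    return seeds
```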

In-Game Incentive Propagation
Event procedure. The online gaming platform designs the incentive propagation event to boost user engagement, whose procedure is as follows. Given a social network G = (V, E), the service provider first selects a set of users, named active participants (APs), who are more likely to engage in this event. We denote this set of APs as A with |A| = a ≪ n. We say a user u is a passive participant (PP) if u ∈ V \ A. After A is chosen, the service provider then distributes this event to A, including the mission details, the incentive I (e.g., extra credits), and a list of PP friends (named passive seeds). I is initially possessed by A and is shared by the following two steps, where we say a user is activated if attaining I.
• Seed activation. Each AP u is an initial active user and can invite the recommended seed v, who becomes active automatically after playing with one of its AP inviters for the first time.
• Daily contamination. Starting from the activated seeds, I can be recursively shared from an active PP to an inactive PP friend if they play together for the first time during the event.
Notice that the provider can recommend at most b passive seeds for each AP, since a user has a large number of friends (from gaming and messaging platforms) but only a limited capacity to play with them. The passive seed set for each AP is defined as follows.
Existing and possible selection approaches. The linchpin of the incentive propagation event is selecting b seeds for each AP. Existing strategies can be generalized to a local framework, which independently ranks the passive friends for each AP u ∈ A in descending order based on a heuristic score, e.g., degree, PageRank, the number of historical interactions, and so on [20,29,30,32,36], and selects the top-b friends with the highest scores as seeds. To improve the number of engaged users, the idea of IM could be utilized by invoking an IM solver as mentioned in Section 2.2 and selecting b seeds from N(u) for each AP u. However, this local framework yields numerous overlapping seeds, each of which is assigned to more than one AP, rendering a compromised spread. In the meantime, although the IM solver can output an approximate seed set w.r.t. N(u) of each AP u, the quality of the overall seed set is unclear.

THE CIM PROBLEM
In this part, we first conduct an empirical study to (i) clarify the difference between IM and the in-game incentive propagation scenario (Observation 1), and (ii) verify the drawback of the local framework (Observation 2). Motivated by these observations, we then formulate capacity constrained influence maximization (CIM).

Motivating Insights
We collect user logs from an incentive propagation event of a Tencent role-playing game, which follows the procedure in Section 2.3, and call this dataset TXG. In particular, the service provider selects the monthly active users as APs, each of whom is randomly assigned one of the following strategies: PageRank, degree, or the number of historical interactions, to obtain at most b = 40 recommended friends. TXG consists of (i) a social network G with 243.4 thousand users and 5.9 million undirected friendships, (ii) 7.6 thousand APs and 3.9 thousand seeds involved in seed activation, and (iii) 18.5 thousand PPs activated in daily contamination.
In the first set of observations, we explore whether influencers tend to participate in the event through the service provider or via invitations from friends. To reduce bias during the exploration of user influence, we exclude APs and their seeds recommended by the influence-based strategies degree and PageRank. For each AP and seed, we define the one-hop spread as the number of seeds that played with the AP in seed activation and the number of PPs directly activated by the seed in daily contamination, respectively. Figure 1(a) reports the distribution of the one-hop spread of the remaining APs and their seeds, providing the insight below.
Observation 1 (Less-Influential APs). The APs of in-game incentive propagation events are less influential than their seeds.
Specifically, the average one-hop spread of seeds is 7% larger than that of APs, and the fraction of passive friends with a one-hop spread larger than 50 is 4× that of APs.
To demonstrate the existence of overlapping seeds under existing strategies, Figure 1(b) reports the distribution of the number of in-neighbor APs of each seed and the distribution of the number of APs who have invited each seed. We call a seed an overlapping seed if it has more than one in-neighbor AP. As shown in Figure 1(b), since seeds are independently recommended to each AP, 46.3% of seeds are overlapping seeds, which results in 22.1% of passive seeds being invited by more than one AP. However, the engagement likelihood of a seed is weakly related to the frequency of being invited. Specifically, for seeds invited once, twice, and three times, the fraction of engaged passive seeds is 41.5%, 45.4%, and 42.6%, respectively. This leads to the second observation.
Observation 2 (Overlapping Seeds). Due to the overlapped neighborhoods and the independent recommendation strategy, a passive seed can be invited by multiple APs; however, repeated invitations have only a slight impact on the engagement willingness of seeds.
To ensure the robustness of our findings, we expand our exploration to two additional in-game incentive propagation events, which confirm our previous observations. These observations also find support from Epinions [33], Twitter [1], and Facebook [8].

Problem Formulation
Prior to CIM, we first define the spread of passive participants. Distinct from the spread σ_{G,M}(S) in IM, the spread of PPs is measured on the subgraph G′ induced by V \ A. This is because the incentive propagation event treats APs as active users prior to seeds, and the influence paths from seeds to other PPs via any APs are invalid by the rules. Notice that the individual capacity for PPs' interactions is incorporated into M by leveraging the historical interaction logs, which will be explained in Section 7.2. To address the overlapping issue in Observation 2, we evaluate the union of the seeds of all APs, which intuitively recommends more distinct seeds and can fully leverage the recommendation list. For ease of presentation, we omit the subscripts of σ_{G′,M}(·) unless otherwise specified. We now introduce CIM in Problem 1, which is NP-hard as shown in Corollary 1. For ease of exposition, we defer all proofs to Appendix A.
Problem 1 (Capacity Constrained Influence Maximization (CIM)). Given a graph G, a model M, an AP set A, and a constant b, CIM aims to find the optimal passive seed set S*_u ⊆ N(u) \ A with |S*_u| ≤ b for each u ∈ A such that the spread σ(∪_{u∈A} S*_u) is maximized.
Besides the incentive propagation scenario of online games, CIM can be applied to more complex viral marketing scenarios, such as advertising to different cohorts of users [16]. In addition, it is worth noting that CIM only identifies influential friends for APs. The interaction inclination between an AP and an influential friend is orthogonal to this work; it can be directly applied on top of our proposal and will be explained in Section 8.

GREEDY ALGORITHMS
In this section, we first introduce the main idea of our proposal, and then propose two greedy solutions, named Maximal Gain Greedy (MG-Greedy) and Round-Robin Greedy (RR-Greedy), including the correctness and complexity analysis. We assume that each greedy solution has an oracle to access the exact spread value σ(·), and we focus on the number of times the exact spread is accessed in the complexity analysis. Regarding the estimation of σ(·) and its overhead, we defer them to Section 5.

Main Idea
Since the spread of PPs is a non-decreasing submodular function, we solve CIM by converting it to an instance of submodular maximization under a partition matroid [12], whose concept is as follows.
Different from the disjoint partitions V_1, . . . , V_t in Definition 4.1, the passive friend sets N(u) for u ∈ A are not strictly disjoint in CIM. To this end, we use the calligraphic notation C_u = {(u, v) : v ∈ N(u)} (resp. S_u = {(u, v) : v ∈ S_u}) to represent the set of distinct edges from AP u to its passive friends (resp. passive seeds). Analogously, we let C = ∪_{u∈A} C_u and S = ∪_{u∈A} S_u be the global candidate space of AP-PP edges for seed selection and the set of selected AP-seed edges, respectively. Hence, CIM turns into finding S* ∈ I such that σ(S*) is maximized, where the partition matroid is (C, I) with I = {S ⊆ C : |S ∩ C_u| ≤ b for each u ∈ A}. In what follows, we overload the spread with the calligraphic notation S, where σ(S) = σ(S) for the set S of distinct seeds in S, and σ((u, v) | S) = σ(v | S).
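The partition-matroid feasibility used above (each AP contributes at most b AP-PP edges) reduces to a simple per-AP count; a minimal sketch with our own illustrative names:

```python
def is_independent(S, capacity):
    """Partition-matroid check for CIM's edge sets.

    S: iterable of (ap, friend) edges; capacity: per-AP budget b.
    Independent iff no AP contributes more than `capacity` edges.
    """
    counts = {}
    for u, _ in S:
        counts[u] = counts.get(u, 0) + 1
        if counts[u] > capacity:
            return False
    return True
```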

MG-Greedy and RR-Greedy
MG-Greedy. We design a greedy solution MG-Greedy, extending the greedy solution of IM [22]. Intuitively, MG-Greedy selects the edge with the largest marginal gain from the global candidate space C while maintaining the partition matroid. As illustrated in Algorithm 1, at each iteration (Lines 3-5), MG-Greedy selects the edge (u, v) with the largest marginal gain σ((u, v) | S) from C, where ties are settled arbitrarily. If S_u is feasible (i.e., |S_u| < b), the selected (u, v) is then added to S; otherwise, all edges starting from u are removed from C. MG-Greedy accesses the spread function O(b · a · |C|) times, comprising O(|C|) spread invocations in each of the O(b · a) iterations. As proved by [12], MG-Greedy has the following guarantee when the spread oracles are given.
Theorem 4.2. Given a graph G, a model M, an AP set A, and a constant b, let S* be the optimal solution of CIM. The output S of MG-Greedy satisfies σ(S) ≥ (1/2) · σ(S*).
RR-Greedy. RR-Greedy instead selects seeds for the APs in a round-robin manner, which reduces the number of spread invocations by a factor of |A|. By Theorem 4.3, RR-Greedy is 1/(1+c)-approximate, which is superior to MG-Greedy. The constant c in Theorem 4.3 is also known as the curvature [13] and is further bounded by 1, recovering the 1/2-approximation in the worst case.
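The round-robin selection can be sketched as follows, assuming a spread oracle; this is our own illustrative code under stated assumptions, not the paper's Algorithm 2:

```python
def rr_greedy(candidates, b, spread):
    """Round-robin greedy seed selection for CIM (sketch).

    candidates: dict mapping each AP u -> list of its passive friends.
    b: per-AP capacity; spread: callable on a frozenset of seed nodes.
    In each of b rounds, every AP in turn picks the remaining friend
    with the largest marginal gain w.r.t. the current distinct seeds.
    """
    chosen = {u: [] for u in candidates}
    seeds = set()                      # distinct seeds selected so far
    base = 0.0
    for _ in range(b):
        for u, friends in candidates.items():
            best, best_gain = None, -1.0
            for v in friends:
                if v in chosen[u]:
                    continue
                gain = spread(frozenset(seeds | {v})) - base
                if gain > best_gain:
                    best, best_gain = v, gain
            if best is not None:
                chosen[u].append(best)
                seeds.add(best)
                base = spread(frozenset(seeds))
    return chosen, seeds
```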

SCALABLE IMPLEMENTATIONS
To estimate the spread, we propose a scalable implementation RR-OPIM+ for RR-Greedy, extending the state-of-the-art OPIM-C [46].
In the sequel, we introduce its main idea and clarify its difference from OPIM-C, followed by its implementation and analysis.

Main Idea
Given a graph G, an AP set A, and the induced subgraph G′ with n′ nodes, let R be a set of random RR sets constructed from G′. Akin to Section 2.2, a random RR set R ∈ R connects to the spread via the unbiased estimator n′ · Λ_R(S)/|R|. Hence, the objective of CIM can be pursued by finding the seeds S with the maximum coverage Λ_R(S) in R, and the greedy solutions in Section 4 can be efficiently implemented by replacing the evaluation of σ_{G′,M}(v | S) with Λ_R(v | S). In analogy to σ_{G′,M}(·), we omit the subscripts G′ and M of R and Λ_R(·) in the following contexts. Furthermore, the coverage Λ_R(S) = Λ_R(S) and the marginal coverage Λ_R(v | S) = Λ_R((u, v) | S) in the edge notation.
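The coverage Λ_R(S) that replaces the spread oracle is straightforward to compute (an illustrative sketch with our own names):

```python
def coverage(seeds, rr_sets):
    """Coverage of a seed set in a collection of RR sets: the number
    of RR sets hit by at least one seed."""
    s = set(seeds)
    return sum(1 for rr in rr_sets if s & rr)
```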
Algorithm 2: RR-Greedy
The pseudocode of RR-OPIM+ is illustrated in Algorithm 3. Initially, RR-OPIM+ defines the constants θ_max and θ_0 (Lines 1-2), representing the worst-case and the initial number of random RR sets, respectively, and then constructs two sets R_1 and R_2, both containing θ_0 random RR sets (Line 3). After that, it runs in an iterative manner to verify whether the selected seed pairs satisfy the approximation guarantee using fewer than θ_max random RR sets. At each iteration, it first invokes RR-Greedy with Λ_{R_1}(·) as the evaluation function and selects the set S of AP-seed edges (Line 5). To verify whether the selected S provides the desired approximation guarantee, it then computes the upper bound σ_u(S*) of σ(S*) (resp. the lower bound σ_l(S) of σ(S)) by using R_1 (resp. R_2) (Lines 6-7). RR-OPIM+ terminates early with the current S if σ_l(S)/σ_u(S*) ≥ 1/2 − ε; otherwise, it doubles the sizes of R_1 and R_2 and continues (Lines 8-9).
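The doubling loop just described can be sketched as follows. Note this is a heavily simplified illustration: the names are ours, and the bounds below are placeholders (the real σ_u and σ_l use concentration inequalities, as the paper details next), so it shows only the control flow, not the actual guarantee:

```python
def rr_opim_plus(sample_rr, select, coverage, n_prime, eps, theta0, theta_max):
    """Control-flow sketch of the OPIM-C-style doubling loop.

    sample_rr(t): returns a list of t fresh random RR sets.
    select(R1): runs RR-Greedy on R1, returning the selected seeds.
    coverage(S, R): coverage of the distinct seeds of S in R.
    R1 drives selection and an (illustrative) upper bound on the optimum;
    the independent R2 gives an (illustrative) lower bound on sigma(S).
    """
    r1, r2 = sample_rr(theta0), sample_rr(theta0)
    while True:
        S = select(r1)
        # placeholder bounds: greedy on a matroid covers >= 1/2 the optimum
        upper = 2.0 * n_prime * coverage(S, r1) / len(r1)   # on sigma(S*)
        lower = n_prime * coverage(S, r2) / len(r2)         # on sigma(S)
        if lower / max(upper, 1e-12) >= 0.5 - eps or len(r1) >= theta_max:
            return S
        r1 += sample_rr(len(r1))   # double both collections and retry
        r2 += sample_rr(len(r2))
```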
Although RR-OPIM+ and OPIM-C [46] share the framework shown in Algorithm 3, three challenges remain when considering CIM and RR-Greedy: (i) a new θ_max needs to be devised to ensure the approximation guarantee in the worst case; (ii) σ_u(S*) and σ_l(S) must be recomputed to secure the approximation for any early termination; and (iii) the result returned from any of the abovesaid criteria must be correct with high probability. To address these issues, we implement the following three major modifications.

Detailed Modifications
Computing θ_max. As shown in Line 1 of Algorithm 3, RR-OPIM+ first generates a random seed set S by assigning each candidate PP to an arbitrary AP friend while respecting the partition matroid I. It then records the number of distinct seeds in S, which is a lower bound of the optimal spread σ(S*) and is required for deriving θ_max. The following lemma provides the setting of θ_max, ensuring the correctness of RR-OPIM+ when |R| = θ_max.
Lemma 5.1. Let R be a set of random RR sets, with the quantities defined as in Line 1, and let S be the result obtained by Line 8 of Algorithm 3. For fixed ε and δ, if |R| ≥ θ_max, then S is (1/2 − ε)-approximate with probability at least 1 − δ/3.
Bounding σ(S*) and σ(S). We next derive the lower bound σ_l(S) of σ(S) and the upper bound σ_u(S*) of σ(S*), such that the approximation ratio σ_l(S)/σ_u(S*) can be verified. The settings are as follows.
Algorithm 3: RR-OPIM+
Accordingly, a tightened upper bound σ_u(S*) is derived, where the failure probability for the result in the worst case is at most δ/3. By the union bound, the correctness of RR-OPIM+ is as follows.
Theorem 5.4. Given a graph G, a set of APs A, a model M, and a constant b, let S* = ∪_{u∈A} S*_u be the optimal solution of CIM. For every ε > 0 and δ > 0, RR-OPIM+ yields a (1/2 − ε)-approximate output S = ∪_{u∈A} S_u with probability at least 1 − δ.
Moreover, we have the following theorem to guarantee the expected running time of RR-OPIM+, where S* is the optimal seed set and v* is the node with the largest spread in G′.

ADDITIONAL RELATED WORKS
In this part, we revisit problems that are germane to our work at first glance and distinguish them from in-game scenarios and CIM. Amid them, a line of works employs IM for link prediction. For example, active friending [53] and IM variants based on edge insertion [5,10,23] recommend people-you-may-know to increase the acceptance probability and boost the influence spread, respectively. In contrast, the in-game scenario sorts existing friends for APs, and the objective of CIM can fundamentally be treated as the ranking measure. Lu et al. [31] focus on the competitive IM problem and consider scenarios where two products are promoted simultaneously, which is beyond the scope of our study. Li et al. [27] leverage the Hawkes process to infer the diffusion, which is orthogonal to our work. Prior works [19,42] consider adaptive seeding and assume that the seed selection is a two-stage framework, which first selects a set X from a given subset of V, followed by selecting another seed set Y from the influenced neighboring nodes of X. The objective is to maximize the expected influence spread of Y under the cardinality constraint on |X| + |Y|. Distinct from adaptive seeding, in-game incentive propagation only focuses on the second stage but requires a partition matroid constraint. Self-activation influence maximization [44] introduces a concept called the self-activated user, which is similar to the AP but also fails to consider the capacity of each AP. Another related work is multi-round influence maximization [45]. It considers scenarios requiring multiple rounds of promotions and may desire to repeatedly select the same influential user, which contradicts Observation 2. Huang et al. [20] study influence maximization on online gaming platforms but only consider the summation of the influences of single seeds. In addition, targeted influence maximization [43] aims to select seeds to influence more users from a targeted subset, whereas CIM describes a reverse problem starting from a subset.

EXPERIMENTS
We first introduce the experimental settings, followed by an empirical study on TXG to explore the configurations for CIM. At last, we evaluate the performance of the proposed algorithms in terms of quality and efficiency. All experiments are conducted on a Linux machine with an Intel Xeon(R) Gold 6240 @ 2.60GHz CPU and 377GB RAM in single-thread mode. None of the experiments needs anywhere near all the memory. Due to space constraints, we refer interested readers to Appendix A for more experiments, e.g., AP selection and sensitivity analysis w.r.t. other constants. The evaluated datasets include Orkut [54] and Twitter [55], whose statistics are shown in Table 1.

Empirical Configurations for CIM
Selecting M. First, we bridge the gap between the model M (IC or LT) and the actual dissemination from the engaged seeds. To generate the diffusion among PPs, we use the co-playing logs with tuples (u, v, t) representing that an active PP u played together with the passive friend v at timestamp t_{u,v}, and clean the logs by preserving the earliest timestamp for each distinct co-playing relationship. We construct the diffusion trees [38] from the co-playing logs. In particular, we treat the seed as the root of each tree and add the directed edge (u, v) to the tree if (i) there exists an edge (w, u) on the tree satisfying t_{w,u} < t_{u,v}, and (ii) the tree is acyclic after the insertion. To capture the diffusion of co-playing behaviors, we follow prior works [22,49,50] to normalize the influence probability and weight in IC and LT by leveraging the daily co-playing times between friends. Analogously, the model-predicted diffusion can also be preprocessed into a diffusion tree. Figure 2(a) reports the RMSE between the numbers of predicted and true active PPs in each hop d, where the active users in hop d have the same shortest distance d from the seed set. We find that IC has a better RMSE on each hop, indicating that the actual diffusion is more similar to IC than to LT and motivating us to leverage IC for CIM. To explain, an inactive v acquires the incentive automatically once v plays with one of its active friends u for the first time, which resembles the process of IC.
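The tree-growing rule above can be sketched as a single time-ordered pass over the cleaned logs (an illustrative sketch; function name and log format are our own assumptions):

```python
def build_diffusion_tree(seed, coplay_logs):
    """Grow a diffusion tree from time-ordered co-play logs.

    coplay_logs: list of (u, v, t) tuples, deduplicated to the earliest
    timestamp per pair; seed is the tree root.
    An edge (u, v) is attached if u already entered the tree strictly
    before time t and v is new, which keeps the tree acyclic.
    """
    reached_at = {seed: 0.0}            # node -> time it entered the tree
    tree = []
    for u, v, t in sorted(coplay_logs, key=lambda x: x[2]):
        if u in reached_at and v not in reached_at and reached_at[u] < t:
            tree.append((u, v))
            reached_at[v] = t
    return tree
```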
Selecting b. We next explore the setting of b. Specifically, we collect each invited seed v's rank (i.e., position) in the recommendation list of its AP u. To summarize, an AP is more likely to invite seeds in the top 20 positions, where the seeds usually overlap.

Performance Evaluation
In the second set of experiments, we compare the performance of each algorithm in terms of effectiveness and efficiency. Regarding the game dataset TXG, we retain the seeds that continued to daily contamination and their preceding APs, resulting in 794 APs and 1.7 thousand seeds. These 1.7 thousand seeds are treated as candidates, and each algorithm is asked to choose 1 seed for each of the 794 APs from their respective candidates. The actual spread of a subset of candidates is the number of PPs activated when those seeds are chosen. To configure CIM on the remaining public datasets, we leverage the fraction of APs explored in Section 3.1 and uniformly sample a 5% fraction of users from V as the AP set A, i.e., a/n = 5%. After determining A, we leverage IC as M on the derived subgraph G′ and choose the constant b ranging from 2 to 20. We treat A and the corresponding b as a query set, and report the average score after repeating on 5 random query sets. We exclude an algorithm if it fails to return within 24 hours.
Effectiveness analysis. Table 2 shows the actual spread of the seeds selected by each method (the timeout one is omitted), where the proposed RR-OPIM+ outperforms the other methods by up to 9.87%. Figure 3 reports the spread of each approach on the public datasets by fixing a/n = 5% and varying b. Regarding MG-Greedy and RR-Greedy, RR-Greedy is slightly better than MG-Greedy, and both greedy solutions are superior to the other solutions on DNC. Notably, the seeds of RR-Greedy infect 26.8% more PPs than those of Degree when b = 10. Regarding the scalable solutions, we find that their seed qualities are comparable to the greedy solutions and better than the local competitors. For instance, RR-OPIM+ can improve the local heuristics by up to 39% on Orkut. Regarding the local competitors, they are inferior to all greedy solutions and their scalable versions due to the overlapping of seeds mentioned in Section 2.2. In particular, we report the number of distinct seeds returned by different solutions in Figure 4, where the seed size of RR-OPIM+ is at least 3× larger than that of Degree. Furthermore, we observe that OPIM-C and IMM have better results than the rest of the local solutions on DNC, showing the usefulness of approximation in the local framework. We also report the spread by fixing b = 10 and varying a/n ∈ {5, 10, 20, 50}‰. As shown in Figure 5, the results under different a settings have the same tendency as those under different b settings.
Efficiency analysis. We next compare the running time of each solution. Here, we ignore the running time of Degree, which can be recorded while reading the graph. As illustrated in Figure 6 and Figure 7, the scalable solution RR-OPIM+ outperforms the other solutions in all cases. Most notably, RR-OPIM+ improves RR-OPIM and MG-OPIM by two orders of magnitude on Twitch, which signifies the superiority of the proposed tightened bound. Furthermore, we find that RR-OPIM costs less time than MG-OPIM, which spends more time in seed selection and requires more iterations. For instance, MG-OPIM is about 2× slower than RR-OPIM on Orkut when b ≥ 10. For the greedy solutions, due to the inefficiency of MC simulations, both are only feasible on DNC and fail on the rest. Notice that the running time of MG-Greedy is slightly better than that of RR-Greedy, which might be caused by the pruning procedure in CELF. We also evaluate the running time on a smaller co-authorship graph with 1.5 million nodes and 2.7 million edges [24].

DEPLOYMENTS
We have deployed RR-OPIM+ in an incentive propagation event of Tencent's battle royale game X with 88.2 million quarterly active users and 3.2 billion relationships, following the procedure described in Section 2.3. This deployment is conducted on an in-house cluster consisting of hundreds of machines, each with 16GB of memory and 12 Intel Xeon E5-2670 CPU cores.
In the friendship network of X, the weight of each edge is described by an intimacy score, which records the number of historical interactions from one user to the other, e.g., co-playing, gifting, and so on. We implement our proposal by first invoking RR-OPIM+ to select b passive seeds for each AP and then ranking each AP's seeds in descending order of their intimacy with that AP. This ensures the interaction willingness between APs and the selected seeds. For the sake of fairness, we compare the performance of our proposal with the strategy Intimacy, which directly ranks friends by intimacy scores and is widely adopted for in-game friend ranking [57]. Each approach is initially computed on a subgraph instance ahead of the event and is then updated daily using the latest graph snapshot. The APs are selected according to the active week, yielding 373.46 thousand and 382.52 thousand APs for our proposal and Intimacy, respectively. Due to the network effect, we follow [41] and partition all users into communities with high connectivity and feature similarity. We then conduct online A/B testing that randomly assigns the live traffic in the same community to the treatment (i.e., ours) or control (i.e., Intimacy) group. After the event ended, we evaluated the performance based on the total spread, i.e., the number of APs and seeds engaged in seed activation, plus the PPs activated in daily contamination, which is 60.69 thousand for the treatment group and 58.28 thousand for the control group. This improvement is statistically significant, and we refer interested readers to Appendix A.
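For illustration, the post-processing step above (grouping the selected AP-seed pairs and ordering each AP's seeds by intimacy) can be sketched as follows. The function and data names are hypothetical, and the RR-OPIM+ selection step itself is abstracted away as an input list of AP-seed pairs.

```python
from collections import defaultdict

def rank_seeds_by_intimacy(selected, intimacy):
    """Group AP-seed pairs by AP and order each AP's seeds by
    descending intimacy score with that AP.

    selected: iterable of (ap, seed) pairs from the seed-selection step
    intimacy: dict mapping (ap, seed) -> historical interaction count
    """
    by_ap = defaultdict(list)
    for ap, seed in selected:
        by_ap[ap].append(seed)
    return {
        ap: sorted(seeds, key=lambda s: intimacy.get((ap, s), 0), reverse=True)
        for ap, seeds in by_ap.items()
    }
```

In the deployment, the resulting per-AP ranking determines the order in which candidate friends are shown to each AP.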
We also evaluate the playing hours of the total spread and break them down by role, i.e., AP, seed, and PP. As shown in Figure 9(a), the treatment group yields more playing hours for all user roles. Notably, the treatment group improves over the control group by 2.33%, 5.39%, and 4.56% for APs, seeds, and PPs, respectively. In addition, we measure the distribution of each active user's playing hours. Figure 9(b) reports the distribution of playing hours ranging from 1 to 10. We can observe that (i) for both the treatment and control groups, the number of active users decreases as the playing hours increase, and (ii) our proposal attracts more users than the baseline at all playing hours. It is worth noting that a similar result holds for the remaining hours.

CONCLUSIONS
Motivated by in-game insights, we present CIM and offer two greedy solutions, MG-Greedy and RR-Greedy. For the sake of scalability, we further design an approximate algorithm RR-OPIM+ with near-linear running time. We conduct extensive experiments to demonstrate the superiority of our proposal in terms of effectiveness and efficiency. In addition, we deploy RR-OPIM+ in an in-game incentive propagation scenario, achieving considerable improvement. As future directions, we aim to improve the approximation ratio of our proposal and to consider the dynamic setting.

ETHICS STATEMENT
While the proposed CIM problem and its solutions may boost user engagement and revenue for online platforms, they also pose privacy risks due to the collection of user data and may expose users to negative side effects, such as addiction. As researchers, we recognize the importance of balancing the potential benefits of studying these algorithms against the potential risks to user well-being. To ensure privacy and confidentiality, we strictly anonymized the data used in our study and had no access to detailed user profiles. Additionally, we followed the ethical guidelines [37] set forth by Tencent Inc. in conducting this research.

A APPENDIX
A.1 Proofs
Proof of Corollary 1. For any graph G = (V, E), we first construct an extended graph G′ by adding a node s to G and directed edges ⟨s, v⟩ for all v ∈ V. Notice that the IM problem on G is NP-hard [22] and is a special case of the CIM problem on G′ with A = {s}. Hence, the CIM problem is also NP-hard. □
In what follows, we mainly utilize two martingale-based concentration bounds from [49], which intuitively describe how close (n/θ)·Λ_R(·) is to σ(·) for a given number of random RR sets.
Lemma A.1 ([49]). Given a graph G with n nodes, a set R of random RR sets with |R| = θ, and a seed set S, for any λ > 0, Eqs. (7) and (8) hold.
Proof of Lemma 5.1. Recall that (n/θ)·Λ_R(S) is an unbiased estimator of σ(S), where |R| = θ. Hence, a large enough θ makes this estimator close to σ(S). This connection is shown by the following lemma.
Lemma A.2 ([49]). Given a graph G with n nodes and constants ε, δ₁, combining Corollary 2 and Lemma A.2, we can derive that the following Eq. (9) holds with probability at least 1 − δ₁ if θ ≥ θ₁: σ(S) ≥ (1/2 − ε)·σ(S*).
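For concreteness, the two concentration bounds referenced as Eqs. (7) and (8) are reproduced below in the standard form they take in the RR-set literature (λ > 0 is the deviation parameter; the exact constants of [49] are assumed here, not quoted from the source):

```latex
\Pr\!\left[\frac{n}{\theta}\,\Lambda_{\mathcal{R}}(S) - \sigma(S) \ge \lambda\,\sigma(S)\right]
\le \exp\!\left(-\frac{\lambda^2}{2 + \tfrac{2}{3}\lambda}\cdot\frac{\theta\,\sigma(S)}{n}\right), \tag{7}
```

```latex
\Pr\!\left[\frac{n}{\theta}\,\Lambda_{\mathcal{R}}(S) - \sigma(S) \le -\lambda\,\sigma(S)\right]
\le \exp\!\left(-\frac{\lambda^2}{2}\cdot\frac{\theta\,\sigma(S)}{n}\right). \tag{8}
```

Both bounds shrink exponentially in θ, which is what allows the sample size to be tuned so that the estimator is accurate with any desired probability.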
We now need the following lemma to connect σ(S) with σ(S*).
Since σ_l(S) ≤ σ(S*) and θ ≥ max(θ₁, θ₂) by definition, this lemma holds with probability at least 1 − δ/3 if θ > θ_max, which completes the proof. □
Proof of Lemma A.3. Given a graph G with n nodes and any S′ ∈ I, we say S′ is bad if σ(S′) < (1/2 − ε)·σ(S*). Based on Eq. (7), the coverage of a bad S′ can be bounded. Since there is a bounded number of possible S′ in I, by the union bound and Eq. (9), we obtain that, for any S′ returned by Algorithm 3, the bad event occurs with small probability. The proof of the lower bound σ_l(S) can be found in Lemma 4.2 of [46]; we show it here to make the paper self-contained. Specifically, the bound follows by Eq. (7) in Lemma A.1, where the first two inequalities are derived from the monotonicity and submodularity of Λ_R(·).
Combining Corollary 3 with Wald's equation [52] and the fact that the expected time of generating one RR set is bounded in terms of v*, the node with the largest spread in G [50], we can derive the time complexity for RR set generation. Regarding the rest of the subroutines in each iteration, RR-Greedy costs Σ_{R∈R₁} |R ∩ C| time to scan all nodes in R. Moreover, RR-Greedy requires additional O(b·|C|) time to compute σ_u(S*); specifically, it takes time to choose the nodes in Φ_b(S, u) for each u ∈ A. Akin to the seed selection in RR-Greedy, the overhead for computing σ_l(S) is bounded via Wald's equation [52]. Note that the number of iterations is bounded. Furthermore, let R₁ and R₂ be the sets of RR sets generated in the last iteration; then the total number of RR sets generated is no more than twice |R₁ ∪ R₂|. Therefore, the time complexity of the rest of the subroutines in all iterations follows, where the expectation is taken over a node selected uniformly at random from V.
Based on Lemma A.4 and the aforementioned Corollary 3, the overall complexity of the rest of the subroutines follows, and combining it with Eq. (10) completes the proof. □
Table 3 shows the values and the approximation ratio as b increases on DNC, as mentioned in Section 4.2.
We conduct 20,000 MC simulations for spread estimation, with relative errors small enough for the estimates to be treated as ground truth. As b increases, the candidate space C is enlarged and the approximation ratio of RR-Greedy approaches 1/2.
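For illustration, spread estimation by MC simulation under the IC model can be sketched as follows. This is a minimal version with a toy adjacency format, not the optimized implementation used in the experiments:

```python
import random

def ic_spread(graph, seeds, rounds=20_000, seed=0):
    """Monte-Carlo estimate of the expected spread of `seeds` under
    the Independent Cascade (IC) model.

    graph: dict mapping node -> list of (neighbor, activation_prob)
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(rounds):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:  # simulate one diffusion cascade
            nxt = []
            for u in frontier:
                for v, p in graph.get(u, []):
                    # each newly active node gets one chance per neighbor
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / rounds
```

Averaging over many rounds drives the relative error down, which is why 20,000 simulations can serve as a ground-truth proxy on a small graph like DNC.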
Selecting A. We focus on understanding the correlation between a user's features and AP engagement, which might be helpful for generating APs. Specifically, we employ a general feature called the active week: the active week w of a user u indicates that u has kept playing the game in each of the latest w weeks.

We calculated the engagement probability in the treatment group (p_t) as n_t/N_t = 0.529%. This demonstrates an 8.33% improvement in the treatment group compared to the control group. To assess the statistical significance of this improvement, we conducted a Z-test, a widely applied hypothesis test for discerning significant differences between two population means, especially for large populations and in A/B-testing contexts for recommendation strategies [56]. We obtained a Z-score of 13.825 using the two-proportion formula Z = (p_t − p_c) / √(p_t(1 − p_t)/N_t + p_c(1 − p_c)/N_c). The corresponding two-tailed p-value is less than 0.01, signifying that the improvement is statistically significant.
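The reported Z-score can be reproduced from the published counts with the unpooled two-proportion formula (a minimal sketch; the variable names are ours):

```python
from math import sqrt

# Reported A/B counts: N = group size, n = engaged users
# (APs, seeds, and activated PPs).
N_c, n_c = 11_927_377, 58_280   # control group (Intimacy)
N_t, n_t = 11_465_639, 60_691   # treatment group (RR-OPIM+)

p_c, p_t = n_c / N_c, n_t / N_t   # engagement probabilities
# Unpooled standard error of the difference in proportions.
se = sqrt(p_c * (1 - p_c) / N_c + p_t * (1 - p_t) / N_t)
z = (p_t - p_c) / se              # two-proportion Z-score, ≈ 13.825
```

With a Z-score this large, the two-tailed p-value is far below 0.01, matching the significance claim in the text.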

Figure 1: The distribution of one-hop spread of APs and their seeds and the number of in-neighbor APs of seeds on TXG.

Definition 3.1 (Spread of Passive Participants). Given a graph G = (V, E), a model M, and an AP set A, let G_P = (V_P, E_P) be the graph of passive participants, where V_P = V \ A and E_P = {⟨u, v⟩ ∈ E : u ∉ A, v ∉ A}. For any node set S = ∪_{u∈A} S_u ⊆ V_P, the spread of passive participants σ_{G_P,M}(S) is defined as the expected number of nodes in G_P activated by S under M.
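A minimal sketch of deriving the passive-participant graph of Definition 3.1 (the data layout is an assumption): every AP and every edge incident to an AP is dropped.

```python
def pp_subgraph(nodes, edges, aps):
    """Build the graph of passive participants: remove every AP and
    every (directed) edge incident to an AP.

    nodes: set of node ids
    edges: set of (u, v) pairs
    aps:   the AP set A
    """
    v_p = nodes - aps
    e_p = {(u, v) for u, v in edges if u not in aps and v not in aps}
    return v_p, e_p
```

The spread of passive participants is then evaluated on this subgraph rather than on the original network.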
we have Pr[σ(S) < σ_l(S)] < δ_l and Pr[σ(S*) > σ_u(S)] < δ_u. In Lemma 5.2, the derivation of σ_u(S*) requires an upper bound on the coverage Λ_{R₁}(S*) of the unknown optimal set S*. To this end, we need the following corollary of Theorem 4.3, since Λ_R(·) is also non-decreasing and submodular.
Corollary 2. Let S* be the optimal solution of CIM. RR-Greedy with Λ_R(·) outputs an S with Λ_R(S) ≥ 1/(1+1)·Λ_R(S*) = (1/2)·Λ_R(S*).
Accordingly, Lemma 5.2 employs 2·Λ_{R₁}(S) as a vanilla upper bound of Λ_{R₁}(S*), which might be loose in practice and motivates us to design a tightened upper bound of Λ_{R₁}(S*) as follows.
Lemma 5.3. For any seed set S under the partition matroid, i.e., |S_u| ≤ b for all u ∈ A, and any set R of random RR sets, Λ_R(S*) ≤ Λ̂_R(S*) = Λ_R(S) + Σ_{u∈A} Σ_{⟨u,v⟩∈Φ_b(S,u)} Λ_R(⟨u, v⟩ | S), where Φ_b(S, u) denotes the set of at most b AP-seed pairs in C_u with the b largest coverage gains on R w.r.t. S.

Figure 2: The RMSE of estimating spreads in each hop, and the distribution of the rank of all invited passive seeds on TXG.

Figure 2(b) reports the distribution of the ranks of all invited seeds. Notably, APs prefer inviting the seeds in the top ranks. For instance, 80.1% of seeds are invited when they rank in the top 20. We also analyze the rank distribution w.r.t. the overlapped and invited seeds, which is similar to Figure 2(b), indicating that the overlapping phenomenon is ubiquitous in the top positions.

Figure 8 reports the time-spread curve by fixing b = 10 and |A|/|V| = 5% and varying ε ∈ {0.1, 0.05, 0.02, 0.01}. The results are sorted in descending order of ε; RR-OPIM and MG-OPIM time out when ε = 0.01 in Figure 8(b), and they have only one point when ε = 0.1 in Figure 8(c). As shown, the running time and spread of RR-OPIM+ are fast and robust as ε varies. In contrast, the running time of RR-OPIM and MG-OPIM is more sensitive to ε. In particular, as ε is halved, the running time of RR-OPIM and MG-OPIM increases by 2×-10×.

Figure 8: Running time vs. spread by varying ε.

Figure 9: The playing hours of active users in each role and the distribution of playing hours of each active user on X.

Figure 12: Running time vs. spread by varying ε.
Notice that MG-Greedy yields a 1/2-approximate output by iteratively selecting only one AP-seed pair after evaluating the marginal gains of O(|C|) pairs. To improve the result quality and reduce the number of invocations of the spread function, we propose a greedy algorithm called RR-Greedy, which still ensures an at least 1/2-approximate result. The main idea is to select, for each AP u, the best edge from its local candidate space C_u while still guaranteeing a spread of at least (1/2)·σ(S*).
Algorithm 1: MG-Greedy(G, M, A, b)
1 S ← ∅; C ← ∪_{u∈A} C_u; c_u ← 0 for all u ∈ A;
2 while C \ S ≠ ∅ do
3   ⟨u, v⟩ ← arg max_{⟨u′,v′⟩∈C\S} σ(⟨u′, v′⟩ | S);
4   S ← S ∪ {⟨u, v⟩}; c_u ← c_u + 1;
5   if c_u ≥ b then C ← C \ C_u;
6 return S;
RR-Greedy. Let U ⊆ A be the set of APs that have each selected fewer than b seeds. As illustrated in Algorithm 2, at each iteration, RR-Greedy selects the edge ⟨u, v⟩ with the largest marginal gain σ(⟨u, v⟩ | S) from C_u for each u ∈ U (Lines 3-4), where ties are settled arbitrarily. RR-Greedy adds the selected ⟨u, v⟩ to S, and removes u from U if it has enough seeds or no more candidates exist (Lines 5-6). RR-Greedy accesses the spread oracle O(b·|C|) times, improving over MG-Greedy, and its correctness is as follows.
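For illustration, the round-robin selection idea of RR-Greedy can be sketched over a generic marginal-gain oracle. Names and data layout are assumptions, not the paper's implementation; the `gain` callable stands in for the marginal spread σ(⟨u, v⟩ | S):

```python
def rr_greedy(candidates, gain, b):
    """Round-robin greedy under the partition matroid |S_u| <= b.

    candidates: dict mapping AP u -> list of candidate seeds
    gain: function gain(u, v, S) -> marginal value of adding the
          AP-seed pair (u, v) to the current solution S
    """
    S = []
    remaining = {u: list(vs) for u, vs in candidates.items() if vs}
    counts = {u: 0 for u in remaining}
    while remaining:
        # one pass: each unfinished AP picks its current best candidate
        for u in list(remaining):
            pool = remaining[u]
            best = max(pool, key=lambda v: gain(u, v, S))
            S.append((u, best))
            pool.remove(best)
            counts[u] += 1
            # drop AP once it has b seeds or no candidates remain
            if counts[u] >= b or not pool:
                del remaining[u]
    return S
```

Because each pass serves every unfinished AP once, the oracle is queried far fewer times than in the fully sequential MG-Greedy variant.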

Table 2: The actual spread on TXG.
Lemma A.3. Given a graph G with n nodes and constants ε, δ₂, if θ ≥ θ₂, then Pr[σ(S) ≥ (1/2 − ε)·σ(S*)] ≥ 1 − δ₂.
According to Lemmas A.2-A.3 and the union bound, if θ ≥ max(θ₁, θ₂), the guarantee holds with probability at least 1 − δ₁ − δ₂. By setting δ₁ and δ₂ accordingly, σ(S) < (1/2 − ε)·σ(S*) holds with probability at most δ₂, which completes the proof. □
Proof of Lemma 5.2. Regarding the upper bound σ_u(S*), in terms of Corollary 2 and Eq. (8) in Lemma A.1, we obtain bounds on Λ_{R₁}(S*) and Λ_{R₁}(S), respectively. Since for each u ∈ A, the number of pairs ⟨u, v⟩ ∈ S*_u \ S_u does not exceed b, the above inequality can be further tightened via Λ̂_{R₁}(S*) in Lemma 5.3. □
Proof of Theorem 5.5. Recall that RR-OPIM+ consists of (i) RR set generation and (ii) node selection by RR-Greedy, with the computation of σ_u(S*) and σ_l(S) in each iteration. In what follows, we analyze the time complexity of each subroutine across all iterations. Regarding RR set generation, we follow Lemma 6.2 in [46] and provide the following corollary about the expected number of generated RR sets.
Corollary 3. With ε ≤ 1/2, RR-OPIM+ generates an expected number of O((b·|A|·ln |C| + ln(1/δ))·ε⁻²·n/σ(S*)) RR sets.

Table 3: The values and approximation ratio with different b on DNC.

Table 4: The fraction of engaged APs in the same active week.
Table 4 reports the fraction of engaged APs over APs with the same active week, in which APs with larger active weeks are more likely to join this event. Specifically, the Pearson correlation coefficient between the active week and AP engagement is 0.792, indicating that these two features are strongly positively correlated. Therefore, A can be synthesized if users' historical activeness exists.
Performance evaluation on DNC and Blog. Figures 10-12 show the results on DNC and Blog in terms of result quality, running time, and the error constant ε. The results share the same tendency as those on Twitch, Orkut, and Twitter, illustrating that RR-OPIM+ outperforms the other solutions as ε varies. In addition, we conduct a sensitivity analysis w.r.t. the influence probability in the IC model. Specifically, we evaluate the performance of our proposed RR-OPIM+ and its two variants (RR-OPIM and MG-OPIM) by setting a uniform probability p_{u,v} on each edge and varying p_{u,v} ∈ {0.01, 0.04, 0.07, 0.1} while keeping other parameters at default values (ε = 0.1, b = 10, |A|/|V| = 5%). Figure 13 reports the time-spread curve on DNC and Blog, showing that the spread increases as p_{u,v} increases. The reason is that a larger p_{u,v} allows more nodes to be activated, resulting in more spread. In terms of running time, as p_{u,v} increases, more nodes are included in each RR set; however, fewer RR sets are needed to achieve the same approximation ratio. Consequently, constructing a single RR set may take more time, but the overall running time could be reduced.
Significance analysis. As explained, prior to the event, we first partitioned the social network into clusters to mitigate network interference, and then randomly assigned clusters of users to the control and treatment groups, resulting in 11,927,377 control-group users (N_c) and 11,465,639 treatment-group users (N_t). Based on each group's total population, we selected the top 3% of users as APs according to their historical activeness and conducted the A/B
test.Upon the event's conclusion, 58,280 users in the control group (  ) and 60,691 users in the treatment group (  ), comprising APs, seeds, and other activated PPs, participated and earned shared incentives.Using these statistics, we calculated the engagement probability for users in the control group (  ) as   /  = 0.489%