December 7-10, 2013, Dallas, Texas

IEEE ICDM 2013
UBLF: An Upper Bound Based Approach to Discover Influential Nodes in Social Networks
Authors: C. Zhou, P. Zhang, J. Guo, X. Zhu, L. Guo
Presenter: Peng Zhang, Chinese Academy of Sciences

Content: Background, Problem Formulation, Related work, Our solution, Experiments, Conclusion

Background
Social networks are widely used.
Influence propagation: viral marketing, information dissemination, technology/idea transfer.
Research topics: influence propagation, influence maximization, community detection, influence inference, early warning of public opinion, link prediction/friend recommendation, partner recommendation/social cooperation/team formation.
Social networks have become widely used in recent years and are among the most effective channels for propagating information and for marketing and advertising products. More and more people use social networks to share their ideas and opinions and to advertise their products. In these applications, the key problem is to find the most influential nodes in the network. If we can find them, the influence of our ideas, opinions, and products can be maximized.

Problem Formulation
Given a directed social graph G=(V,E), a budget k, and a stochastic propagation model M, find k nodes such that the expected spread of influence is maximized [Kempe et al., KDD’03].
Challenges: How to measure the objective function σ_M(S)? How to find the optimal solution, i.e., the subset of the k most influential nodes?
Discovering influential nodes in social networks has been studied for a long time. Ten years ago, Kempe et al. first formulated the problem as a combinatorial optimization problem. Given a graph G, a propagation model M, and a seed-set size k, they define the problem as maximizing the information spread, denoted σ_M(S), subject to S ⊆ V and |S| = k. This formulation poses two challenges. First, σ_M(S) is very hard to estimate: given a seed set S, the propagation is hard to track, especially when the graph is large. Second, finding the optimal subset of size k is also hard: the problem has been proved NP-hard by a reduction from the set cover problem.

Problem Formulation: how to measure the influence σ_M(S)?
Stochastic propagation models: IC model, LT model; other propagation models, e.g., continuous-time IC or LT models.
Monte Carlo (MC) simulation: exact calculation under IC and LT is #P-hard (Chen et al., KDD’10).
[Figure: IC propagation model on an example graph whose edges carry propagation probabilities from .1 to .4]
Now let’s look at the first challenge. To measure the objective function, we first need to select a propagation model M. Two popular stochastic influence propagation models are the Independent Cascade (IC) and Linear Threshold (LT) models, and there are many others. It is hard to say that any one model always beats the others, so in this paper we choose the IC model as the basic model. In the IC model, at any time step a user is represented as a binary variable with either active or inactive status, and influence propagates until no more users can become active. When an inactive user becomes active at time step t, it has exactly one chance to independently activate each of its currently inactive neighbors at the next time step t + 1. For an example, see the demo on the right. Chen et al. showed that computing σ_M(S) is #P-hard by a reduction from the problem of counting s-t connectivity in a graph. After choosing the model, we use Monte Carlo simulation to estimate the objective function, because the influence is very hard to calculate exactly. In the figure, a is the source and c is the target node. There are many paths from a to c, and when the graph is very large the number of paths is so large that computing the influence exactly is infeasible.
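The Monte Carlo estimate described in these notes can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the adjacency-dict representation, and the toy graph with its probabilities are all assumptions made for the example.

```python
import random

def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade and return the set of activated nodes.

    graph maps each node to a dict {neighbor: propagation probability}.
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes that became active in the previous step
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, {}).items():
                # u gets exactly one chance to activate each inactive neighbor
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

def estimate_spread(graph, seeds, runs=10000, seed=0):
    """Monte Carlo estimate of the expected number of activated nodes."""
    rng = random.Random(seed)
    total = sum(len(simulate_ic(graph, seeds, rng)) for _ in range(runs))
    return total / runs

# Hypothetical toy graph (not the one drawn on the slide).
toy_graph = {"a": {"b": 0.5, "c": 0.2}, "b": {"c": 0.5}, "c": {}}
print(estimate_spread(toy_graph, ["a"]))
```

On this toy graph the exact expected spread from a is 1 + 0.5 + (1 − 0.8 × 0.75) = 1.9, so the estimate should land close to that value. Each additional path into a node adds another term to such a calculation, which is why simulation is used on large graphs.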

Greedy Algorithm
How to find a subset of k nodes containing the most influential nodes? Influence maximization under both the IC and LT models is NP-hard (Kempe et al., KDD’03).
Property 1: σ_M(S) is monotone: σ_M(S) ≤ σ_M(T) for all S ⊆ T ⊆ V.
Property 2: σ_M(S) is submodular: σ_M(S ∪ {v}) − σ_M(S) ≥ σ_M(T ∪ {v}) − σ_M(T) for all S ⊆ T ⊆ V and v ∈ V \ T.
Because the set cover problem can be reduced to influence maximization, the problem is NP-hard. Fortunately, the spread function σ_M(S) is monotone and submodular. Intuitively, submodularity means that σ_M(S) has diminishing marginal returns as more nodes are added to the set.

Greedy Algorithm
Advantage: performance guarantee of 1 − 1/e ≈ 63%.
Disadvantage: heavy computation cost. Inner loop: σ_M(S) needs many Monte Carlo simulations. Outer loop: time complexity of O(Nk), where N is the network size.
Exploiting these two properties, Kempe et al. presented a simple greedy algorithm that repeatedly chooses the node with the maximum marginal gain and adds it to the seed set until the budget k is reached. The advantage is that the greedy algorithm approximates the optimal solution within a factor of 1 − 1/e. Unfortunately, the simple greedy algorithm suffers from two major sources of inefficiency. (I) The MC simulations must be run sufficiently many times (typically 10,000) to obtain an accurate estimate of the spread, which is computationally expensive, especially on large networks. (II) The greedy algorithm requires O(kN) spread estimations, where k is the size of the seed set and N is the number of nodes. When N is large, the efficiency of the algorithm is unsatisfactory.
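The greedy loop described above can be sketched as follows. To keep the example deterministic, the spread function here is a hypothetical coverage count standing in for the Monte Carlo estimate; the names greedy_seeds, REACH, and coverage are assumptions made for illustration.

```python
def greedy_seeds(nodes, spread, k):
    """Pick k seeds by repeatedly adding the node with the largest marginal gain.

    spread is any function mapping a list of seeds to an (estimated) spread.
    Returns the chosen seeds and the number of spread evaluations performed.
    """
    seeds, current, calls = [], 0.0, 0
    for _ in range(min(k, len(nodes))):
        best_node, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = spread(seeds + [v]) - current  # marginal gain of adding v
            calls += 1
            if gain > best_gain:
                best_node, best_gain = v, gain
        seeds.append(best_node)
        current += best_gain
    return seeds, calls

# Hypothetical stand-in for the spread: number of nodes reached by the seeds.
REACH = {"a": {"a", "b", "c"}, "b": {"b", "c"}, "c": {"c"},
         "d": {"d", "e"}, "e": {"e"}}

def coverage(seeds):
    return float(len(set().union(*(REACH[s] for s in seeds))))

print(greedy_seeds(list(REACH), coverage, 2))
```

With five nodes and k = 2 the loop evaluates the spread 5 + 4 = 9 times. Every one of those evaluations would be a full batch of Monte Carlo simulations in the real setting, which is the O(kN) cost named on the slide.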

Improvement direction (I): Heuristic algorithms
ShortestPath: Kimura and Saito (PKDD’06), “Tractable models for information diffusion in social networks”
DegreeDiscount: Chen et al. (KDD’09), “Efficient influence maximization in social networks”
MIA: Chen et al. (KDD’10), “Scalable influence maximization for prevalent viral marketing in large-scale social networks”
DAG: Chen et al. (ICDM’10), “Scalable influence maximization in social networks under the linear threshold model”
SIMPATH: Goyal et al. (ICDM’11), “SIMPATH: An Efficient Algorithm for Influence Maximization under the Linear Threshold Model”
[Figures: ShortestPath (shortest path from a to c); DegreeDiscount (node 2’s degree will shrink)]
Advantage: faster than the greedy algorithm. Disadvantage: no performance guarantee.
To address these limitations, many heuristic solutions have been proposed to improve the efficiency of seed selection, e.g., ShortestPath, DegreeDiscount, MIA, DAG, and SIMPATH. These heuristics can reduce the computational cost by orders of magnitude while achieving competitive influence spread. However, none of them offers a theoretical guarantee on the quality of the results. In other words, it is unknown how closely these heuristic solutions approximate the optimal solution. One can only use the simple greedy algorithm as the benchmark for performance testing.

Improvement direction (II): Advanced greedy
Advanced greedy algorithms
CELF: Leskovec et al. (KDD’07), “Cost-effective outbreak detection in networks”
CELF++: Goyal et al. (WWW’11), “CELF++: optimizing the greedy algorithm for influence maximization in social networks”
[Figure: rewards of nodes a to e under the greedy algorithm]
To accelerate the original greedy algorithm, Leskovec et al. exploited the submodularity of the objective function and proposed the Cost-Effective Lazy Forward selection (CELF) algorithm. The algorithm significantly reduces the number of MC simulation calls made by the simple greedy algorithm. The principle is that the marginal gain of a node in the current iteration cannot exceed its marginal gain in previous iterations, so many spread estimation calls can be pruned. Leskovec et al. [16] reported that CELF improves the running time of the simple greedy algorithm by up to 700 times.
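The lazy-forward idea can be sketched with a priority queue: a node's old marginal gain is an upper bound on its current one, so it only needs re-evaluating when it reaches the top of the queue. This is a minimal sketch under assumptions, not the published CELF code; celf_seeds, REACH, and coverage are hypothetical names, and coverage is a deterministic stand-in for the Monte Carlo spread estimate.

```python
import heapq

def celf_seeds(nodes, spread, k):
    """Lazy-forward greedy: re-evaluate a node only when it tops the queue."""
    calls = 0
    heap = []  # entries: (-gain, node, size of seed set when gain was computed)
    for v in nodes:
        heap.append((-spread([v]), v, 0))  # initialization: one estimate per node
        calls += 1
    heapq.heapify(heap)

    seeds, current = [], 0.0
    while len(seeds) < k and heap:
        neg_gain, v, stamp = heapq.heappop(heap)
        if stamp == len(seeds):
            # The gain is up to date and beats every other (stale) upper bound.
            seeds.append(v)
            current += -neg_gain
        else:
            gain = spread(seeds + [v]) - current
            calls += 1
            heapq.heappush(heap, (-gain, v, len(seeds)))
    return seeds, calls

# Hypothetical stand-in for the spread: number of nodes reached by the seeds.
REACH = {"a": {"a", "b", "c"}, "b": {"b", "c"}, "c": {"c"},
         "d": {"d", "e"}, "e": {"e"}}

def coverage(seeds):
    return float(len(set().union(*(REACH[s] for s in seeds))))

print(celf_seeds(list(REACH), coverage, 2))
```

On this toy instance the sketch selects the same two seeds as exhaustive greedy but needs 7 spread evaluations instead of 9. Five of those 7 are spent in the initialization pass over all nodes.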


Advanced greedy algorithms
[Figure: rewards of nodes a to e under the greedy algorithm and under the CELF algorithm]
Advantage: by maintaining upper bounds on the marginal gains, CELF reduces the number of Monte Carlo calls and improves on the greedy algorithm by up to 700 times.
Disadvantage: CELF needs N Monte Carlo spread estimations to initialize the upper bounds, where N is the network size.

Our work
Motivation: can we initialize the upper bounds without actually running the MC simulations?
[Tables: CELF algorithm vs. UBLF algorithm, with columns Node, upper bound, MC. First table: a 2.1 1; b 1.5; c 1.1; d 1.8; e 1.2. Second table: a 2.3; b 1.7; c 1.2; d 1.8; e.]
Although CELF significantly improves on the simple greedy algorithm, it is still quite slow on large networks. In particular, in the initialization step CELF needs to estimate the spread of every node by Monte Carlo simulation, resulting in N Monte Carlo calls (N is the total number of nodes in the network), which is time-consuming when the network is very large. This limitation leads to a rather fundamental question: can we derive an upper bound on the spread that can be used to prune unnecessary spread estimations (Monte Carlo calls) in the CELF algorithm? To the best of our knowledge, no work in the literature mathematically discusses the upper bound properties of the spread function.

The upper bound of M(S)
Local view Global view Proposition 1 reveals that we can treat the global influence measure σI (S) as a summation of local propagation probabilities. Namely we can break up the whole into parts. Proposition 2 clearly identifies the ordering relationship between two adjacent elements in the series of propagation probabilities. Actually this provides a basic for assembling the parts into a whole. How many heads? Proposition 2 establishes a relationship among the activation probabilties in time t and t+1.

σ_I(S) is bounded by the sum of a series. Under what condition does the series converge, and what is its limit?
[Figure callouts: “Too hard! Its area? But we know its upper bound!”]
Theorem 1 tells us that we can compute an upper bound on the spread function σ_I(S), although its exact value is #P-hard to compute. Note that the bound applies to the expected diffusion range: the expected value is less than the upper bound, but not every possible diffusion outcome is. We see that σ_I(S) is bounded by the sum of a series. The next question is: under what condition does the series converge as N tends to infinity, and what is the limit?

Corollary 1 tells us that, under condition (14), we obtain a simple and tractable upper bound, Eq. (15).
Convergence condition: the total influence to or from any node is less than 1.
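The equations themselves are not preserved in this transcript, so the sketch below rests on an assumption: that the bound of Theorem 1 is the series Σ_t Π_0 · PP^t · 1, where Π_0 is the 0/1 indicator row vector of the seed set and PP is the propagation probability matrix. If the series has that form and converges, it is a geometric matrix series whose limit is Π_0 · (E − PP)^(−1) · 1 with E the identity matrix. The function name and the toy matrices are hypothetical.

```python
def upper_bound_spread(pp, seeds, tol=1e-12, max_steps=10000):
    """Sum the series pi_0 * PP^t * 1 for t = 0, 1, 2, ... (assumed bound form).

    pp[i][j] is the propagation probability of edge i -> j.
    seeds is a set of node indices; pi_0 is its 0/1 indicator row vector.
    Stops once a term drops below tol, or after max_steps terms.
    """
    n = len(pp)
    pi = [1.0 if i in seeds else 0.0 for i in range(n)]
    total = 0.0
    for _ in range(max_steps):
        mass = sum(pi)  # the current term pi_0 * PP^t * 1
        total += mass
        if mass < tol:
            break
        # Advance one step of the series: pi <- pi * PP
        pi = [sum(pi[i] * pp[i][j] for i in range(n)) for j in range(n)]
    return total

# Hypothetical chain 0 -> 1 -> 2 with probability 0.5 on each edge.
chain = [[0.0, 0.5, 0.0],
         [0.0, 0.0, 0.5],
         [0.0, 0.0, 0.0]]
print(upper_bound_spread(chain, {0}))
```

On the chain the series gives 1 + 0.5 + 0.25 = 1.75, which here coincides with the exact expected spread because there is only one path to each node. On a two-node cycle with probability 0.5 in each direction the series sums to 2, above the exact spread of 1.5, which illustrates that the quantity is a bound rather than the spread itself.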

Our UBLF algorithm
CELF: the first round is time-consuming and needs full MC simulations.
UBLF: the first round is calculated analytically.
CELF requires N spread estimations to establish the initial bounds on the marginal gains, which is expensive on large graphs. In our Upper Bound based Lazy Forward (UBLF) algorithm, we use the derived upper bound to reduce the number of spread estimations in the initialization step. All nodes are first ranked by their upper bound scores, which can reduce the computational cost of the original CELF algorithm. We illustrate this with an example.

Our work: An example for UBLF
Given a graph G with propagation probability matrix PP [shown on the slide], Corollary 1 gives the upper bound of the spread with node 1 as the seed [value shown on the slide]. By the same argument, we obtain the upper bounds of the other three nodes [values shown on the slide]. The upper bound of node 1 is the largest in the graph, so we use Monte Carlo simulation to estimate node 1’s spread and obtain its true value [shown on the slide]. We observe that this value is already larger than the upper bounds of σ_I(2), σ_I(3), and σ_I(4). Thus we do not need extra Monte Carlo simulations for the other three nodes, and node 1 is the node with the maximal influence in the graph.
Monte Carlo simulation: node 1 is selected with only one MC simulation.
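The selection logic of this example can be sketched as a lazy-forward queue that starts from analytic upper bounds instead of simulated spreads. This is a sketch under assumptions, not the authors' code: ublf_seeds is a hypothetical name, and the bound and spread values are made-up numbers, since the slide's actual values are not preserved in this transcript.

```python
import heapq

def ublf_seeds(nodes, upper_bound, spread, k):
    """Lazy forward selection whose queue is initialized with upper bounds.

    upper_bound[v] must be at least spread([v]); spread is the MC oracle.
    Returns the chosen seeds and the number of spread evaluations performed.
    """
    calls = 0
    # Entries: (-score, node, stamp). Stamp -1 means "bound only, not simulated".
    heap = [(-upper_bound[v], v, -1) for v in nodes]
    heapq.heapify(heap)

    seeds, current = [], 0.0
    while len(seeds) < k and heap:
        neg_score, v, stamp = heapq.heappop(heap)
        if stamp == len(seeds):
            # Simulated gain for the current seed set tops all remaining bounds.
            seeds.append(v)
            current += -neg_score
        else:
            gain = spread(seeds + [v]) - current
            calls += 1
            heapq.heappush(heap, (-gain, v, len(seeds)))
    return seeds, calls

# Hypothetical numbers: node 1 has the largest bound, and its simulated
# spread (1.5) already exceeds the bounds of nodes 2, 3, and 4.
bounds = {1: 1.9, 2: 1.2, 3: 1.1, 4: 1.0}
simulated = {(1,): 1.5}  # a lookup for any other node would raise KeyError
seeds, calls = ublf_seeds([1, 2, 3, 4], bounds, lambda s: simulated[tuple(s)], k=1)
print(seeds, calls)
```

Only node 1 is ever passed to the spread oracle, mirroring the single Monte Carlo simulation in the example. The result matches exhaustive greedy only as long as every bound really is at least the true spread.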

Experiments
Data collection: Ca-GrQc, Digger, Ca-HepPh, -Enron
Benchmark: CELF, Degree, DegreeDiscount, PageRank, Random
[Table: statistics of the data sets]
We use four real-world data sets for testing and comparison. We implement five other algorithms, CELF, Degree, PageRank, DegreeDiscount, and Random, for comparison. The platform is …

Experiments: comparison results (number of MC simulations)
From the results, we can observe that the number of Monte Carlo calls in UBLF is significantly lower than in CELF, especially in the first two iterations.
Observation: the total number of MC calls in UBLF is significantly reduced compared to CELF.

Experiments: comparison results (influence spread)
Observations: the spreads of UBLF and CELF are completely identical, which again shows that UBLF and CELF share the same logic in selecting nodes.
Influence spread is an important measure for comparison. In this part, we run tests on the four data sets to obtain the influence spread with respect to the seed set size k, where k increases from 1 to 50. The observations are items 1 and 2 in the red frame on the right.

Experiments: comparison results (time cost)
This figure shows the time cost of selecting 10 seeds.
Observation: UBLF is 2-5 times faster than CELF.

Conclusions
Problem formulation, background, greedy algorithm
Heuristic algorithms: DegreeDiscount, PageRank, et al.
Advanced greedy algorithms: CELF, CELF++
UBLF
Comparisons

Questions?
Q: How do you get the upper bound? By which method?
A: We analyze the spread function under the IC model, in particular its stochastic structure, in the language of probability theory. We break the whole into parts, analyze these parts, and then assemble them into a whole.
Q: How can the upper bound be used to design a new algorithm?
A: We apply the upper bound to the classical CELF algorithm and optimize the greedy algorithm from the first iteration to get rid of heavy Monte Carlo simulations.
Q: Do you have similar upper bounds for other propagation models?
A: I think so. This is also our future work.
