December 7-10, 2013, Dallas, Texas

Presentation transcript:

IEEE ICDM 2013, December 7-10, 2013, Dallas, Texas
UBLF: An Upper Bound Based Approach to Discover Influential Nodes in Social Networks
Authors: C. Zhou, P. Zhang, J. Guo, X. Zhu, L. Guo
Presenter: Peng Zhang, Chinese Academy of Sciences

Content: Background; Problem Formulation; Related work; Our solution; Experiments; Conclusion

Background
Social networks are widely used.
Influence propagation: viral marketing, information dissemination, technology/idea transfer.
Related topics: influence propagation, influence maximization, community detection, influence inference, early warning of public opinion, link prediction/friend recommendation, partner recommendation/social cooperation/team formation.
Social networks have become widely used in recent years. They are among the most effective channels for propagating information and for marketing and advertising products. More and more people use social networks to share ideas and opinions and to advertise their products. In these applications, the key problem is to find the most influential nodes in the network. If we can find them, the influence of our ideas, opinions, and products can be maximized.

Problem Formulation
Given a directed social graph G = (V, E), a budget k, and a stochastic propagation model M, find k nodes such that the expected spread of influence is maximized [Kempe, KDD'03].
Challenges: How to measure the objective function σ_M(S)? How to find the optimal solution, i.e., the subset S of the k most influential nodes?
Discovering influential nodes in social networks has been studied for a long time. Ten years ago, Kempe et al. first formulated it as a combinatorial optimization problem. Given a graph G, a propagation model M, and a seed set size k, they define the problem as maximizing the information spread, denoted σ_M(S), subject to the constraints S ⊆ V and |S| = k. This formulation poses two challenges. First, σ_M(S) is very hard to estimate: given a seed set S, the propagation is hard to track, especially when the graph is large. Second, finding the optimal subset of size k is also hard: the problem has been proved NP-hard via a reduction from the Set Cover problem.
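The formulation described in the notes can be written compactly as follows. This is a restatement in standard notation, not a formula copied from the slides; S^{*} is a symbol introduced here for the optimal seed set.

```latex
S^{*} \;=\; \operatorname*{arg\,max}_{S \subseteq V,\; |S| = k} \; \sigma_M(S)
```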

Problem Formulation
How to measure the influence σ_M(S)?
Stochastic propagation models: IC model; LT model; other propagation models, e.g. continuous-time IC or LT models.
Monte Carlo (MC) simulation: exact calculation under IC and LT is #P-hard (Chen, KDD'10).
[Figure: example directed graph with edge propagation probabilities between 0.1 and 0.4, illustrating the IC propagation model]
Now let's look at the first challenge. To measure the objective function, we first need to select a propagation model M. Two widely used stochastic influence propagation models are the Independent Cascade (IC) and Linear Threshold (LT) models. There are also many other propagation models. It is hard to say that any one model always beats the others, so in this paper we choose the IC model as the basic model. In the IC model, at any time step, a user is represented as a binary variable with either active or inactive status, and influence propagates until no more users can become active. When an inactive user becomes active at time step t, it has exactly one chance to independently activate each of its currently inactive neighbors at the next time step t + 1. For example, see the demo on the right. Chen pointed out that computing σ_M(S) is #P-hard, by showing a reduction from the problem of counting s-t connectedness in a graph. After choosing the model, we can use Monte Carlo simulation to estimate the objective function. We use Monte Carlo because it is very hard to calculate the influence exactly. In the figure, a is the source and c is the target node. There are many paths from a to c, and when the graph is very large the number of paths becomes so large that exact computation of the influence is infeasible.
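The Monte Carlo estimate of the spread under the IC model can be sketched as below. This is a minimal illustration, not the authors' implementation: the adjacency format (a dict mapping each node to a list of `(neighbor, probability)` pairs) and the names `simulate_ic` and `estimate_spread` are assumptions made for this example.

```python
import random


def simulate_ic(graph, seeds, rng):
    """Run one Independent Cascade and return the number of active nodes.

    graph: dict mapping node -> list of (neighbor, activation probability).
    """
    active = set(seeds)
    frontier = list(seeds)  # nodes activated in the previous time step
    while frontier:
        next_frontier = []
        for u in frontier:
            # u gets exactly one chance to activate each inactive neighbor
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(graph, seeds, runs=10000, seed=0):
    """Average the cascade size over `runs` independent simulations."""
    rng = random.Random(seed)
    total = sum(simulate_ic(graph, seeds, rng) for _ in range(runs))
    return total / runs
```

Each call to `estimate_spread` costs `runs` full cascades, which is why the number of such calls dominates the running time of the algorithms discussed next.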

Greedy Algorithm
How to find a subset S of size k containing the most influential nodes?
Influence maximization under both the IC and LT models is NP-hard (Kempe, KDD'03), by reduction from the Set Cover problem.
Property 1: σ_M(S) is monotone: σ_M(S) ≤ σ_M(T) for all S ⊆ T ⊆ V.
Property 2: σ_M(S) is submodular: σ_M(S ∪ {v}) − σ_M(S) ≥ σ_M(T ∪ {v}) − σ_M(T) for all S ⊆ T ⊆ V and v ∈ V \ T.
Because the Set Cover problem can be reduced to influence maximization, the problem is NP-hard. Fortunately, the spread function σ_M(S) is monotone and submodular. Intuitively, submodularity means that σ_M(S) has diminishing marginal returns as more nodes are added to the set.

Greedy Algorithm
Advantage: performance guarantee of 1 − 1/e ≈ 63%.
Disadvantage: heavy computation cost. Inner loop: σ_M(S) needs many Monte Carlo simulations. Outer loop: time complexity of O(Nk), where N is the network size.
Exploiting these two properties, Kempe et al. presented a simple greedy algorithm that repeatedly chooses the node with the maximum marginal gain and adds it to the seed set until the budget k is reached. The advantage is that the greedy algorithm approximates the optimal solution within a factor of 1 − 1/e. Unfortunately, the simple greedy algorithm suffers from two major sources of inefficiency. (I) The MC simulations must be run sufficiently many times (typically 10,000) to obtain an accurate estimate of the spread, which is computationally expensive, especially for large networks. (II) The greedy algorithm makes O(kN) calls to the spread estimation step, where k is the size of the seed set and N is the number of nodes. When N is large, the efficiency of the algorithm is unsatisfactory.
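The greedy procedure described in the notes can be sketched as follows. This is an illustrative sketch, not the original code: `greedy_seeds` and the `spread` callable are names assumed here, and `spread` stands in for whatever estimator of σ_M(S) is used (for instance a Monte Carlo estimate). The call counter makes the O(kN) number of spread evaluations visible.

```python
def greedy_seeds(nodes, k, spread):
    """Pick up to k seeds by repeatedly adding the node with the largest marginal gain.

    nodes:  list of candidate nodes.
    spread: callable taking a list of seeds and returning the (estimated) spread.
    Returns the chosen seeds and the number of spread evaluations made.
    """
    seeds = []
    current = 0.0  # spread of the seeds chosen so far
    calls = 0
    for _ in range(min(k, len(nodes))):
        best_node, best_gain = None, float("-inf")
        for v in nodes:
            if v in seeds:
                continue
            gain = spread(seeds + [v]) - current  # marginal gain of adding v
            calls += 1
            if gain > best_gain:
                best_node, best_gain = v, gain
        seeds.append(best_node)
        current += best_gain
    return seeds, calls
```

With a toy coverage function in place of the spread, e.g. `spread = lambda s: len(set().union(*(cover[v] for v in s)))` for a dict `cover` of node-to-set mappings, every round re-evaluates all remaining candidates, which is the cost the later slides try to avoid.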

Improvement direction (I): Heuristic algorithms
ShortestPath: Kimura and Saito (PKDD'06) "Tractable models for information diffusion in social networks"
DegreeDiscount: Chen et al. (KDD'09) "Efficient influence maximization in social networks"
MIA: Chen et al. (KDD'10) "Scalable influence maximization for prevalent viral marketing in large-scale social networks"
DAG: Chen et al. (ICDM'10) "Scalable influence maximization in social networks under the linear threshold model"
SIMPATH: Goyal et al. (ICDM'11) "SIMPATH: An Efficient Algorithm for Influence Maximization under the Linear Threshold Model"
[Figures: ShortestPath, the shortest path from a to c; DegreeDiscount, node 2's degree will shrink]
Advantage: faster than the greedy algorithm. Disadvantage: no performance guarantee.
To address these limitations, many heuristic solutions have been proposed to improve the efficiency of seed selection, e.g., ShortestPath, DegreeDiscount, MIA, DAG, and SIMPATH. These heuristics can reduce the computational cost by orders of magnitude while achieving competitive influence spread. However, none of them has a theoretical guarantee on the reliability of the results. In other words, it is unknown how closely these heuristic solutions approximate the optimal solution. One can only use the simple greedy algorithm as the benchmark for performance testing.

Improvement direction (II): Advanced greedy
Advanced greedy algorithms:
CELF: Leskovec et al. (KDD'07) "Cost-effective outbreak detection in networks"
CELF++: Goyal et al. (WWW'11) "CELF++: optimizing the greedy algorithm for influence maximization in social networks"
[Figure: reward rankings of nodes a to e under the greedy algorithm]
To accelerate the original greedy algorithm, Leskovec et al. exploited the submodularity of the objective function and proposed the Cost-Effective Lazy Forward selection (CELF) algorithm. CELF significantly reduces the number of MC simulation calls made by the simple greedy algorithm. The underlying principle is that the marginal gain of a node in the current iteration cannot exceed its marginal gain in previous iterations, so many spread estimation calls can be pruned. Leskovec et al. [16] reported that CELF improves the running time of the simple greedy algorithm by up to 700 times.

Improvement direction (II): Advanced greedy
[Figure: reward rankings of nodes a to e under the greedy algorithm and under the CELF algorithm]
Advantage: by setting up an upper bound, CELF reduces the number of Monte Carlo calls and improves the greedy algorithm by up to 700 times.
Disadvantage: CELF needs N Monte Carlo simulations to initialize the upper bounds, where N is the network size.
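The lazy-forward idea behind CELF can be sketched with a priority queue of stale marginal gains. This is a simplified illustration written for this transcript, not the code of Leskovec et al.: `celf_seeds` and the `spread` callable are assumed names, and `spread` stands in for any estimator of the spread (such as Monte Carlo simulation).

```python
import heapq


def celf_seeds(nodes, k, spread):
    """Lazy-forward greedy selection.

    Each heap entry stores a node's most recently computed marginal gain and
    the round (number of seeds already chosen) in which it was computed.
    By submodularity a stale gain is an upper bound on the current gain, so
    a node whose gain is fresh and still on top can be selected at once.
    """
    calls = 0
    heap = []
    # Initialization: one spread evaluation per node (the N calls CELF needs).
    for index, v in enumerate(nodes):
        gain = spread([v])
        calls += 1
        heapq.heappush(heap, (-gain, index, v, 0))

    seeds, current = [], 0.0
    while heap and len(seeds) < k:
        neg_gain, index, v, computed_in = heapq.heappop(heap)
        if computed_in == len(seeds):
            # Gain is up to date and no other bound exceeds it: select v.
            seeds.append(v)
            current += -neg_gain
        else:
            # Gain is stale: recompute it and put the node back.
            gain = spread(seeds + [v]) - current
            calls += 1
            heapq.heappush(heap, (-gain, index, v, len(seeds)))
    return seeds, calls
```

On a toy instance with 4 nodes and k = 2, plain greedy evaluates 4 + 3 = 7 candidates, whereas the lazy version may stop after as few as 5 evaluations; the 4 initialization calls remain, which is the cost the slide lists as CELF's disadvantage.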

Our work
Motivation: Can we initialize the upper bounds without actually computing the MC simulations?
[Table, CELF algorithm (columns: node, upper bound, MC): a 2.1 (MC: 1); b 1.5; c 1.1; d 1.8; e 1.2]
[Table, UBLF algorithm (columns: node, upper bound, MC): a 2.3; b 1.7; c 1.2; d 1.8; e]
Although CELF significantly improves the simple greedy algorithm, it is still quite slow on large networks. In particular, in the initialization step, CELF needs to estimate the spread of every node in the network using Monte Carlo simulation, resulting in N Monte Carlo calls (N is the total number of nodes in the network), which is time-consuming, especially when the network is very large. This limitation leads to a rather fundamental question: can we derive an upper bound on the spread that can be used to prune unnecessary spread estimations (Monte Carlo calls) in the CELF algorithm? To the best of our knowledge, no work in the literature mathematically discusses the upper bound properties of the spread function.

The upper bound of M(S) Local view Global view Proposition 1 reveals that we can treat the global influence measure σI (S) as a summation of local propagation probabilities. Namely we can break up the whole into parts. Proposition 2 clearly identifies the ordering relationship between two adjacent elements in the series of propagation probabilities. Actually this provides a basic for assembling the parts into a whole. How many heads? Proposition 2 establishes a relationship among the activation probabilties in time t and t+1.

The upper bound of M(S) M(S) is bounded by a sum of series. In what condition the series convergent? and what is the limit? Theorem 1 tells that we can figure out the upper bound of spread function σI (S), although the exact values are #P hard to compute. One should note that the upper bound means hat the expected value of diffusion range is less than the upper bound, rather than that every possible diffusion range is less than it. We see that,M(S) is bounded by a sum of series. The next problem is: in what condition the series will be convergent as N trends to infinity? and what is the limit? Too hard! Its aera? But we know its upper bound!

The upper bound of M(S) Corollary 1 tell that, under the condition (14), we can obtain a simple and tractable upper bound like Eq. (15). Convergent condition:the total influence to or from any node is less than 1. Under condition (14), we get a tractable upper bound. +……=

Our UBLF algorithm
CELF: the first round is time-consuming and needs full MC simulations.
UBLF: the first round is calculated analytically.
CELF requires N spread estimations to establish the initial bounds on the marginal gains, which is expensive on large graphs. In our Upper Bound based Lazy Forward (UBLF) algorithm, we use the derived upper bound to further reduce the number of spread estimations in the initialization step. All nodes are ranked by their upper bound scores, which can reduce the computational cost of the original CELF algorithm. We illustrate this with an example.

Our work: An example for UBLF
Given a graph G with propagation probability matrix PP, based on Corollary 1, the upper bound of the spread with seed node 1 is 1.3911. By the same argument, the upper bounds of the other three nodes are 1.3417, 1.2278, and 1.1391, respectively. The upper bound of node 1, 1.3911, is the largest in the graph. Thus, we use Monte Carlo simulation to estimate node 1's spread and obtain the value 1.3788. We observe that 1.3788 is already larger than the upper bounds of σ_I({2}), σ_I({3}), and σ_I({4}). Thus, we do not need extra Monte Carlo simulations to estimate the other three nodes, and node 1 is the node with the maximal influence in the graph.
Node 1 is selected with only one MC simulation.
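The first round of this example can be traced with a small lazy-forward routine. The sketch below is an illustration rather than the authors' code: `select_first_seed` is a hypothetical name, the bounds and the Monte Carlo value for node 1 are the numbers quoted in the example, and the estimator is a stub that only knows node 1's value (so the run would fail if any other node were evaluated).

```python
import heapq


def select_first_seed(upper_bounds, estimate_spread):
    """Pick the first seed, starting from analytic upper bounds.

    upper_bounds:    dict mapping node -> upper bound on its spread.
    estimate_spread: callable returning the (Monte Carlo) spread of one node.
    Returns (node, estimated spread, number of estimator calls).
    """
    # Heap entries: (-value, node, value_is_an_estimate)
    heap = [(-bound, node, False) for node, bound in upper_bounds.items()]
    heapq.heapify(heap)
    calls = 0
    while True:
        neg_value, node, is_estimate = heapq.heappop(heap)
        if is_estimate:
            # The estimated spread beats every remaining upper bound.
            return node, -neg_value, calls
        value = estimate_spread(node)
        calls += 1
        heapq.heappush(heap, (-value, node, True))


# Numbers from the example: bounds for nodes 1 to 4, MC estimate for node 1.
bounds = {1: 1.3911, 2: 1.3417, 3: 1.2278, 4: 1.1391}
mc_values = {1: 1.3788}  # stub: only node 1 is ever estimated
first_seed, first_spread, mc_calls = select_first_seed(bounds, lambda n: mc_values[n])
```

Because 1.3788 exceeds the bounds of nodes 2, 3, and 4, the routine returns node 1 after a single estimator call, whereas CELF's initialization would have estimated all four nodes.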

Experiments
Data sets: Ca-GrQc, Digger, Ca-HepPh, Email-Enron.
Benchmarks: CELF, Degree, DegreeDiscount, PageRank, Random.
[Table: statistics of the data sets]
We use four real-world data sets for testing and comparison. We implement five other algorithms, CELF, Degree, PageRank, DegreeDiscount, and Random, for comparison. The platform is….

Experiments
Comparison results (number of MC simulations)
Observation: the total number of Monte Carlo calls in UBLF is significantly lower than in CELF, especially in the first two iterations.

Experiments
Comparison results (influence spread)
Observation: the spreads of UBLF and CELF are identical, which confirms that UBLF and CELF share the same logic in selecting nodes.
Influence spread is an important measure for comparison. In this part, we run tests on the four data sets to obtain the influence spread with respect to the parameter k (the seed set size), where k increases from 1 to 50. (See 1 and 2 in the red frame on the right.)

Experiments
Comparison results (time cost)
Observation: the figure shows the time cost of selecting 10 seeds. UBLF is 2-5 times faster than CELF.

Conclusions
Background; Problem Formulation; Greedy Algorithm; Heuristic algorithms: DegreeDiscount, PageRank, etc.; Advanced greedy algorithms: CELF, CELF++; UBLF; Comparisons

Questions?
Q: How do you get the upper bound? By which method?
A: We analyze the spread function in the setting of the IC model, especially its stochastic structure, in the language of probability theory. We break up the whole into parts, analyze these parts, and then assemble them into a whole.
Q: How can the upper bound be used to design a new algorithm?
A: We apply the upper bound to the classical CELF algorithm and optimize the greedy algorithm from the first iteration, to get rid of heavy Monte Carlo simulations.
Q: Do you have similar upper bounds for other propagation models?
A: I think so. This is also our future work.
Email: zhangpeng@iie.ac.cn