1 Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms
Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos
School of Computer Science, Carnegie Mellon University; National Taiwan University of Science & Technology
ECML PKDD, 5-9 September 2011, Athens, Greece

2 Problem Definition: GBA techniques. Given: a graph with N nodes & M edges; few labeled nodes. Find: the class (red/green) of the remaining nodes. Assuming: network effects (homophily/heterophily). © Danai Koutra - PKDD'11

3 Homophily and Heterophily. Step 1, Step 2. All methods handle homophily. NOT all methods handle heterophily, BUT the proposed method does!

4 Why do we study these methods?

5 Motivation (1): Law Enforcement [Tong+ ’06][Lin+ ’04][Chen+ ’11]…

6 Motivation (2): Cyber Security. Victims? Botnet members? [Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ’11]…

7 Motivation (3): Fraud Detection. Lax controls? Fraudsters? [Neville+ ’05][Chau+ ’07][McGlohon+ ’09]…

8 Motivation (4): Ranking [Brin+ ’98][Tong+ ’06][Ji+ ’11]…

9 Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs.

10 Roadmap: Background (Belief Propagation, Random Walk with Restarts, Semi-supervised Learning); Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions

11 Background Apologies for diversion… © Danai Koutra - PKDD'11

12 Background 1: Belief Propagation (BP). Iterative message-based method: 1st round, 2nd round, ... until the stop criterion is fulfilled. “Propagation matrix” (class of “sender” by class of “receiver”): encodes homophily or heterophily. Usually the same diagonal = homophily factor h. “About-half” homophily factor: h_h = h - 0.5.
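The propagation matrix above can be sketched in a few lines. This is a minimal illustration, assuming two classes and a symmetric matrix with the homophily factor h on the diagonal; the function names are hypothetical and do not come from the slides.

```python
import numpy as np

def propagation_matrix(h):
    """2x2 propagation matrix: rows = class of sender, columns = class of receiver.
    Assumption: the same value h sits on the diagonal, which the slide calls the usual case."""
    return np.array([[h, 1.0 - h],
                     [1.0 - h, h]])

def about_half(h):
    """'About-half' homophily factor h_h = h - 0.5."""
    return h - 0.5

homophily = propagation_matrix(0.9)    # h > 0.5: neighbors tend to share a class
heterophily = propagation_matrix(0.1)  # h < 0.5: neighbors tend to differ

print(homophily)
print(heterophily)
print(about_half(0.9), about_half(0.1))
```

A positive h_h then signals homophily and a negative h_h signals heterophily.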

13 Background 1: Belief Propagation Equations [Pearl ‘82][Yedidia+ ‘02] …[Pandit+ ‘07][Gonzalez+ ‘09][Chechetka+ ‘10] © Danai Koutra - PKDD'11

14 Background 2: Semi-Supervised Learning. Graph-based SSL: use few labeled data & exploit neighborhood information. Step 1, Step 2. [Zhou ’06][Ji, Han ’10]…

15 Background 3: Personalized Random Walk with Restarts (RWR) [Brin+ ’98][Haveliwala ’03][Tong+ ‘06][Minkov, Cohen ‘07]… © Danai Koutra - PKDD'11

16 Background © Danai Koutra - PKDD'11

17 Qualitative Comparison of GBA Methods
GBA Method | Heterophily | Scalability | Convergence
RWR        | ✗           | ✓           | ✓
SSL        | ✗           | ✓           | ✓
BP         | ✓           | ✓           | ?
FaBP       | ✓           | ✓           | ✓

19 Roadmap Background Linearized BP Correspondence of Methods Proposed Algorithm Experiments Conclusions New work Previous work © Danai Koutra - PKDD'11

20 Linearized BP (DETAILS!). Theorem [Koutra+]: BP is approximated by the linear system [I + a D - c' A] b_h = φ_h, where b_h are the final beliefs, φ_h the prior beliefs, and a, c' scalar constants. Sketch of proof: odds ratio; Maclaurin expansions. [Figure: example graph with an unknown node "?" and nodes d1, d2, d3; probability p_i on an axis from 0 to 1, centered at 0.5]

21 Linearized BP vs BP. Original [Yedidia+]: Belief Propagation, non-linear. Our proposal: Linearized BP, linear. BP is approximated by Linearized BP.

22 Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs. ✓

23 Linearized BP: convergence (DETAILS!). Theorem: Linearized BP converges if the 1-norm < 1 OR the Frobenius norm < 1; the resulting bounds depend on the degree of each node n. Sketch of proof.
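The norm conditions on this slide can be checked numerically. The sketch below rests on two assumptions that are not in the transcript: that the norms apply to the iteration matrix W = c'A - aD of the linearized system, and that the scalar constants are a = 4h_h²/(1 - 4h_h²) and c' = 2h_h/(1 - 4h_h²), as recalled from the accompanying paper. All function names are hypothetical.

```python
import numpy as np

def fabp_constants(h_h):
    """Scalar constants a and c' (assumed formulas, not stated on the slides)."""
    denom = 1.0 - 4.0 * h_h ** 2
    return 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom

def satisfies_norm_bound(A, h_h):
    """Sufficient check: 1-norm < 1 OR Frobenius norm < 1 for W = c'A - aD."""
    a, c_prime = fabp_constants(h_h)
    D = np.diag(A.sum(axis=1))           # diagonal degree matrix
    W = c_prime * A - a * D              # assumed iteration matrix
    one_norm = np.linalg.norm(W, 1)      # maximum absolute column sum
    fro_norm = np.linalg.norm(W, "fro")  # Frobenius norm
    return bool(one_norm < 1.0 or fro_norm < 1.0)

# Toy undirected path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

print(satisfies_norm_bound(A, 0.1))   # small |h_h|
print(satisfies_norm_bound(A, 0.49))  # |h_h| close to 0.5
```

Both conditions are sufficient rather than necessary, so a failed check does not by itself prove divergence.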

24 Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs. ✓ ✓

25 Roadmap Background Linearized BP Correspondence of Methods Proposed Algorithm Experiments Conclusions © Danai Koutra - PKDD'11

26 Correspondence of Methods
Method | Matrix           | Unknown | Known
RWR    | [I - c A D^-1]   | x       | (1-c) y
SSL    | [I + a (D - A)]  | x       | y
FaBP   | [I + a D - c' A] | b_h     | φ_h
Each row reads Matrix × Unknown = Known. The unknowns x and b_h are the final labels/beliefs, the knowns y and φ_h are the prior labels/beliefs, and A is the adjacency matrix.
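To make the correspondence concrete, the three systems RWR [I - cAD^-1]x = (1-c)y, SSL [I + a(D - A)]x = y, and FaBP [I + aD - c'A]b_h = φ_h can each be solved with a dense solver on a toy graph. This is a sketch only: the parameter values are arbitrary, the function names are hypothetical, D is taken to be the diagonal degree matrix, and the FaBP constants a = 4h_h²/(1 - 4h_h²) and c' = 2h_h/(1 - 4h_h²) are an assumption recalled from the accompanying paper rather than something stated in the slides.

```python
import numpy as np

def rwr(A, y, c=0.85):
    """RWR: solve [I - c A D^-1] x = (1 - c) y."""
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return np.linalg.solve(np.eye(len(A)) - c * A @ D_inv, (1.0 - c) * y)

def ssl(A, y, a=0.5):
    """SSL: solve [I + a (D - A)] x = y."""
    D = np.diag(A.sum(axis=1))
    return np.linalg.solve(np.eye(len(A)) + a * (D - A), y)

def fabp(A, phi, h_h=0.1):
    """FaBP: solve [I + a D - c' A] b_h = phi_h (constants are assumed formulas)."""
    denom = 1.0 - 4.0 * h_h ** 2
    a, c_prime = 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom
    D = np.diag(A.sum(axis=1))
    return np.linalg.solve(np.eye(len(A)) + a * D - c_prime * A, phi)

# Toy undirected path graph 0-1-2-3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

y = np.array([1.0, 0.0, 0.0, 0.0])        # node 0 is the only labeled node
phi = np.array([0.01, 0.0, 0.0, -0.01])   # about-half priors: node 0 vs node 3

x_rwr = rwr(A, y)
x_ssl = ssl(A, y)
b_h = fabp(A, phi)
print(x_rwr, x_ssl, b_h, sep="\n")
```

All three methods share the same shape, one sparse matrix applied to an unknown score vector, which is what makes the comparison on this slide possible.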

27 RWR ≈ SSL (DETAILS!). THEOREM: RWR and SSL are identical under a condition relating the individual homophily strength of node i (SSL) to the fly-out probability (RWR). Simplification: use one global homophily strength for all nodes (SSL).
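The exact condition did not survive in this transcript, but one special case can be derived directly from the two systems RWR [I - cAD^-1]x = (1-c)y and SSL [I + a(D - A)]x = y. On a d-regular graph, D = dI, so choosing a = c / ((1 - c) d) gives 1 + a d = 1 / (1 - c), and the SSL system becomes a scaled copy of the RWR system. The sketch below checks this numerically; it is an illustration of that derivation, not the theorem as stated by the authors, and the graph and value of c are arbitrary.

```python
import numpy as np

n, d, c = 5, 2, 0.6

# Cycle graph on n nodes: every node has degree d = 2.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = 1.0
    A[(i + 1) % n, i] = 1.0
D = np.diag(A.sum(axis=1))
I = np.eye(n)

y = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # one labeled / restart node

# RWR: [I - c A D^-1] x = (1 - c) y
x_rwr = np.linalg.solve(I - c * A @ np.linalg.inv(D), (1.0 - c) * y)

# SSL with the matching strength a = c / ((1 - c) d): [I + a (D - A)] x = y
a = c / ((1.0 - c) * d)
x_ssl = np.linalg.solve(I + a * (D - A), y)

print(np.allclose(x_rwr, x_ssl))
```

On graphs with varying degrees a single global a can only approximate this, which is consistent with the slide's distinction between individual and global homophily strength.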

28 RWR ≈ SSL: example. Similar scores and identical rankings. [Figure: RWR scores vs SSL scores against the line y = x, for individual and for global homophily strength]

29 Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP. Practice: FaBP algorithm (fast, accurate, and scalable); experiments on DBLP, Web, and Kronecker graphs. ✓ ✓ ✓

30 Roadmap Background Linearized BP Correspondence of Methods Proposed Algorithm Experiments Conclusions © Danai Koutra - PKDD'11

31 Proposed algorithm: FaBP. ① Pick the homophily factor. ② Solve the linear system. ③ (opt) If accuracy is low, run BP with prior beliefs.
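Steps ① and ② can be sketched as a simple fixed-point ("power method" style) iteration b ← φ + (c'A - aD) b, whose fixed point satisfies [I + aD - c'A] b = φ. This is a minimal dense-matrix illustration, not the authors' Hadoop implementation; the function name, the stopping rule, and the constants a = 4h_h²/(1 - 4h_h²) and c' = 2h_h/(1 - 4h_h²) are assumptions (the constants are recalled from the accompanying paper). Step ③ is not shown.

```python
import numpy as np

def fabp_iterate(A, phi, h_h, max_iter=200, tol=1e-12):
    """Iterate b <- phi + c' A b - a D b until successive iterates agree."""
    denom = 1.0 - 4.0 * h_h ** 2
    a = 4.0 * h_h ** 2 / denom        # assumed constant
    c_prime = 2.0 * h_h / denom       # assumed constant
    deg = A.sum(axis=1)               # node degrees (diagonal of D)
    b = phi.copy()
    for _ in range(max_iter):
        b_next = phi + c_prime * (A @ b) - a * deg * b
        if np.max(np.abs(b_next - b)) < tol:
            return b_next
        b = b_next
    return b

# Toy undirected path graph 0-1-2-3 with one labeled node per class.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
phi = np.array([0.001, 0.0, 0.0, -0.001])  # about-half priors of magnitude 0.001

h_h = 0.002                                # step 1: pick the homophily factor
b = fabp_iterate(A, phi, h_h)              # step 2: solve the linear system
labels = np.sign(b).astype(int)            # +1 / -1 stand in for the two classes
print(b, labels)
```

Each iteration touches one value per node plus one multiply per edge, so a sparse-matrix version of the same loop would scale with the number of edges.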

32 Roadmap Background Linearized BP Correspondence of Methods Proposed Algorithm Experiments Conclusions © Danai Koutra - PKDD'11

33 Datasets. p% labeled nodes initially. YahooWeb: .edu/others | DBLP: AI/not AI. Accuracy computed on a hold-out set.
Dataset     | # nodes       | # edges
YahooWeb    | 1,413,511,390 | 6,636,600,779 (6 billion!)
Kronecker 1 | 177,147       | 1,977,149,596
Kronecker 2 | 120,552       | 1,145,744,786
Kronecker 3 | 59,049        | 282,416,924
Kronecker 4 | 19,683        | 40,333,924
DBLP        | 37,791        | 170,794

34 Specs. hadoop version. M45 hadoop cluster (Yahoo!): 500 machines; 4000 cores; 1.5 PB total storage; 3.5 TB of memory. 100 machines used for the experiments.

35 Roadmap Background Linearized BP Correspondence of Methods Proposed Algorithm Experiments 1. Accuracy 2. Convergence 3. Sensitivity 4. Scalability 5. Parallelism Conclusions © Danai Koutra - PKDD'11

36 Results (1): Accuracy. All points on the diagonal → scores near-identical. [Figure: scatter plot of beliefs in BP vs beliefs in FaBP (AI, non-AI) for (h, priors) = (0.5±0.002, 0.5±0.001); 0.3% labels]

37 Results (2): Convergence. FaBP achieves maximum accuracy within the convergence bounds. [Figure: % accuracy wrt h_h (priors = ±0.001), 0.3% labels; convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]

38 Results (3): Sensitivity to the homophily factor. FaBP is robust to the homophily factor h_h within the convergence bounds. [Figure: % accuracy wrt h_h (priors = ±0.001), 0.3% labels; convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]

39 Note (for all plots): average over 10 runs; error bars are tiny. [Axes: % accuracy vs h_h; % accuracy vs prior beliefs’ magnitude]

40 Results (3): Sensitivity to the prior beliefs. FaBP is robust to the prior beliefs φ_h. [Figure: % accuracy wrt prior beliefs’ magnitude (h_h = ±0.002) for p = 0.1%, 0.3%, 0.5%, 5%]

41 Results (4): Scalability. FaBP runtime is linear in the number of edges. [Figure: runtime (min) vs # of edges (Kronecker graphs)]

42 Results (5): Parallelism. FaBP is ~2x faster & wins/ties on accuracy. [Figures: runtime (min) vs # of steps; % accuracy vs runtime (min)]

43 Roadmap Background Linearized BP Correspondence of Methods Proposed Algorithm Experiments Conclusions © Danai Koutra - PKDD'11

44 Our Contributions. Theory: correspondence BP ≈ RWR ≈ SSL ✓; linearization for BP ✓; convergence criteria for linearized BP ✓. Practice: FaBP algorithm ✓: fast (~2x faster), accurate (same/better), and scalable (6 billion edges!); experiments on DBLP, Web, and Kronecker graphs ✓.

45 Thanks Data Funding NSC ILLINOIS Ming Ji, Jiawei Han © Danai Koutra - PKDD'11

46 Thank you!

47 Q: Can we have multiple classes? A: Yes! [Figure: propagation matrix for the classes AI, ML, DB]

48 Q: Which of the methods do you recommend? A: (Fast) Belief Propagation. Reasons: solid Bayesian foundation; handles heterophily and multiple classes (via the propagation matrix).

49 Q: Why is FaBP faster than BP? A: BP sends 2|E| messages per iteration; FaBP processes |V| records per “power method” iteration; and |V| < 2|E|.
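As a rough worked example of |V| < 2|E|, the per-iteration volumes can be compared using the node and edge counts from the Datasets slide. The split of the fused numbers in that slide into node and edge counts is an assumption of this sketch, and the ratio ignores any difference in per-record cost.

```python
# (nodes, edges) as read from the Datasets slide; the split of the fused
# figures into two columns is an assumption.
datasets = {
    "YahooWeb": (1_413_511_390, 6_636_600_779),
    "DBLP": (37_791, 170_794),
}

ratios = {}
for name, (n_nodes, n_edges) in datasets.items():
    bp_messages = 2 * n_edges   # BP: 2|E| messages per iteration
    fabp_records = n_nodes      # FaBP: |V| records per iteration
    ratios[name] = bp_messages / fabp_records
    print(f"{name}: BP {bp_messages:,} vs FaBP {fabp_records:,} "
          f"(ratio {ratios[name]:.2f})")
```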

