Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms
Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos
School of Computer Science, Carnegie Mellon University; National Taiwan University of Science & Technology
ECML PKDD, 5-9 September 2011, Athens, Greece

Problem Definition: GBA techniques
Given: a graph with N nodes and M edges; a few labeled nodes
Find: the class (red/green) of the remaining nodes
Assuming: network effects (homophily/heterophily)
Danai Koutra (CMU) - PKDD 2011

Homophily and Heterophily
[Figure: Step 1, Step 2]
All methods handle homophily. NOT all methods handle heterophily, BUT the proposed method does!

Why do we study these methods?

Motivation (1): Law Enforcement
[Tong+ ’06][Lin+ ’04][Chen+ ’11]…

Motivation (2): Cyber Security
[Figure labels: bot; botnet members?; victims?]
[Kephart+ ’95][Kolter+ ’06][Song+ ’08-’11][Chau+ ’11]…

Motivation (3): Fraud Detection
[Figure labels: fraudster; fraudsters?; Lax controls?]
[Neville+ ’05][Chau+ ’07][McGlohon+ ’09]…

Motivation (4): Ranking
[Brin+ ’98][Tong+ ’06][Ji+ ’11]…

Our Contributions
Theory
- correspondence: BP ≈ RWR ≈ SSL
- linearization for BP
- convergence criteria for linearized BP
Practice
- FaBP algorithm: fast, accurate, and scalable
- experiments on DBLP, Web, and Kronecker graphs

Roadmap
- Background: Belief Propagation, Random Walk with Restarts, Semi-supervised Learning
- Linearized BP
- Correspondence of Methods
- Proposed Algorithm
- Experiments
- Conclusions

Background
Apologies for diversion…

Background 1: Belief Propagation (BP)
Iterative message-based method: 1st round, 2nd round, ... until the stop criterion is fulfilled.
[Figure: example node beliefs (0.9, 0.1), (0.2, 0.8), (0.3, 0.7), (0.9, 0.1)]
"Propagation matrix" (rows: class of "sender"; columns: class of "receiver"); it can encode homophily or heterophily:
  0.9  0.1
  0.1  0.9
Usually same diagonal = homophily factor h.
"About-half" homophily factor: h_h = h - 0.5, which gives
  0.4  -0.4
 -0.4   0.4
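
The relation between the propagation matrix and its "about-half" form can be sketched in a few lines. This is only an illustration of the 2-class case on the slide; the function names are hypothetical, and the heterophily example (h = 0.1) is an assumed value, not one taken from the talk.

```python
import numpy as np


def propagation_matrix(h: float) -> np.ndarray:
    """2x2 propagation matrix with the homophily factor h on the diagonal."""
    return np.array([[h, 1.0 - h],
                     [1.0 - h, h]])


def about_half(matrix: np.ndarray) -> np.ndarray:
    """Center every entry around 0.5, so the diagonal becomes h_h = h - 0.5."""
    return matrix - 0.5


H = propagation_matrix(0.9)          # homophily example from the slide
H_centered = about_half(H)           # diagonal is h_h = 0.4, off-diagonal -0.4
H_hetero = propagation_matrix(0.1)   # assumed example: h < 0.5 models heterophily
```

Each row of the uncentered matrix sums to 1, and each row of the centered matrix sums to 0, which is why a single scalar h_h is enough to describe the 2-class case.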

Background 1: Belief Propagation Equations
[Pearl ’82][Yedidia+ ’02]…[Pandit+ ’07][Gonzalez+ ’09][Chechetka+ ’10]

Background 2: Semi-Supervised Learning
Graph-based SSL: use a few labeled data and exploit neighborhood information.
[Figure: Step 1, Step 2; example node scores 0.8, -0.3, ?, ?, -0.1, 0.6, 0.8]
[Zhou ’06][Ji, Han ’10]…

Background 3: Personalized Random Walk with Restarts (RWR)
[Brin+ ’98][Haveliwala ’03][Tong+ ’06][Minkov, Cohen ’07]…

Background

Qualitative Comparison of GBA Methods
GBA Method   Heterophily   Scalability   Convergence
RWR          ✗             ✓             ✓
SSL          ✗             ✓             ✓
BP           ✓             ✓             ?
FaBP         ✓             ✓             ✓

Roadmap
- Background
- Linearized BP
- Correspondence of Methods
- Proposed Algorithm
- Experiments
- Conclusions
[Slide annotations: Previous work; New work]

Linearized BP
Theorem [Koutra+]: BP is approximated by the linear system
  [I + a D - c' A] b_h = φ_h
where b_h are the final beliefs, φ_h the prior beliefs, A the adjacency matrix, D = diag(d1, d2, d3, ...) the degree matrix, and a, c' scalar constants.
Sketch of proof: odds ratio; Maclaurin expansions.
[Figure: example with a 0/1 adjacency matrix, degrees d1, d2, d3, priors such as 0, -10^-2, 10^-2, and beliefs p_i centered around 0.5 on a 0 to 1 scale]

Linearized BP vs BP
BP is approximated by Linearized BP.
Original [Yedidia+]: Belief Propagation, non-linear.
Our proposal: Linearized BP, linear.

Linearized BP: convergence
Theorem: Linearized BP converges if a bound on h_h holds; the bound depends on the degrees of the nodes (degree of node n).
Sketch of proof: 1-norm < 1 OR Frobenius norm < 1.
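
The norm conditions in the proof sketch can be checked numerically. The sketch below rewrites the linear system [I + a D - c' A] b_h = φ_h as the fixed-point iteration b_h = φ_h + W b_h with W = c' A - a D, and tests whether the 1-norm or the Frobenius norm of W is below 1 (either one is sufficient for the iteration to converge). The closed forms used for a and c' in terms of h_h are an assumption: they do not appear in this transcript, and the function names are hypothetical.

```python
import numpy as np


def fabp_iteration_matrix(A: np.ndarray, h_h: float) -> np.ndarray:
    """Return W = c'A - aD, so that [I + aD - c'A] b = phi reads b = phi + W b."""
    # ASSUMPTION: these definitions of a and c' are not shown on the slides.
    a = 4.0 * h_h**2 / (1.0 - 4.0 * h_h**2)
    c_prime = 2.0 * h_h / (1.0 - 4.0 * h_h**2)
    D = np.diag(A.sum(axis=1))
    return c_prime * A - a * D


def sufficient_convergence(A: np.ndarray, h_h: float) -> bool:
    """True if the 1-norm OR the Frobenius norm of W is below 1."""
    W = fabp_iteration_matrix(A, h_h)
    one_norm = np.linalg.norm(W, 1)      # maximum absolute column sum
    fro_norm = np.linalg.norm(W, "fro")  # square root of the sum of squared entries
    return bool(one_norm < 1.0 or fro_norm < 1.0)


# Path graph on three nodes: 0 - 1 - 2
A_path = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])

small_ok = sufficient_convergence(A_path, 0.01)   # small h_h: inside the bound
large_ok = sufficient_convergence(A_path, 0.49)   # h_h close to 0.5: outside
```

Both tests are sufficient but not necessary: an iteration whose norms exceed 1 may still converge if the spectral radius of W is below 1, which matches the |e_val| = 1 boundary drawn in the convergence plots.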

Correspondence of Methods
Method   Matrix                Unknown   Known
RWR      [I - c A D^-1]      × x       = (1-c) y
SSL      [I + a (D - A)]     × x       = y
FaBP     [I + a D - c' A]    × b_h     = φ_h
Unknown: final labels/beliefs. Known: prior labels/beliefs. A: adjacency matrix. D: diagonal degree matrix, diag(d1, d2, d3, ...).

RWR ≈ SSL
THEOREM: RWR and SSL are identical under a condition that ties the individual homophily strength of node i (SSL) to the fly-out probability (RWR).
Simplification: use a global homophily strength for all nodes (SSL).
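
One way to see why such a condition exists is to divide the RWR system [I - c A D^-1] x = (1-c) y by (1-c), which gives [I + (D - A) diag(a_i)] x = y with a_i = c / ((1-c) d_i). The sketch below checks this numerically on a toy graph. The per-node form of the SSL system and the choice of a_i are assumptions derived from that algebra; they are not copied from the slide, whose formula is not preserved here.

```python
import numpy as np

# Toy undirected graph (assumed example); every node has degree > 0.
A = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
d = A.sum(axis=1)            # node degrees
D = np.diag(d)
I = np.eye(A.shape[0])
y = np.array([1.0, 0.0, 0.0, 0.0])   # prior labels: only node 0 is labeled
c = 0.85                              # RWR parameter c from the table (assumed value)

# RWR: [I - c A D^-1] x = (1 - c) y.  A / d divides column j by d[j], i.e. A D^-1.
x_rwr = np.linalg.solve(I - c * (A / d), (1.0 - c) * y)

# SSL with one homophily strength per node (ASSUMED form):
# [I + (D - A) diag(a_i)] x = y, with a_i = c / ((1 - c) d_i).
a_i = c / ((1.0 - c) * d)
x_ssl = np.linalg.solve(I + (D - A) @ np.diag(a_i), y)

scores_match = bool(np.allclose(x_rwr, x_ssl))
```

With a single global strength in place of a_i the two systems no longer coincide exactly, which is consistent with the slide calling the global version a simplification.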

RWR ≈ SSL: example
[Figure: scatter plot of SSL scores vs. RWR scores with the line y = x, for individual and global homophily strength]
Similar scores and identical rankings.

Proposed algorithm: FaBP
① Pick the homophily factor h_h.
② Solve the linear system [I + a D - c' A] b_h = φ_h.
③ (opt) If accuracy is low, run BP with prior beliefs.
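
Steps ① and ② can be sketched with a dense solver on a toy graph. This is a minimal illustration, not the authors' Hadoop implementation: the closed forms for a and c' are an assumption not shown in this transcript, the function name `fabp` and the example graph are hypothetical, and the priors ±0.001 with h_h = 0.002 simply reuse the magnitudes quoted on the results slides.

```python
import numpy as np


def fabp(A: np.ndarray, phi_h: np.ndarray, h_h: float) -> np.ndarray:
    """Solve [I + aD - c'A] b_h = phi_h for the final (centered) beliefs b_h."""
    # ASSUMPTION: these definitions of a and c' are not shown on the slides.
    a = 4.0 * h_h**2 / (1.0 - 4.0 * h_h**2)
    c_prime = 2.0 * h_h / (1.0 - 4.0 * h_h**2)
    D = np.diag(A.sum(axis=1))
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) + a * D - c_prime * A, phi_h)


# Two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

# Centered priors: node 0 leans to one class, node 5 to the other, the rest unknown.
phi_h = np.array([0.001, 0.0, 0.0, 0.0, 0.0, -0.001])

b_h = fabp(A, phi_h, h_h=0.002)   # step 1: h_h picked by hand; step 2: solve
labels = np.sign(b_h)             # positive vs. negative centered belief
```

With a positive h_h (homophily), each triangle inherits the sign of its labeled node. The optional step ③ is not shown, since it requires a full BP implementation.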

Datasets
p% labeled nodes initially
YahooWeb: .edu/others | DBLP: AI/not AI
Accuracy computed on hold-out set
Dataset       # nodes         # edges
YahooWeb      1,413,511,390   6,636,600,779  (6 billion!)
Kronecker 1   177,147         1,977,149,596
Kronecker 2   120,552         1,145,744,786
Kronecker 3   59,049          282,416,924
Kronecker 4   19,683          40,333,924
DBLP          37,791          170,794

Specs
- Hadoop version 0.20.2
- M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5 PB total storage, 3.5 TB of memory
- 100 machines used for the experiments

Roadmap
- Background
- Linearized BP
- Correspondence of Methods
- Proposed Algorithm
- Experiments: 1. Accuracy, 2. Convergence, 3. Sensitivity, 4. Scalability, 5. Parallelism
- Conclusions

Results (1): Accuracy
[Figure: scatter plot of beliefs in FaBP vs. beliefs in BP for (h, priors) = (0.5±0.002, 0.5±0.001), 0.3% labels; classes AI and non-AI]
All points lie on the diagonal: the scores are near-identical.

Results (2): Convergence
[Figure: % accuracy wrt h_h (priors = ±0.001), 0.3% labels; convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]
FaBP achieves maximum accuracy within the convergence bounds.

Results (3): Sensitivity to the homophily factor
[Figure: % accuracy wrt h_h (priors = ±0.001), 0.3% labels; convergence bounds marked: Frobenius norm, |e_val| = 1, 1-norm]
FaBP is robust to the homophily factor h_h within the convergence bounds.

(For all plots)
Note: average over 10 runs; the error bars are tiny.
[Figure: % accuracy vs. h_h; % accuracy vs. prior beliefs’ magnitude]

Results (3): Sensitivity to the prior beliefs
[Figure: % accuracy wrt the prior beliefs’ magnitude (h_h = ±0.002), for p = 0.1%, 0.3%, 0.5%, and 5% labeled nodes]
FaBP is robust to the prior beliefs φ_h.

Results (4): Scalability
[Figure: runtime (min) vs. # of edges (Kronecker graphs)]
FaBP is linear in the number of edges.

Results (5): Parallelism
[Figure: runtime (min) vs. # of steps; % accuracy vs. runtime (min)]
FaBP is ~2x faster than BP and wins/ties on accuracy.

Our Contributions
Theory
- ✓ correspondence: BP ≈ RWR ≈ SSL
- ✓ linearization for BP
- ✓ convergence criteria for linearized BP
Practice
- ✓ FaBP algorithm: fast (~2x faster), accurate (same/better), and scalable (6 billion edges!)
- ✓ experiments on DBLP, Web, and Kronecker graphs

Thanks
Data and funding: NSC; ILLINOIS (Ming Ji, Jiawei Han)

Thank you!
[Figure: % accuracy vs. runtime (min)]

Q: Can we have multiple classes?
A: Yes!
Propagation matrix:
      AI    ML    DB
AI    0.7   0.2   0.1
ML    0.2   0.6   0.2
DB    0.1   0.2   0.7

Q: Which of the methods do you recommend?
A: (Fast) Belief Propagation
Reasons:
- solid Bayesian foundation
- handles heterophily and multiple classes (e.g. a propagation matrix with rows 0.7 0.2 0.1 / 0.2 0.6 0.2 / 0.1 0.2 0.7)

Q: Why is FaBP faster than BP?
A: BP sends 2|E| messages per iteration; FaBP updates |V| records per "power method" iteration, and |V| < 2|E|.
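
The "power method" iteration mentioned in the answer can be sketched as a fixed-point update of the linear system [I + a D - c' A] b_h = φ_h, namely b_h ← φ_h + c' A b_h - a D b_h. The state carried between iterations is one number per node (|V| records), in contrast to one BP message per edge direction. This is a dense NumPy illustration rather than the Hadoop version; the function name and the values of a and c' are assumed, chosen small enough for the iteration to converge.

```python
import numpy as np


def fabp_power_method(A, phi_h, a, c_prime, num_iters=50):
    """Iterate b <- phi_h + c'(A b) - a(d * b); the state is one value per node."""
    degrees = A.sum(axis=1)
    b = phi_h.copy()
    for _ in range(num_iters):
        b = phi_h + c_prime * (A @ b) - a * degrees * b
    return b


# Toy graph (assumed example): a 4-cycle 0-1-2-3 with one chord 0-2.
A = np.array([[0.0, 1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0, 0.0]])
phi_h = np.array([0.001, 0.0, 0.0, -0.001])
a, c_prime = 0.01, 0.05   # assumed scalar constants, small enough to converge

b_iter = fabp_power_method(A, phi_h, a, c_prime)

# Reference: direct solve of the same linear system.
D = np.diag(A.sum(axis=1))
b_direct = np.linalg.solve(np.eye(4) + a * D - c_prime * A, phi_h)
```

On a large sparse graph the product `A @ b` would be computed from an edge list or a sparse matrix, so each iteration still reads every edge once but only stores and updates |V| values.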
