Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms

Presentation transcript:

Unifying Guilt-by-Association Approaches: Theorems and Fast Algorithms
Danai Koutra, U Kang, Hsing-Kuo Kenneth Pao, Tai-You Ke, Duen Horng (Polo) Chau, Christos Faloutsos
School of Computer Science, Carnegie Mellon University; National Taiwan University of Science & Technology
ECML PKDD, 5-9 September 2011, Athens, Greece

Problem Definition: GBA techniques. Given: a graph with N nodes and M edges, and a few labeled nodes. Find: the class (red/green) of the remaining nodes. Assuming: network effects (homophily/heterophily). © Danai Koutra - PKDD'11

Homophily and Heterophily. All methods handle homophily. Not all methods handle heterophily, but the proposed method does. [Figure: Step 1, Step 2.]

Why do we study these methods?

Motivation (1): Law Enforcement. [Tong+ ’06], [Lin+ ’04], [Chen+ ’11], …

Motivation (2): Cyber Security. Figure labels: bot; botnet members?; victims? [Kephart+ ’95], [Kolter+ ’06], [Song+ ’08-’11], [Chau+ ’11], …

Motivation (3): Fraud Detection. Figure labels: fraudster; fraudsters?; lax controls? [Neville+ ’05], [Chau+ ’07], [McGlohon+ ’09], …

Motivation (4): Ranking. [Brin+ ’98], [Tong+ ’06], [Ji+ ’11], …

Our Contributions.
Theory: correspondence BP ≈ RWR ≈ SSL; linearization for BP; convergence criteria for linearized BP.
Practice: the FaBP algorithm, which is fast, accurate, and scalable; experiments on DBLP, Web, and Kronecker graphs.

Roadmap. Background (Belief Propagation; Random Walk with Restarts; Semi-supervised Learning); Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Background. Apologies for diversion…

Background 1: Belief Propagation (BP). An iterative message-based method: 1st round, 2nd round, ... until the stop criterion is fulfilled. The "propagation matrix" is indexed by the class of the "sender" and the class of the "receiver", and it encodes either homophily or heterophily. Its diagonal entries are usually equal; this common value is the homophily factor h. The "about-half" homophily factor is h_h = h - 0.5.
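
The propagation matrix and the "about-half" factor can be sketched for the two-class case as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the off-diagonal value 1 - h is an assumption (each row is taken to sum to 1), since the slide only fixes the diagonal.

```python
import numpy as np

def propagation_matrix(h: float) -> np.ndarray:
    """Two-class propagation matrix with homophily factor h on the diagonal.

    Assumption: rows sum to 1, so every off-diagonal entry is 1 - h.
    """
    return np.array([[h, 1.0 - h],
                     [1.0 - h, h]])

def about_half(h: float) -> float:
    """The "about-half" homophily factor h_h = h - 0.5."""
    return h - 0.5

# h > 0.5 favors same-class neighbors (homophily);
# h < 0.5 favors opposite-class neighbors (heterophily).
homophily = propagation_matrix(0.8)
heterophily = propagation_matrix(0.2)
```

With this convention the sign of h_h distinguishes the two regimes: positive for homophily, negative for heterophily.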

Background 1: Belief Propagation Equations. [Pearl ’82], [Yedidia+ ’02], …, [Pandit+ ’07], [Gonzalez+ ’09], [Chechetka+ ’10]
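
For reference, the standard sum-product form of the BP updates found in the cited literature is sketched below. The notation is an assumption chosen for this sketch: \phi_i is the prior belief of node i, \psi_{ij} is the propagation-matrix entry for a pair of classes, N(i) is the set of neighbors of i, and \eta is a normalization constant.

```latex
% Message from node i to node j about class x_j
m_{ij}(x_j) = \sum_{x_i} \phi_i(x_i)\, \psi_{ij}(x_i, x_j)
              \prod_{n \in N(i) \setminus j} m_{ni}(x_i)

% Final belief of node i for class x_i
b_i(x_i) = \eta\, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i)
```

The products over incoming messages are what make the update non-linear in the beliefs.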

Background 2: Semi-Supervised Learning (SSL). Graph-based SSL uses a few labeled data points and exploits neighborhood information. [Figure: Step 1, Step 2.] [Zhou ’06], [Ji, Han ’10], …
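
A minimal sketch of graph-based SSL, assuming the linear-system form [I + a (D - A)] x = y that this deck lists for SSL in its correspondence table (A is the adjacency matrix, D the diagonal degree matrix, y the known labels, x the scores). The function name, the toy path graph, and the value of a are illustrative assumptions.

```python
import numpy as np

def ssl_scores(A: np.ndarray, y: np.ndarray, a: float) -> np.ndarray:
    """Solve [I + a (D - A)] x = y for the SSL scores x."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))  # diagonal degree matrix
    return np.linalg.solve(np.eye(n) + a * (D - A), y)

# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Node 0 is labeled +1, node 3 is labeled -1, nodes 1 and 2 are unlabeled.
y = np.array([1.0, 0.0, 0.0, -1.0])
x = ssl_scores(A, y, a=1.0)
```

The unlabeled nodes receive scores whose sign follows the nearer labeled node, which is the neighborhood effect the slide describes.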

Background 3: Personalized Random Walk with Restarts (RWR). [Brin+ ’98], [Haveliwala ’03], [Tong+ ’06], [Minkov, Cohen ’07], …
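
A minimal sketch of personalized RWR, assuming the linear-system form [I - c A D^-1] x = (1 - c) y that this deck lists for RWR in its correspondence table (y is the restart vector, x the scores). The function name, the toy graph, and c = 0.5 are illustrative assumptions; a dense solve is used only because the example is tiny.

```python
import numpy as np

def rwr_scores(A: np.ndarray, y: np.ndarray, c: float) -> np.ndarray:
    """Solve [I - c A D^-1] x = (1 - c) y for the RWR scores x."""
    n = A.shape[0]
    degrees = A.sum(axis=0)   # column sums; equal to node degrees for an undirected graph
    W = A / degrees           # broadcasting divides column j by d_j, i.e. A D^-1
    return np.linalg.solve(np.eye(n) - c * W, (1.0 - c) * y)

# Hypothetical 4-node path graph 0 - 1 - 2 - 3, with restarts at node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0])
x = rwr_scores(A, y, c=0.5)
```

Because A D^-1 is column-stochastic, the scores sum to the total mass of y, and in this example they decay with distance from the restart node.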

Qualitative Comparison of GBA Methods
GBA method | Heterophily | Scalability | Convergence
RWR | ✗ | ✓ | ✓
SSL | ✗ | ✓ | ✓
BP | ✓ | ✓ | ?
FaBP | ✓ | ✓ | ✓

Roadmap. Previous work: Background. New work: Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments; Conclusions.

Linearized BP. Theorem [Koutra+]: BP is approximated by the linear system [I + a D - c' A] b_h = φ_h, where b_h holds the final beliefs, φ_h holds the prior beliefs, and a and c' are scalar constants. Sketch of proof: odds ratio; Maclaurin expansions. [Figure: a belief p_i on the interval from 0 to 1, measured relative to 0.5.]
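
The two ingredients named in the proof sketch can be illustrated with a small worked example. This shows only the flavor of the approximations, under the assumption that a belief is written as p = 1/2 + \epsilon with small \epsilon; it is not the full proof.

```latex
% Odds ratio of a belief close to one half
\frac{p}{1-p} = \frac{1/2 + \epsilon}{1/2 - \epsilon}
              = \frac{1 + 2\epsilon}{1 - 2\epsilon}
              \approx 1 + 4\epsilon

% Maclaurin expansion of the logarithm, truncated after the linear term
\ln(1 + \epsilon) \approx \epsilon
```

Dropping the higher-order terms in \epsilon is what turns the products of messages in BP into sums, and hence into a linear system.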

Linearized BP vs BP. BP is approximated by Linearized BP. Original Belief Propagation [Yedidia+] is non-linear; our proposal, Linearized BP, is linear.

Linearized BP: convergence. Theorem: Linearized BP converges if the 1-norm < 1 OR the Frobenius norm < 1; the resulting bounds depend on the degrees of the nodes. Sketch of proof.
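
A sufficient-condition check in the spirit of this theorem can be sketched numerically. The sketch rewrites the linear system as b_h = φ_h + W b_h with W = c' A - a D and tests whether the 1-norm or the Frobenius norm of W is below 1, which guarantees that the series of powers of W converges. Two things are assumptions here: the constants a = 4 h_h^2 / (1 - 4 h_h^2) and c' = 2 h_h / (1 - 4 h_h^2) are not shown on the slide and should be checked against the accompanying paper, and the slide's closed-form bounds on h_h are not reproduced. Function names are hypothetical.

```python
import numpy as np

def iteration_matrix(A: np.ndarray, h_h: float) -> np.ndarray:
    """W = c' A - a D, so that [I + a D - c' A] b = phi reads b = phi + W b.

    Assumption: a and c' take the values below (not shown on the slide).
    """
    denom = 1.0 - 4.0 * h_h ** 2
    a = 4.0 * h_h ** 2 / denom
    c_prime = 2.0 * h_h / denom
    D = np.diag(A.sum(axis=1))
    return c_prime * A - a * D

def norm_criterion_holds(W: np.ndarray) -> bool:
    """True if the 1-norm or the Frobenius norm of W is below 1."""
    one_norm = np.linalg.norm(W, 1)        # maximum absolute column sum
    frobenius = np.linalg.norm(W, "fro")   # square root of the sum of squared entries
    return bool(one_norm < 1.0 or frobenius < 1.0)

# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

small_h_ok = norm_criterion_holds(iteration_matrix(A, 0.002))
large_h_ok = norm_criterion_holds(iteration_matrix(A, 0.4))
```

Both norms are cheap to compute, which is why they are practical as convergence tests; they are sufficient conditions only, so a failed check does not prove divergence.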

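Correspondence of Methods
Method | Matrix | Unknown | Known
RWR | [I - c A D^-1] | x | (1 - c) y
SSL | [I + a (D - A)] | x | y
FaBP | [I + a D - c' A] | b_h | φ_h
Each row reads "Matrix × Unknown = Known". The unknowns x and b_h are the final labels/beliefs, the known vectors y and φ_h are the prior labels/beliefs, A is the adjacency matrix, and D is the diagonal matrix of node degrees d1, d2, d3, ...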

RWR ≈ SSL. Theorem: RWR and SSL are identical if the individual homophily strength of node i (SSL) is matched to the fly-out probability (RWR). Simplification: use a single global homophily strength for all nodes (SSL).

RWR ≈ SSL: example. Similar scores and identical rankings. [Figure: SSL scores against RWR scores with the line y = x, for individual and global homophily strength.]

Proposed algorithm: FaBP. ① Pick the homophily factor. ② Solve the linear system. ③ (Optional) If accuracy is low, run BP with prior beliefs.
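
Steps ① and ② can be sketched as follows, assuming the linear system [I + a D - c' A] b_h = φ_h from this deck. The constants a = 4 h_h^2 / (1 - 4 h_h^2) and c' = 2 h_h / (1 - 4 h_h^2) are an assumption, since the slides do not show them; the function names, the toy graph, and the red/green decoding are illustrative only. Step ③ (falling back to BP) is not implemented, and a dense solver stands in for whatever scalable solver a real run would use.

```python
import numpy as np

def fabp_constants(h_h: float) -> tuple[float, float]:
    """Scalar constants a and c' (assumed form, not shown on the slides)."""
    denom = 1.0 - 4.0 * h_h ** 2
    return 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom

def fabp(A: np.ndarray, phi_h: np.ndarray, h_h: float) -> np.ndarray:
    """Solve [I + a D - c' A] b_h = phi_h for the final beliefs b_h."""
    a, c_prime = fabp_constants(h_h)
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    return np.linalg.solve(np.eye(n) + a * D - c_prime * A, phi_h)

# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Step 1: pick the "about-half" homophily factor (0.002, as in the experiments).
h_h = 0.002
# Prior beliefs around zero: node 0 leans red, node 3 leans green, others unknown.
phi_h = np.array([0.001, 0.0, 0.0, -0.001])

# Step 2: solve the linear system and read the class off the sign.
b_h = fabp(A, phi_h, h_h)
labels = np.where(b_h > 0, "red", "green")
```

In this toy run the two unlabeled nodes take the class of their nearer labeled neighbor, as expected under homophily.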

Datasets. p% of the nodes are labeled initially. Classes: YahooWeb: .edu/others; DBLP: AI/not AI. Accuracy is computed on a hold-out set.
Dataset | # nodes | # edges
YahooWeb | 1,413,511,390 | 6,636,600,779 (6 billion!)
Kronecker 1 | 177,147 | 1,977,149,596
Kronecker 2 | 120,552 | 1,145,744,786
Kronecker 3 | 59,049 | 282,416,924
Kronecker 4 | 19,683 | 40,333,924
DBLP | 37,791 | 170,794

Specs (Hadoop version not stated). M45 Hadoop cluster (Yahoo!): 500 machines, 4000 cores, 1.5 PB total storage, 3.5 TB of memory. 100 machines were used for the experiments.

Roadmap. Background; Linearized BP; Correspondence of Methods; Proposed Algorithm; Experiments (1. Accuracy, 2. Convergence, 3. Sensitivity, 4. Scalability, 5. Parallelism); Conclusions.

Results (1): Accuracy. All points lie on the diagonal, so the scores are near-identical. [Figure: scatter plot of beliefs in BP against beliefs in FaBP for (h, priors) = (0.5±0.002, 0.5±0.001), with 0.3% labels; classes AI and non-AI.]

Results (2): Convergence. FaBP achieves maximum accuracy within the convergence bounds. [Figure: % accuracy with respect to h_h (priors = ±0.001, 0.3% labels), with the convergence bounds marked: 1-norm, Frobenius norm, and |e_val| = 1.]

Results (3): Sensitivity to the homophily factor. FaBP is robust to the homophily factor h_h within the convergence bounds. [Figure: % accuracy with respect to h_h (priors = ±0.001, 0.3% labels), with the convergence bounds marked: 1-norm, Frobenius norm, and |e_val| = 1.]

Note (for all plots): values are averaged over 10 runs; the error bars are tiny.

Results (3): Sensitivity to the prior beliefs. FaBP is robust to the prior beliefs φ_h. [Figure: % accuracy with respect to the prior beliefs' magnitude (h_h = ±0.002), for p = 0.1%, 0.3%, 0.5%, and 5% labeled nodes.]

Results (4): Scalability. FaBP's runtime is linear in the number of edges. [Figure: runtime (min) against # of edges (Kronecker graphs).]

Results (5): Parallelism. FaBP is ~2x faster and wins or ties on accuracy. [Figure axes: # of steps and runtime (min); % accuracy and runtime (min).]

Our Contributions (summary).
Theory: correspondence BP ≈ RWR ≈ SSL ✓; linearization for BP ✓; convergence criteria for linearized BP ✓.
Practice: the FaBP algorithm ✓, which is fast (~2x faster), accurate (same/better), and scalable (6 billion edges!); experiments on DBLP, Web, and Kronecker graphs ✓.

Thanks. Data, Funding: NSC; ILLINOIS (Ming Ji, Jiawei Han).

Thank you!

Q: Can we have multiple classes? A: Yes! For example, with a propagation matrix over the classes AI, ML, and DB.

Q: Which of the methods do you recommend? A: (Fast) Belief Propagation. Reasons: a solid Bayesian foundation; support for heterophily and multiple classes through the propagation matrix.

Q: Why is FaBP faster than BP? A: BP sends 2|E| messages per iteration, whereas FaBP updates |V| records per "power method" iteration, and |V| < 2|E|.
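
The "power method" iteration mentioned in this answer can be sketched as the fixed-point update b ← φ_h + c' A b - a D b, which keeps one belief value per node. This is a minimal single-machine sketch, not the Hadoop implementation used in the experiments. The constants a and c' are assumed to be 4 h_h^2 / (1 - 4 h_h^2) and 2 h_h / (1 - 4 h_h^2), which the slides do not show; the function names and the fixed iteration count are also assumptions.

```python
import numpy as np

def linear_constants(h_h: float) -> tuple[float, float]:
    """Scalar constants a and c' (assumed form, not shown on the slides)."""
    denom = 1.0 - 4.0 * h_h ** 2
    return 4.0 * h_h ** 2 / denom, 2.0 * h_h / denom

def fabp_power_method(A: np.ndarray, phi_h: np.ndarray, h_h: float,
                      num_iters: int = 50) -> np.ndarray:
    """Iterate b <- phi_h + c' A b - a D b, storing one value per node."""
    a, c_prime = linear_constants(h_h)
    degrees = A.sum(axis=1)
    b = phi_h.copy()
    for _ in range(num_iters):
        # One matrix-vector product per iteration; the state is a length-|V| vector.
        b = phi_h + c_prime * (A @ b) - a * degrees * b
    return b

# Hypothetical 4-node path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
phi_h = np.array([0.001, 0.0, 0.0, -0.001])
h_h = 0.002
b = fabp_power_method(A, phi_h, h_h)
```

When the convergence criteria hold, the iterates approach the solution of [I + a D - c' A] b_h = φ_h, so the loop can replace a direct solve on graphs too large to factorize.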