Weiren Yu 1, Xuemin Lin 1, Wenjie Zhang 1, Ying Zhang 1, Jiajin Le 2. SimFusion+: Extending SimFusion Towards Efficient Estimation on Large and Dynamic Networks.

Presentation transcript:

SimFusion+: Extending SimFusion Towards Efficient Estimation on Large and Dynamic Networks
Weiren Yu 1, Xuemin Lin 1, Wenjie Zhang 1, Ying Zhang 1, Jiajin Le 2
1 University of New South Wales & NICTA, Australia
2 Donghua University, China
SIGIR 2012

2 Contents
1. Introduction
2. Problem Definition
3. Optimization Techniques
4. Experimental Results

3 1. Introduction
- Many applications require a measure of “similarity” between objects (similarity search):
  - Citation of scientific papers (citeseer.com)
  - Recommender systems (amazon.com)
  - Graph clustering
  - Web search engines (google.com)

4 SimFusion: A New Link-based Similarity Measure
- Structural similarity measures
  - PageRank [Page et al., 99]
  - SimRank [Jeh and Widom, KDD 02]
- SimFusion similarity
  - A promising new structural measure [Xi et al., SIGIR 05]
  - Extension of the Co-Citation and Coupling metrics
- Basic philosophy
  - Follows the Reinforcement Assumption: the similarity between objects is reinforced by the similarity of their related objects.

5 SimFusion Overview
- Features
  - Uses a Unified Relationship Matrix (URM) to represent relationships among heterogeneous data
  - Defined recursively and computed iteratively
  - Applicable to any domain with object-to-object relationships
- Challenges
  - The URM may cause SimFusion to reach a trivial solution or to diverge.
  - Computing SimFusion on large graphs is costly.
    - Naïve iteration: matrix-matrix multiplication
    - Requires O(Kn^3) time and O(n^2) space [Xi et al., SIGIR 05]
  - No incremental algorithm when edges are updated

6 Existing SimFusion: URM and USM
- Data Space: a finite set of data objects (vertices)
- Data Relation (edges), given an entire space
  - Intra-type Relation: carries information within one space
  - Inter-type Relation: carries information between spaces
- Unified Relationship Matrix (URM): λ_{i,j} is the weighting factor between D_i and D_j
- Unified Similarity Matrix (USM)

7 Example: SimFusion Similarity on a Heterogeneous Domain
- High complexity: O(Kn^3) time, O(n^2) space
- Trivial solution: S = [1]_{n×n}

8 Contributions
- Revising the existing SimFusion model to avoid
  - non-semantic convergence
  - the divergence issue
- Optimizing the computation of SimFusion+
  - O(Km) pre-computation time, plus O(1) time and O(n) space
  - Better accuracy guarantee
- Incremental computation on edge updates
  - O(δn) time and O(n) space for handling δ edge updates

9 Revised SimFusion
- Motivation: two issues with the existing SimFusion model
  - Trivial solution on a heterogeneous domain
  - Divergent solution on a homogeneous domain
- Root cause: row normalization of the URM
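The trivial-solution issue can be illustrated with a minimal sketch. It assumes the original SimFusion update has the form S ← L S Lᵀ with a row-normalized relationship matrix L (the exact URM definition is not preserved in this transcript), and the 4-node weight matrix `W` is purely hypothetical. Because every row of L sums to 1, the all-ones matrix is a fixed point, and the iterates flatten to a single constant value:

```python
import numpy as np

def row_normalize(M):
    """Scale each row to sum to 1 (all-zero rows stay zero)."""
    sums = M.sum(axis=1, keepdims=True)
    return np.divide(M, sums, out=np.zeros_like(M, dtype=float), where=sums != 0)

def simfusion_naive(L, iterations):
    """Assumed original-style iteration S <- L S L^T, started from the identity."""
    S = np.eye(L.shape[0])
    for _ in range(iterations):
        S = L @ S @ L.T  # dense matrix-matrix products: O(n^3) per iteration
    return S

# Hypothetical relationship weights, for illustration only.
W = np.array([[1, 2, 0, 1],
              [2, 1, 1, 0],
              [0, 1, 1, 3],
              [1, 0, 3, 1]], dtype=float)
L = row_normalize(W)

ones = np.ones((4, 4))
fixed = L @ ones @ L.T          # all-ones is reproduced exactly by the update
S = simfusion_naive(L, 200)     # every entry ends up at the same value
spread = S.max() - S.min()
```

A similarity matrix whose entries are all equal carries no ranking information, which is consistent with the slide's point that row normalization is the root cause.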

10 From URM to UAM
- Unified Adjacency Matrix (UAM)
- Example

11 Revised SimFusion+
- Basic intuition
  - Replace the URM with the UAM so that “row normalization” is postponed, while preserving the reinforcement assumption of the original SimFusion
- Revised SimFusion+ model: the original SimFusion update, rescaled to squeeze the similarity scores in S into [0, 1]

12 Optimizing SimFusion+ Computation
- Conventional iterative paradigm
  - Matrix-matrix multiplication, requiring O(Kn^3) time and O(n^2) space
- Our approach: convert SimFusion+ computation into finding the dominant eigenvector of the UAM A.
  - Matrix-vector multiplication, requiring O(Km) time and O(n) space
  - Pre-compute σ_max(A) only once, and cache it for later reuse
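The two paradigms can be compared with a small sketch. It assumes the SimFusion+ update is S ← A S Aᵀ / ‖A S Aᵀ‖₂ (the model equation is not preserved in this transcript, so this form is an assumption), under which the outer product ξ ξᵀ of the unit dominant eigenvector ξ of A is a fixed point. The 4-node matrix `A` and all function names are hypothetical:

```python
import numpy as np

def dominant_eigenvector(A, iterations=500, tol=1e-12):
    """Power iteration: one matrix-vector product per step."""
    x = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iterations):
        y = A @ x
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            return y
        x = y
    return x

def simfusion_plus_matrix_iteration(A, iterations=100):
    """Assumed conventional update: S <- A S A^T / ||A S A^T||_2."""
    S = np.ones_like(A, dtype=float)
    for _ in range(iterations):
        M = A @ S @ A.T
        S = M / np.linalg.norm(M, 2)  # spectral norm keeps the scores bounded
    return S

# Hypothetical symmetric UAM of a small connected graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)

xi = dominant_eigenvector(A)      # computed once and cached
S_fast = np.outer(xi, xi)         # full matrix, built here only for comparison
S_slow = simfusion_plus_matrix_iteration(A)
pair_score = xi[0] * xi[3]        # a single pair needs only two cached entries
```

Under this assumption a single similarity lookup costs O(1) once the eigenvector is cached, which matches the complexity stated on the Contributions slide.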

13 Example: conventional iteration versus our approach (the worked example is not preserved in this transcript)

14 Key Observation
- Kronecker product “⊗”
- Vec operator
- Two important properties

15 Key Observation
- Two important properties: P1 and P2
- Our main idea: (1), (2), power iteration
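The formulas for P1 and P2 are not preserved in this transcript. The sketch below checks two standard identities about the Kronecker product and the vec operator that are commonly used for this kind of reduction; whether they are exactly the slide's P1 and P2 is an assumption. The matrices are random and the helper name `vec` is illustrative:

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one long vector (column-major order)."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
B, C, D = (rng.random((3, 3)) for _ in range(3))

# Identity 1: vec(B C D) = (D^T kron B) vec(C).
lhs = vec(B @ C @ D)
rhs = np.kron(D.T, B) @ vec(C)

# Identity 2: if A x = lam x, then (A kron A)(x kron x) = lam^2 (x kron x).
A = B + B.T                       # symmetric, so eigh applies
lams, vecs = np.linalg.eigh(A)    # eigenvalues in ascending order
lam_max, x = lams[-1], vecs[:, -1]
kron_lhs = np.kron(A, A) @ np.kron(x, x)
kron_rhs = lam_max ** 2 * np.kron(x, x)

# Consequence: x kron x equals vec(x x^T), so an eigenvector of the
# n^2-by-n^2 matrix A kron A is available from an eigenvector of A alone.
same = np.allclose(np.kron(x, x), vec(np.outer(x, x)))
```

With D = Aᵀ and B = A, the first identity rewrites an update of the form A S Aᵀ as a matrix-vector product with A ⊗ A, which is why power iteration becomes applicable.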

16 Accuracy Guarantee
- Conventional iterations: no accuracy guarantee. Question: ||S^(k+1) − S|| ≤ ?
- Our method: use the Arnoldi decomposition to build an order-k orthogonal subspace for the UAM A.
  - Because T_k is small and almost upper-triangular, computing σ_max(T_k) is less costly than computing σ_max(A).

17 Accuracy Guarantee
- Arnoldi decomposition
- k-th iterative similarity
- Error estimate
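A textbook Arnoldi iteration can be sketched as follows. It builds an orthonormal basis V_k and a small upper-Hessenberg matrix T_k, then approximates the dominant eigen-pair of A from T_k. The residual returned here is the standard Arnoldi residual estimate, not necessarily the error bound on the slide (which is not preserved in this transcript); the function names and the small matrix are hypothetical:

```python
import numpy as np

def arnoldi(A, v0, k):
    """Return V_k (n x k), T_k (k x k, upper Hessenberg) and h_{k+1,k}."""
    n = A.shape[0]
    V = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(k):
        w = A @ V[:, j]
        for i in range(j + 1):            # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:           # invariant subspace: stop early
            return V[:, :j + 1], H[:j + 1, :j + 1], 0.0
        V[:, j + 1] = w / H[j + 1, j]
    return V[:, :k], H[:k, :k], H[k, k - 1]

def dominant_ritz_pair(A, k):
    """Approximate the dominant eigen-pair of A from the small matrix T_k."""
    V, T, h_next = arnoldi(A, np.ones(A.shape[0]), k)
    vals, vecs = np.linalg.eig(T)
    p = np.argmax(vals.real)
    y = vecs[:, p].real
    y /= np.linalg.norm(y)
    x = V @ y
    if x.sum() < 0:                       # fix the sign for a nonnegative vector
        x, y = -x, -y
    residual = abs(h_next) * abs(y[-1])   # equals ||A x - theta x|| in exact arithmetic
    return vals[p].real, x, residual

# Hypothetical symmetric UAM of a small connected graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
theta, xi_k, err = dominant_ritz_pair(A, 3)
```

The eigenproblem is solved on the k-by-k matrix T_k rather than on A itself, and the residual gives a computable stopping criterion, which conventional matrix iteration lacks.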

18 Example: Arnoldi decomposition, steps (1) to (3) (the worked example is not preserved in this transcript)

19 Edge Update on Dynamic Graphs
- Incremental UAM
  - Given an old graph G = (D, R) and a new graph G' = (D, R'), the incremental UAM is a list of edge updates.
  - The incremental UAM is a sparse matrix when the number δ of edge updates is small.
- Main idea
  - Reuse the incremental UAM and the eigen-pair (α_p, ξ_p) of the old A to compute the updated similarities.
- Incrementally computing SimFusion+: O(δn) time, O(n) space
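The slide's incremental formulas are not preserved in this transcript, so the sketch below is not the authors' O(δn) algorithm. It swaps in a simpler technique, warm-started power iteration, to illustrate the same idea of reusing the old eigenvector and a sparse list of edge updates instead of recomputing from scratch. All names, the update format `(row, column, weight change)`, and the small graph are hypothetical:

```python
import numpy as np

def apply_updated(A_old, updates, x):
    """Compute (A_old + Delta) @ x without materializing the new matrix."""
    y = A_old @ x
    for i, j, dw in updates:      # Delta has one nonzero entry per edge update
        y[i] += dw * x[j]
    return y

def refresh_eigenvector(A_old, updates, xi_old, iterations=500, tol=1e-12):
    """Warm-started power iteration: start from the old dominant eigenvector."""
    x = xi_old / np.linalg.norm(xi_old)
    for step in range(1, iterations + 1):
        y = apply_updated(A_old, updates, x)
        y /= np.linalg.norm(y)
        if np.linalg.norm(y - x) < tol:
            return y, step
        x = y
    return x, iterations

# Hypothetical old UAM and its dominant eigenvector.
A_old = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
xi_old = np.linalg.eigh(A_old)[1][:, -1]
xi_old = xi_old if xi_old.sum() > 0 else -xi_old

# Remove the edges (P1,P2) and (P2,P1), using 0-based indices 0 and 1.
updates = [(0, 1, -1.0), (1, 0, -1.0)]
xi_new, steps = refresh_eigenvector(A_old, updates, xi_old)
S_new = np.outer(xi_new, xi_new)  # updated similarities under the assumed model
```

When δ is small, each correction pass over `updates` is cheap compared with the product by the old matrix, which mirrors the slide's remark that the incremental UAM stays sparse.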

20 Example Suppose edges (P1,P2) and (P2,P1) are removed.

21 Experimental Setting
- Datasets
  - Synthetic data (RAND 0.5M-3.5M)
  - Real data (DBLP, WEBKB)
- Compared algorithms
  - SimFusion+ and IncSimFusion+;
  - SF, a SimFusion algorithm via matrix iteration [Xi et al., SIGIR 05];
  - CSF, a variant of SF using the PageRank distribution [Cai et al., SIGIR 10];
  - SR, a SimRank algorithm via partial sums [Lizorkin et al., VLDBJ 10];
  - PR, a P-Rank algorithm encoding both in- and out-links [Zhao et al., CIKM 09].

22 Experiment (1): Accuracy (on DBLP and WEBKB)
- SF+ accuracy is consistently stable across datasets.
- SF hardly produces sensible similarities, because all its similarities asymptotically approach the same value as K grows.

23 Experiment (2): CPU Time and Space (on DBLP and on WEBKB)
- SF+ outperforms the other approaches, due to the use of σ_max(T_k).

24 Experiment (3): Edge Updates (varying δ)
- IncSF+ outperforms SF+ when δ is small, because a small δ preserves the sparseness of the incremental UAM.
- For larger δ, IncSF+ is less effective.

25 Experiment (4): Effects of …
- A small choice of … imposes more iterations on computing T_k and v_k, and hence increases the estimation cost.

26 Conclusions
- SimFusion+, a revision of SimFusion that prevents the trivial solution and the divergence issue of the original model.
- Efficient techniques to improve the time and space of SimFusion+ with accuracy guarantees.
- An incremental algorithm to compute SimFusion+ on dynamic graphs when edges are updated.
Future Work
- Devise vertex-updating methods for incrementally computing SimFusion+.
- Parallelize SimFusion+ computation on GPUs.
