Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad


Presentation transcript:

Graph X-Ray: Fast Best-Effort Pattern Matching in Large Attributed Graphs. Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad. 8/13/2007, KDD 2007, San Jose.

Input: Attributed Data Graph and Query Graph. Output: Matching Subgraph. Let me start with a synthetic example to illustrate what we want to do. We are given a large attributed graph G whose nodes each carry one categorical attribute, for example a who-talks-to-whom graph whose nodes have a job title as the attribute, such as CEO, SEC, Accountant, and Manager. Given a query graph H_q, we want to find the subgraph H_t that matches the query graph as well as possible. Here, for a loop query H_q, the matching subgraph H_t is shown in the right figure. Next, we will use this example to explain some terminology used in G-Ray.

Terminology: "Conform". First, we say the subgraph H_t conforms to the query graph H_q if it contains all the desired job titles and the connections between them. Figure: the matching subgraph conforms to the query graph.

Terminology: "Interception". We allow indirect connections by introducing some extra nodes. For example, the connection between nodes 12 and 4 is indirect. We refer to this phenomenon as interception and to the extra nodes, e.g., node 13, as intermediate nodes. All remaining nodes, e.g., nodes 11, 12, 4, and 7, are matching nodes. Figure: the path 12-13-4 is an interception.
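The distinction between matching and intermediate nodes can be sketched in a few lines of Python. This is only an illustration: the function name `split_nodes`, the dictionary layout, and the exact edge set of the loop query are assumptions made for the example, not something taken from the slides.

```python
def split_nodes(mapping, paths):
    """Split the nodes of a matching subgraph into matching and intermediate nodes.

    mapping: query node -> data node that instantiates it.
    paths:   query edge (a, b) -> list of data nodes realising that edge.
    """
    matching = set(mapping.values())
    intermediate = set()
    for path in paths.values():
        # interior nodes of a path that instantiate no query node are intermediate
        intermediate.update(n for n in path[1:-1] if n not in matching)
    return matching, intermediate


# Toy version of the running example (the loop's edge set is assumed).
mapping = {"SEC": 11, "CEO": 12, "Accountant": 4, "Manager": 7}
paths = {
    ("SEC", "CEO"): [11, 12],
    ("SEC", "Manager"): [11, 7],
    ("CEO", "Accountant"): [12, 13, 4],   # the interception from the slide
    ("Manager", "Accountant"): [7, 4],
}
matching, intermediate = split_nodes(mapping, paths)
print(sorted(matching), sorted(intermediate))
```

With this input, node 13 is the only intermediate node, and nodes 11, 12, 4, and 7 are the matching nodes, as on the slide.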

Terminology: "Instantiate". Whenever we have a matching subgraph H_t, we say H_t instantiates the query graph H_q, and the matching nodes in H_t instantiate the nodes in the query graph. For example, node 11 in H_t instantiates the SEC node in the query graph, and so on. Figure: node 11 instantiates the SEC node; H_t instantiates H_q.

Roadmap: Introduction (Problem Definition, Motivations); How to: Graph X-Ray; Experimental Results; Conclusion. We have introduced the problem definition and the necessary terminology. So, why do we care about this problem?

Motivation: Why Not SQL? Case 1: An exact match does not exist. Q: How to find an approximate answer? Case 2: Too many exact matches. Q: How to rank them? At first glance, we could use standard SQL to address this problem. However, SQL returns no result if an exact match does not exist. At the other extreme, if there are many exact matches, we would like to rank them automatically according to some goodness function.

Motivation: Why Not SQL? (Cont.) Case 3: An exact match might not be the best answer. "Find a CEO who has heavy contact with an Accountant." Q: How to find the right one? Furthermore, I want to claim that in some scenarios an exact match is not always the best answer. For example, if we want to find CEO…… The left subgraph is an exact match, with one direct connection between CEO 12 and Accountant 1, while the right subgraph is an inexact match with many indirect connections between the CEO and the Accountant. In this case, the inexact match might be the better answer, since it reflects the real suspicious relationship. Figure: exact match (1 direct connection) versus inexact match (many indirect connections).

Motivation: Efficiency. Why Not Subgraph Isomorphism? It is polynomial for a fixed size of pattern query. Q1: How to scale up linearly? Q2: … and with a small slope? Computationally, our problem is polynomial with respect to the size of the data graph for a fixed size of pattern query, which is prohibitive for large graphs. So, how can we develop an approximate algorithm that is linear with respect to the data graph? Furthermore, we would like this linear algorithm to scale with a small slope, so that the response time is fast.

Wish List. Effectiveness: both exact and inexact matches; ranking among multiple results; "best" answer (proximity-based). Efficiency: scale linearly; scale with a small slope. To summarize, this is our wish list. Without going into the details, I want to claim that our method, G-Ray, meets all of these requirements. G-Ray meets all!

Roadmap: Introduction (Problem Definition, Motivations); How to: Graph X-Ray; Experimental Results; Conclusion. Next, I will introduce how G-Ray works. There are two key concepts behind G-Ray. Once we have clarified these concepts, the algorithm itself is quite straightforward.

Preliminary: Center-Piece Subgraph [Tong+]. The first concept behind G-Ray is the center-piece subgraph (CePS): given some query nodes in a graph, how can we find the nodes that have strong connections to all or most of the query nodes? For example, …… Originally, CePS was designed for plain graphs (i.e., no attributes on nodes). In G-Ray, we use CePS as a basic operation. Figure: original graph, with the query nodes in black. CePS is a meta operation in G-Ray!
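To make the center-piece idea concrete, here is a minimal Python sketch. It assumes that proximity is measured by random walk with restart (the "rwr" mentioned in the seed-finder notes) and that the per-query scores are combined by a product, so that a node must be close to all query nodes to score well. The function names, the restart probability, and the toy graph are assumptions for illustration, not the CePS implementation.

```python
import math


def rwr(adj, start, restart=0.15, iters=100):
    """Random-walk-with-restart scores from `start` on an undirected graph.

    adj: dict node -> list of neighbours (every node has at least one).
    """
    r = {n: 0.0 for n in adj}
    r[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for u, nbrs in adj.items():
            for v in nbrs:
                # the walker follows a uniformly chosen edge with prob. 1 - restart
                nxt[v] += (1 - restart) * r[u] / len(nbrs)
        nxt[start] += restart  # and jumps back to the start with prob. restart
        r = nxt
    return r


def center_piece(adj, queries):
    """Non-query node with the largest product of RWR scores over all queries."""
    scores = [rwr(adj, q) for q in queries]
    candidates = [n for n in adj if n not in queries]
    return max(candidates, key=lambda n: math.prod(s[n] for s in scores))


# Toy graph: "c" sits between both query nodes, "x" hangs off q1 only.
adj = {"q1": ["c", "x"], "c": ["q1", "q2"], "q2": ["c"], "x": ["q1"]}
print(center_piece(adj, ["q1", "q2"]))
```

On this toy graph, "c" is returned, because "x" is close to q1 but far from q2. A sum instead of a product would favour nodes that are close to most, rather than all, of the query nodes.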

Preliminary: Augmented Graph. Data nodes (1, …, 13) and attribute nodes. Another key concept is the augmented graph. Given an attributed graph, we augment it with some additional nodes, one for each attribute value. For example, for the attributed graph shown at the very beginning, we introduce four additional nodes: a red star node for Accountant, a yellow square node for CEO, and so on. We refer to these newly added nodes as attribute nodes and to the original nodes as data nodes. Furthermore, we put a directed edge from each attribute node to every data node that has that attribute value. An important observation about the augmented graph is that the proximity between an attribute node and a data node, e.g., between the red star node (the Accountant node) and data node 11, is proportional to the average proximity score between node 11 and all data nodes that have the attribute value Accountant. Without going into the details, I would like to mention that this operation reduces the computational time a lot. Footnote: the augmented graph is crucial for computation!
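The construction and the proportionality observation can be checked numerically on a toy graph. The sketch below is an assumption-laden illustration: the helper names `augment` and `rwr`, the `("attr", value)` naming of attribute nodes, the restart probability, and the four-node graph are all made up for the example. With random walk with restart as the proximity measure and restart probability c, the constant of proportionality in this setup works out to (1 - c).

```python
def augment(data_edges, labels):
    """Build the augmented graph as a dict node -> set of out-neighbours.

    Data edges are undirected; each attribute node ("attr", value) gets a
    directed edge to every data node carrying that value.
    """
    out = {n: set() for n in labels}
    for u, v in data_edges:
        out.setdefault(u, set()).add(v)
        out.setdefault(v, set()).add(u)
    for node, value in labels.items():
        out.setdefault(("attr", value), set()).add(node)
    return out


def rwr(out_edges, start, restart=0.15, iters=200):
    """Random walk with restart along out-edges, by power iteration."""
    r = {n: 0.0 for n in out_edges}
    r[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in out_edges}
        for u, nbrs in out_edges.items():
            for v in nbrs:
                nxt[v] += (1 - restart) * r[u] / len(nbrs)
        nxt[start] += restart
        r = nxt
    return r


labels = {1: "Accountant", 4: "Accountant", 11: "SEC", 12: "CEO"}
edges = [(1, 12), (12, 11), (11, 4), (4, 1)]
aug = augment(edges, labels)

c = 0.15
from_attr = rwr(aug, ("attr", "Accountant"), restart=c)
from_1, from_4 = rwr(aug, 1, restart=c), rwr(aug, 4, restart=c)
average = (from_1[11] + from_4[11]) / 2
# the two printed values agree up to floating-point error
print(from_attr[11], (1 - c) * average)
```

The practical point is that one walk from the attribute node replaces one walk per data node carrying that attribute, which is where the saving in computation comes from.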

G-Ray: quick overview (for the loop query). Step 1: SF; Step 2: NE; Step 3: BR; Step 4: NE; Step 5: BR; Step 6: NE; Step 7: BR; Step 8: BR. (SF: Seed-Finder; NE: Neighborhood-Expander; BR: Bridge.) Now we are ready to give the details of the algorithm. Here is how the algorithm finds the matching subgraph for the loop query shown at the very beginning. In G-Ray, we build the matching subgraph H_t gradually. There are three modules in the algorithm. First, at step 1, when the resulting subgraph H_t is still empty, it calls the seed-finder module to find a very promising matching data node with some attribute value required by the query graph. Here, the seed-finder finds node 11 as the matching node for the SEC node in the query graph. Node 11 is referred to as the seed. Then, it recursively calls the neighborhood-expander and bridge modules until we have a complete matching subgraph. The neighborhood-expander (steps 2, 4, and 6) expands the partially built subgraph H_t by finding a good matching node with the attribute value desired by the query graph. For example, in step 2, we find node 12 as the matching node for the CEO node in the query graph, and so on. The bridge module (steps 3, 5, 7, and 8) finds a good path between two matching nodes if they are required to be connected according to the query graph. For example, in step 7, we find a path connecting nodes 12 and 4, with node 13 as an intermediate node on the path.
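The step order on the slide can be reproduced with a small scheduling sketch: find a seed, then repeatedly expand to a new query node and bridge it to every query neighbour that is already instantiated. This is a guess at the control flow under stated assumptions: the function name `schedule`, the breadth-first expansion order, and the edge set of the loop query are not given in the slides.

```python
from collections import deque


def schedule(query_edges, seed):
    """Return the list of (module, argument) steps for a query graph."""
    nbrs = {}
    for a, b in query_edges:
        nbrs.setdefault(a, []).append(b)
        nbrs.setdefault(b, []).append(a)

    steps = [("SF", seed)]          # seed-finder instantiates the first node
    done = [seed]
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v in done:
                continue
            steps.append(("NE", v))  # neighborhood-expander instantiates v
            for w in done:
                if w in nbrs[v]:     # bridge v to each instantiated neighbour
                    steps.append(("BR", (w, v)))
            done.append(v)
            queue.append(v)
    return steps


# Assumed loop query: SEC - CEO - Accountant - Manager - SEC.
loop = [("SEC", "CEO"), ("SEC", "Manager"),
        ("CEO", "Accountant"), ("Manager", "Accountant")]
for number, step in enumerate(schedule(loop, "SEC"), start=1):
    print(number, *step)
```

Under these assumptions the modules are called in the order SF, NE, BR, NE, BR, NE, BR, BR, and step 7 is the bridge between the CEO and Accountant nodes, which matches the path between nodes 12 and 4 described in the notes.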

Seed-Finder. Q: How to instantiate the SEC node? A (footnote): node 11 is close to some unknown data nodes for CEO, Accountant, and Manager. Next, I will briefly go through each module. First, the seed-finder module. Again, in this graph, how can we find the matching node for SEC when the resulting subgraph H_t is empty? In other words, how can we instantiate the SEC node? We claim that the matching node for the SEC node should be the center-piece with respect to all the other attribute nodes in the augmented graph. That is, a promising SEC node should have strong connections to all three attribute nodes, representing Accountant, CEO, and Manager, respectively. Moreover, when combining the individual RWR scores to compute the center-piece score, we put different weights on different RWR scores. For example, we put more weight on the RWR from the CEO node (yellow square) than on the RWR from the Accountant node (red star), since in the query graph the SEC node (green circle) is more relevant to the CEO node than to the Accountant node.
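A weighted combination of this kind might look as follows. The slides do not say how the weights are chosen or how the scores are combined, so the weighted sum of log-scores used here, the weight values, the proximity numbers, and the second candidate (node 5) are all hypothetical.

```python
import math


def seed_finder(candidates, prox, weights):
    """Pick the candidate with the best weighted combination of proximities.

    candidates: data nodes carrying the seed's attribute value.
    prox:       attribute value -> {data node: proximity from that attribute node}.
    weights:    attribute value -> weight (larger = more relevant to the seed).
    """
    def score(node):
        # weighted sum of logs, i.e. a weighted product of the proximities
        return sum(w * math.log(prox[attr][node]) for attr, w in weights.items())

    return max(candidates, key=score)


# Hypothetical RWR scores from the three attribute nodes to two SEC candidates.
prox = {
    "CEO":        {11: 0.20, 5: 0.05},
    "Accountant": {11: 0.05, 5: 0.10},
    "Manager":    {11: 0.15, 5: 0.10},
}
# CEO and Manager are assumed adjacent to SEC in the query, so they weigh more.
weights = {"CEO": 2.0, "Accountant": 1.0, "Manager": 2.0}
print(seed_finder([11, 5], prox, weights))
```

Here node 11 wins even though node 5 is closer to the Accountant attribute node, because the CEO and Manager proximities carry more weight.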

Neighborhood-Expander. Q: How to instantiate the CEO node? Step 1 → Step 2? Step 3 → Step 4? Step 5 → Step 6? Next, the neighborhood-expander module. For example, in step 2, how can we find the matching node for CEO, given that we have already instantiated the SEC node? Again, we use center-piece to find the matching node: the matching CEO node should be the center-piece with respect to node 11 and the attribute node for Accountant, the red star node. Similarly, the matching Manager node is the center-piece with respect to node 11 and the attribute node for Accountant, the red star node. The matching Accountant node (node 4) is the center-piece with respect to node 11 and node 7.

Bridge. Q: How to go from step 6 (NE) to step 7 (BR)? A: a Prim-like algorithm that maximizes the ratio of captured proximity to path length; it should block nodes 11 and 7. Footnote: connection subgraph, or one single path? Finally, we use the bridge module to find a good path between two matching nodes if they are required to be connected according to the query graph. For example, in step 7, we want to find a good path connecting nodes 12 and 4, since in the query graph H_q the CEO node and the Accountant node are required to be connected. We use a Prim-like algorithm (similar to the classic algorithms for finding a shortest path or minimum spanning tree on a graph), except that here we claim a good path should optimize this criterion: the ratio between the total proximity score captured along the path and the length of the path. Also, whenever we try to find a new path, it should not intersect the existing paths in the partially built subgraph. For example, the newly found path between 4 and 12 should not include nodes 11 and 7.
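The path criterion and the blocking rule can be illustrated with a brute-force search over simple paths. Note that this is not the Prim-like algorithm from the talk; it only evaluates the same objective on a toy graph. The function name `best_bridge`, the choice of counting path length in edges, the node scores, and the toy edges are assumptions for the example.

```python
def best_bridge(adj, score, src, dst, blocked=(), max_len=4):
    """Best simple path from src to dst (src != dst) by captured score / length.

    adj:     node -> list of neighbours.
    score:   node -> proximity score captured when the path visits it.
    blocked: nodes the path must avoid (already used by the partial subgraph).
    """
    best, best_ratio = None, float("-inf")

    def dfs(path, total):
        nonlocal best, best_ratio
        u = path[-1]
        if u == dst:
            ratio = total / (len(path) - 1)   # length counted in edges
            if ratio > best_ratio:
                best, best_ratio = list(path), ratio
            return
        if len(path) - 1 >= max_len:
            return
        for v in adj[u]:
            if v in path or v in blocked:
                continue
            path.append(v)
            dfs(path, total + score[v])
            path.pop()

    dfs([src], score[src])
    return best, best_ratio


# Toy neighbourhood of the running example (edges and scores are made up).
adj = {12: [13, 11], 13: [12, 4], 11: [12, 7], 7: [11, 4], 4: [13, 7]}
score = {12: 0.3, 13: 0.2, 4: 0.3, 11: 0.5, 7: 0.5}
print(best_bridge(adj, score, 12, 4, blocked={11, 7}))
print(best_bridge(adj, score, 12, 4))
```

With nodes 11 and 7 blocked, the only admissible path is 12-13-4. Without blocking, the higher-scoring path through 11 and 7 would be chosen, which would reuse nodes that already belong to the partially built subgraph.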

Roadmap: Introduction (Problem Definition, Motivation); How to: Graph X-Ray; Experimental Results; Conclusion. Now, let’s see some experimental results.

Experimental Results: Datasets. DBLP. Node: author (315k). Edge: co-authorship (1,800k). Attribute: conference and year (13k), e.g., KDD-2001, SIGMOD… We use DBLP to construct an attributed graph, where the nodes are authors and the attribute is conference and year. The edges are constructed from the co-authorship relationship.

Effectiveness: star-query. Here is a star query: we want to find a star-shaped group of co-authors, with one author coming from each of PODS, IAT, and ISBMS. We see that Dr. Philip S. Yu is in the center, and the rest of the matching authors are well-known domain experts in each conference. Figure: query and result.

Effectiveness: line-query. And here is a line query: we want to find authors from 4 different conferences who cooperate in a line fashion. Figure: result.

Effectiveness: loop-query. And this is a loop query. Figure: result.

Efficiency. Figure: response time versus # of edges (up to ~2 M edges); scales linearly, with a small slope; 3-5 seconds. This is the result on the response time of G-Ray, where the x-axis is … and the y-axis is … Clearly, G-Ray scales linearly with respect to the data graph. Furthermore, by careful implementation, we can make the slope very small, as the red line shows. Typically, the average response time per subgraph is several seconds.

Roadmap: Introduction (Problem Definition, Motivation); How to: Graph X-Ray; Experimental Results; Conclusion.

Conclusion. Graph X-Ray (G-Ray): best-effort pattern matching in large attributed graphs; scales linearly with a small slope. More details in the poster session: Monday (tonight), board number 8. Well, we have introduced our work, G-Ray. It does best-effort pattern matching in large attributed graphs, and it scales linearly with respect to the data graph. If you are interested in this work, please come to the poster session and let us discuss more details.

G-Ray X-Ray www.cs.cmu.edu/~htong Thank you!