Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad


1 Graph X-Ray: Fast Best-Effort Pattern Matching in Large Attributed Graphs
Hanghang Tong, Brian Gallagher, Christos Faloutsos, Tina Eliassi-Rad. KDD 2007, San Jose, 8/13/2007. LLNL

2 Input and Output
Input: Attributed Data Graph, Query Graph. Output: Matching Subgraph.
Let me start with a synthetic example to illustrate what we want to do. We are given a large attributed graph G whose nodes each carry one categorical attribute. For example, in a who-talks-to-whom graph, each node has a job title as its attribute, such as CEO, SEC, Accountant or Manager. Given a query graph H_q, we want to find the subgraph H_t that matches the query graph as well as possible. Here, for a loop query H_q, the matching subgraph H_t is shown in the right-hand figure. Next, we will use this example to explain the terminology used in G-Ray.

3 Terminology: "Conform"
Matching Subgraph conforms to Query Graph.
First, we say that the subgraph H_t conforms to the query graph H_q if it contains all the desired job titles and the connections between them.

4 Terminology: "Interception"
Matching Subgraph; Query Graph; matching nodes; intermediate node. Path is an interception.
We allow indirect connections by introducing extra nodes. For example, the connection between nodes 12 and 4 is indirect. We refer to this phenomenon as interception, and to the extra nodes, e.g. node 13, as intermediate nodes. All remaining nodes, e.g. nodes 11, 12, 4 and 7, are matching nodes.

5 Terminology: "Instantiate"
Matching Subgraph H_t; Query Graph H_q. Node 11 instantiates the SEC node; H_t instantiates H_q.
Whenever we have a matching subgraph H_t, we say that H_t instantiates the query graph H_q, and that the matching nodes in H_t instantiate the nodes of the query graph. For example, node 11 in H_t instantiates the SEC node in the query graph, and so on.

6 Roadmap
Introduction (Problem Definition, Motivations); How to: Graph X-Ray; Experimental Results; Conclusion
We have introduced the problem definition and the necessary terminology. So why do we care about this problem?

7 Motivation: Why Not SQL?
Case 1: Exact match does not exist. Q: How to find an approximate answer?
Case 2: Too many exact matches. Q: How to rank them?
At first glance, we could use standard SQL to address this problem. However, SQL returns no result if no exact match exists. At the other extreme, if there are many exact matches, we would like to rank them automatically according to some goodness function.

8 Motivation: Why Not SQL? (Cont.)
Case 3: An exact match might not be the best answer. "Find a CEO who has heavy contact with an Accountant." Q: How to find the right one?
Left: exact match, 1 direct connection. Right: inexact match, many indirect connections.
Furthermore, I want to claim that in some scenarios an exact match is not the best answer. For example, suppose we want to find a CEO who has heavy contact with an Accountant. The left subgraph is an exact match, with one direct connection between CEO 12 and Accountant 1. The right subgraph is an inexact match, with many indirect connections between the CEO and the Accountant. In this case the inexact match might be the better answer, since it reflects the real suspicious relationship.

9 Motivation: Efficiency
Why not subgraph isomorphism? Polynomial for a fixed-size pattern query. Q1: How to scale up linearly? Q2: ... and with a small slope?
In terms of computation, our problem is polynomial with respect to the size of the data graph for a fixed-size pattern query, which is prohibitive for large graphs. So how can we develop an approximate algorithm that is linear in the size of the data graph? Furthermore, we would like this linear algorithm to scale with a small slope, so that the response time is fast.

10 Wish List
Effectiveness: both exact and inexact matches; ranking among multiple results; "best" answer (proximity-based).
Efficiency: scale linearly; scale with a small slope.
G-Ray meets all!
To summarize, this is our wish list. Without going into the details, I want to claim that our method, G-Ray, meets all of these requirements.

11 Roadmap
Introduction (Problem Definition, Motivations); How to: Graph X-Ray; Experimental Results; Conclusion
Next, I will introduce how G-Ray works. There are two key concepts behind G-Ray. Once we have clarified these concepts, the algorithm itself is quite straightforward.

12 Preliminary: Center-Piece Subgraph [Tong+]
Original graph; black: query nodes (Q). CePS is the meta operation in G-Ray!
The first concept behind G-Ray is the center-piece subgraph (CePS): given some query nodes in a graph, how can we find the nodes that have strong connections to all or most of the query nodes? For example, ... Originally, CePS was designed for plain graphs (i.e. no attributes on the nodes). In G-Ray, we use CePS as the basic operation.
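To make the center-piece idea concrete, here is a minimal sketch in Python. It assumes that proximity is measured by random walk with restart (RWR) and that the per-query scores are combined by a plain product, so that a node scores high only if it is close to every query node. The function names, the restart probability and the toy graph are illustrative assumptions, not taken from the slides.

```python
def rwr(adj, source, restart=0.15, iters=100):
    """Random walk with restart from `source`.

    adj maps each node to a list of its neighbours (no isolated nodes assumed).
    Returns a dict node -> approximate steady-state visiting probability.
    """
    score = {v: 0.0 for v in adj}
    score[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u in adj:
            # Each node passes the non-restarting share of its mass to its neighbours.
            share = (1.0 - restart) * score[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        nxt[source] += restart  # the restarting share jumps back to the source
        score = nxt
    return score


def centerpiece_scores(adj, queries, **kw):
    """Assumed 'AND' combination: product of the RWR scores from every query node."""
    per_query = [rwr(adj, q, **kw) for q in queries]
    combined = {}
    for v in adj:
        p = 1.0
        for s in per_query:
            p *= s[v]
        combined[v] = p
    return combined


# Toy graph: m lies between the query nodes a and b, x hangs off a only.
adj = {"a": ["m", "x"], "m": ["a", "b"], "b": ["m"], "x": ["a"]}
scores = centerpiece_scores(adj, ["a", "b"])
best = max((v for v in adj if v not in ("a", "b")), key=scores.get)
print(best)
```

In this toy graph the node between the two query nodes wins over the node attached to only one of them, which is the behaviour the slide describes.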

13 Preliminary: Augmented Graph
Data nodes 1,…,13; attribute nodes. Aug. Graph is crucial for computation!
OK, the other key concept is the augmented graph. Given an attributed graph, we augment it with additional nodes, one for each attribute value. For example, for the attributed graph shown at the very beginning, we introduce four additional nodes: a red star node for Accountant, a yellow square node for CEO, and so on. We refer to these newly added nodes as attribute nodes, and to the original nodes as data nodes. Furthermore, we put a directed edge from each attribute node to every data node having that attribute value. An important observation about the augmented graph is that the proximity between an attribute node and a data node, e.g. between the red star node (the Accountant node) and data node 11, is proportional to the average proximity score between node 11 and all data nodes that have the attribute value Accountant. Without going into the details, I would like to mention that this operation reduces the computation time considerably.
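The construction described in the notes can be sketched as follows. This is only an illustration: the `("attr", value)` tuple used to name attribute nodes, the function name `augment` and the four-node toy graph are assumptions made for the example, not details from the talk.

```python
def augment(edges, attr):
    """Build the augmented graph as a dict: node -> set of out-neighbours.

    edges: iterable of undirected (u, v) pairs between data nodes.
    attr:  dict mapping each data node to its attribute value.
    """
    out = {}
    for u, v in edges:
        # Data edges are undirected, so store both directions.
        out.setdefault(u, set()).add(v)
        out.setdefault(v, set()).add(u)
    for node, value in attr.items():
        out.setdefault(node, set())
        # One attribute node per value, with a directed edge to each data node
        # carrying that value (no edge back from the data node).
        out.setdefault(("attr", value), set()).add(node)
    return out


# Invented toy data: four data nodes, three attribute values.
edges = [(1, 2), (2, 3), (3, 4)]
attr = {1: "CEO", 2: "Accountant", 3: "Accountant", 4: "Manager"}
aug = augment(edges, attr)
print(sorted(aug[("attr", "Accountant")]))
```

Because the added edges point only from attribute nodes to data nodes, a walk that starts at an attribute node enters the data graph in one step and never returns to an attribute node.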

14 G-Ray: quick overview (for loop query)
Step 1: SF; Step 2: NE; Step 3: BR; Step 4: NE; Step 5: BR; Step 6: NE; Step 7: BR; Step 8: BR
SF: Seed-Finder; NE: Neighborhood-Expander; BR: Bridge
Now we are ready to give the details of the algorithm. Here is how it finds the matching subgraph for the loop query shown at the very beginning. In G-Ray, we build the matching subgraph H_t gradually, using three modules. First (step 1), when H_t is still empty, it calls the Seed-Finder module to find a very promising data node with the attribute value required by the query graph. Here, Seed-Finder finds node 11 as the matching node for the SEC node in the query graph; node 11 is referred to as the seed. Then it repeatedly calls Neighborhood-Expander and Bridge until we have a complete matching subgraph. Neighborhood-Expander (steps 2, 4 and 6) expands the partially built H_t by finding a good matching node with the attribute value required by the query graph. For example, in step 2 we find node 12 as the matching node for the CEO node in the query graph, and so on. Bridge (steps 3, 5, 7 and 8) finds a good path between two matching nodes if they are required to be connected according to the query graph. For example, in step 7 we find a path connecting nodes 12 and 4, with node 13 as an intermediate node on the path.
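One way to read the step list is as a schedule derived from the query graph. The sketch below is an assumption about that ordering, not the authors' stated procedure: it visits query nodes breadth-first from the seed, emits one NE call per newly matched node, and then one BR call per query edge back to an already matched node. The loop layout SEC, CEO, Accountant, Manager is also an assumed reading of the example.

```python
def plan(query_adj, seed):
    """Return the sequence of module calls for a query graph (adjacency dict of lists)."""
    steps = [("SF", seed)]
    matched = [seed]
    frontier = [seed]
    while frontier:
        cur = frontier.pop(0)
        for nb in query_adj[cur]:
            if nb in matched:
                continue
            steps.append(("NE", nb))  # instantiate the next query node
            for m in matched:
                if m in query_adj[nb]:
                    steps.append(("BR", m, nb))  # connect it to matched neighbours
            matched.append(nb)
            frontier.append(nb)
    return steps


# Assumed loop query: SEC - CEO - Accountant - Manager - SEC.
loop = {
    "SEC": ["CEO", "Manager"],
    "CEO": ["SEC", "Accountant"],
    "Manager": ["SEC", "Accountant"],
    "Accountant": ["CEO", "Manager"],
}
schedule = plan(loop, "SEC")
for i, step in enumerate(schedule, 1):
    print(i, step)
```

Under these assumptions the schedule has eight calls in the order SF, NE, BR, NE, BR, NE, BR, BR, which agrees with the step labels on the slide, including step 7 being the bridge between the CEO and Accountant nodes.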

15 Seed-Finder
Q: How to instantiate the SEC node? A: "11" is close to some unknown data nodes for "CEO", "Account." and "Manager".
Next, I will briefly go through each module, starting with Seed-Finder. Again, in this graph, how can we find the matching node for SEC when the resulting subgraph H_t is empty? In other words, how can we instantiate the SEC node? We claim that the matching node for the SEC node should be the center-piece with respect to all the other attribute nodes in the augmented graph. That is, a promising SEC node should have strong connections to all three attribute nodes, representing Accountant, CEO and Manager, respectively. Moreover, when combining the individual random-walk-with-restart (RWR) scores into the center-piece score, we put different weights on different RWR scores. For example, we put more weight on the RWR from the CEO node (yellow square) than on the RWR from the Accountant node (red star), since in the query graph the SEC node (green circle) is more relevant to the CEO node than to the Accountant node. And so on.
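The weighted combination can be illustrated with a small sketch. The slides do not say how the weights enter the score, so the weighted sum of log-proximities used here (equivalently, a weighted product) is an assumption, as are the function name `find_seed`, the candidate node 5, and all the numbers.

```python
import math


def find_seed(candidates, prox, weights):
    """Pick the candidate data node with the best weighted center-piece score.

    candidates: data nodes that carry the wanted attribute value (here: SEC).
    prox[a][v]: proximity from attribute node a to data node v, e.g. an RWR score.
    weights[a]: how relevant attribute a is to the query node being instantiated.
    """
    def score(v):
        # A floor avoids log(0) for nodes that a walk never reaches.
        return sum(w * math.log(max(prox[a][v], 1e-300)) for a, w in weights.items())

    return max(candidates, key=score)


# Invented proximities for two candidate SEC nodes.
prox = {
    "CEO": {11: 0.30, 5: 0.05},
    "Accountant": {11: 0.10, 5: 0.20},
    "Manager": {11: 0.20, 5: 0.20},
}
# Assumed weights: attributes adjacent to SEC in the query count more.
weights = {"CEO": 2.0, "Accountant": 1.0, "Manager": 2.0}
seed = find_seed([11, 5], prox, weights)
print(seed)
```

With these made-up numbers node 11 is selected: it is somewhat farther from the Accountant attribute node than node 5, but much closer to the more heavily weighted CEO attribute node.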

16 Neighborhood-Expander
Q: How to instantiate the CEO node? Step 1 → Step 2? Step 3 → Step 4? Step 5 → Step 6?
Next, the Neighborhood-Expander module. For example, in step 2, how can we find the matching node for CEO, given that we have already instantiated the SEC node? Again, we use center-piece to find the matching node: the matching CEO node should be the center-piece with respect to node 11 and the attribute node for Accountant (the red star node). Similarly, the matching Manager node is the center-piece with respect to node 11 and the attribute node for Accountant, and the matching Accountant node (node 4) is the center-piece with respect to node 11 and node 7.

17 Bridge
Step 6: NE; Step 7: BR. A: Prim-like algorithm, maximizing captured proximity per unit path length; should block nodes 11 and 7. Footnote: connection subgraph, or one single path?
Finally, we use the Bridge module to find a good path between two matching nodes if they are required to be connected according to the query graph. For example, in step 7 we want to find a good path connecting nodes 12 and 4, since in the query graph H_q the CEO node and the Accountant node are required to be connected. We use a Prim-like algorithm (similar to the classic algorithm for finding the shortest path or minimum spanning tree of a graph), except that here a good path should optimize this criterion: the ratio between the total proximity score captured along the path and the length of the path. Also, whenever we look for a new path, it should not intersect the existing paths in the partially built subgraph. For example, the newly found path between nodes 4 and 12 should not include nodes 11 and 7.
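The path criterion and the blocking rule can be shown on a toy graph. The sketch below is deliberately not the Prim-like algorithm from the talk: it enumerates every simple path by brute force, which is only feasible on tiny graphs. It assumes that "captured proximity" is a sum of per-node scores and that path length is the number of edges. Node ids are borrowed from the example, while the edges and scores are invented.

```python
def best_bridge(adj, src, dst, score, blocked=()):
    """Return (path, ratio) for the simple path src -> dst that avoids `blocked`
    and maximises sum(node scores on the path) / number of edges.

    Assumes src != dst. Returns (None, -inf) if no admissible path exists.
    """
    best, best_ratio = None, float("-inf")
    stack = [[src]]
    while stack:
        path = stack.pop()
        last = path[-1]
        if last == dst:
            ratio = sum(score[v] for v in path) / (len(path) - 1)
            if ratio > best_ratio:
                best, best_ratio = path, ratio
            continue
        for nb in adj[last]:
            # Keep paths simple and stay away from already used matching nodes.
            if nb not in path and nb not in blocked:
                stack.append(path + [nb])
    return best, best_ratio


# Invented toy neighbourhood around nodes 12 and 4.
adj = {12: [11, 13], 11: [12, 4], 13: [12, 4], 4: [11, 13, 7], 7: [4]}
score = {12: 0.4, 11: 0.9, 13: 0.3, 4: 0.4, 7: 0.1}

free_path, free_ratio = best_bridge(adj, 12, 4, score)
path, ratio = best_bridge(adj, 12, 4, score, blocked={11, 7})
print(free_path, path)
```

Without blocking, the search would route through node 11 because of its high score. With nodes 11 and 7 blocked, it falls back to the path through the intermediate node 13, mirroring the example in the notes.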

18 Roadmap
Introduction (Problem Definition, Motivation); How to: Graph X-Ray; Experimental Results; Conclusion
Now, let's see some experimental results.

19 Experimental Results
Dataset: DBLP. Nodes: authors (315k). Edges: co-authorship (1,800k). Attributes: conference & year (13k), e.g. KDD-2001, SIGMOD…
We use DBLP to construct an attributed graph in which the nodes are authors and the attribute is conference and year. The edges are constructed from the co-authorship relationship.

20 Effectiveness: star-query
Query; Result
Here is a star query: we want to find a star-shaped group of co-authors, with one author coming from each of PODS, IAT and ISBMS. We see that Dr. Philip Yu is in the center, and the remaining matching authors are well-known domain experts in each conference.

21 Effectiveness: line-query
Result
And here is a line query: we want to find authors from 4 different conferences who collaborate in a line fashion.

22 Effectiveness: loop-query
And this is a loop query. Result

23 Efficiency
Plot: response time vs. # of edges (~2 M edges). Scales linearly; small slope; 3-5 seconds.
This is the result on the response time of G-Ray, where the x-axis is ... and the y-axis is ... Clearly, G-Ray scales linearly with respect to the size of the data graph. Furthermore, with careful implementation, we can make the slope very small, as the red line shows. Typically, the average response time per subgraph is several seconds.

24 Roadmap
Introduction (Problem Definition, Motivation); How to: Graph X-Ray; Experimental Results; Conclusion

25 Conclusion
Graph X-Ray (G-Ray): best-effort pattern matching in large attributed graphs; scales linearly with a small slope. More details in the Poster Session: Monday (tonight), board number 8.
We have introduced our work, G-Ray. It does best-effort pattern matching in large attributed graphs, and it scales linearly with respect to the size of the data graph. If you are interested in this work, please come to the poster session and let us discuss more details.

26 G-Ray X-Ray Thank you!

