An Algorithm for Enumerating SCCs in Web Graph Jie Han, Yong Yu, Guowei Liu, and Guirong Xue Speaker : Seo, Jong Hwa.


1 An Algorithm for Enumerating SCCs in Web Graph Jie Han, Yong Yu, Guowei Liu, and Guirong Xue Speaker : Seo, Jong Hwa

2 Jong H. Seo - Realtime OS Lab. Table Of Contents: Abstract and Introduction; Related Work; The Split-Merge Algorithm; Experiments and Results (not covered); Conclusions (not covered)

3 Abstract and Introduction. Web graph and its connectivity; problem recognition and our goal.

4 Web Graph And Its Connectivity. The World Wide Web consists of pages and hyperlinks, which map to the nodes and edges of a directed graph (the web graph). Connectivity analysis is an important part of research on the web graph. To study connectivity and compute the structure of the web graph, SCC (strongly connected component) enumeration is the most common and important analysis.

5 Problem Recognition. When the graph contains hundreds of millions of nodes and billions of edges, traditional algorithms are hard to use because they become intractable in both time and space. The web graph is so large that the full graph can hardly be loaded into main memory.

6 Our Goal. In this paper, we investigate some properties of the web graph and propose a feasible algorithm for enumerating its SCCs. The algorithm finishes within a week, whereas the traditional algorithm can hardly be applied to this web graph because it may run for years.

7 The Following Sections. Section 2: We review some traditional algorithms for enumerating SCCs in a general directed graph. Section 3: We describe some special properties of the web graph and propose an algorithm to enumerate SCCs in this graph. Section 4: We discuss the detailed implementation of this algorithm on the web graph in China.

8 Related Work. Broder and Kumar’s web diagram as a bow tie; Tarjan’s algorithm; Sharir’s algorithm; Lisa K. Fleischer’s parallel algorithm.

9 Graph Structure In The Web (1/3). CORE: the largest SCC of the graph. IN: pages outside CORE from which at least one path leads to some node in CORE. OUT: pages outside CORE that can be reached from some node in CORE. TENDRILS: pages that are reachable from IN, or that can reach OUT, without passing through CORE.
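The four regions above can be computed with one forward and one backward traversal from CORE. The following is a minimal sketch, assuming the CORE node set is already known; the function names `reachable` and `bowtie`, the `OTHER` bucket (tendrils plus any disconnected pages) and the toy graph are illustrative assumptions, not taken from the paper.

```python
from collections import deque


def reachable(adj, sources):
    """Return every node reachable from `sources` by BFS over `adj`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bowtie(nodes, edges, core):
    """Classify nodes as CORE / IN / OUT / OTHER, given the CORE node set."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)   # original direction
        bwd.setdefault(v, []).append(u)   # reversed direction
    core = set(core)
    out_set = reachable(fwd, core) - core          # reached from CORE
    in_set = reachable(bwd, core) - core           # can reach CORE
    other = set(nodes) - core - in_set - out_set   # tendrils and the rest
    return {"CORE": core, "IN": in_set, "OUT": out_set, "OTHER": other}


# Hypothetical toy graph: p -> a <-> b -> q, plus p -> t.
edges = [("p", "a"), ("a", "b"), ("b", "a"), ("b", "q"), ("p", "t")]
parts = bowtie("pabqt", edges, core={"a", "b"})
```

Here `p` lands in IN, `q` in OUT, and `t` (reachable from IN without passing through CORE) in the leftover bucket.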

10 Graph Structure In The Web (2/3). IN can be viewed as the set of new pages that link to pages they find interesting but have not yet been discovered by CORE. OUT can be viewed as well-known pages whose links point to internal pages only. TENDRILS can be viewed as pages that have not yet been discovered by the web.

11 Graph Structure In The Web (3/3). Deeper analysis reveals the connectivity of the web graph: if pages u and v are chosen at random, the probability that a path exists from u to v is only ¼.

12 Tarjan’s Algorithm. Tarjan presented an algorithm that decomposes a directed graph into strongly connected components in O(n+e) time, where n denotes the number of nodes and e the number of edges.
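As an illustration of the single-pass idea, here is a minimal recursive sketch of Tarjan's algorithm. The adjacency-dictionary input format and the name `tarjan_scc` are assumptions made for this example; the slides do not show an implementation.

```python
def tarjan_scc(nodes, adj):
    """Return the SCCs of a directed graph as a list of node lists."""
    index = {}        # discovery order of each visited node
    low = {}          # smallest index reachable from the node's DFS subtree
    stack = []        # nodes whose SCC is not yet finished
    on_stack = set()
    sccs = []
    counter = 0

    def visit(u):
        nonlocal counter
        index[u] = low[u] = counter
        counter += 1
        stack.append(u)
        on_stack.add(u)
        for v in adj.get(u, ()):
            if v not in index:            # tree edge: recurse first
                visit(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:           # edge back into the open stack
                low[u] = min(low[u], index[v])
        if low[u] == index[u]:            # u is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == u:
                    break
            sccs.append(comp)

    for u in nodes:
        if u not in index:
            visit(u)
    return sccs


# Hypothetical toy graph: a -> b -> c -> a, and c -> d.
adj = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
components = tarjan_scc("abcd", adj)
```

The recursion keeps the sketch short, but Python's default recursion limit makes it unsuitable for deep graphs; an explicit stack would be needed at any realistic scale.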

13 Sharir’s Algorithm. Sharir’s algorithm finds all SCCs in a directed graph in O(n+e) time. It uses the transpose of the original graph (the graph with every edge reversed).
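The transpose-based approach is usually described as two depth-first passes: one over G to record completion order, and one over the transpose in reverse completion order. The sketch below follows that common formulation; the edge-list input and the name `sharir_scc` are assumptions for illustration.

```python
def sharir_scc(nodes, edges):
    """Two-pass SCC enumeration using the transpose graph."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)    # G
        radj.setdefault(v, []).append(u)   # transpose of G

    # Pass 1: iterative DFS on G, recording nodes as they finish.
    order, seen = [], set()
    for s in nodes:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(adj.get(s, ())))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(adj.get(v, ()))))
                    break
            else:                          # all successors explored
                order.append(u)
                stack.pop()

    # Pass 2: traverse the transpose in reverse finishing order;
    # each traversal collects exactly one SCC.
    sccs, assigned = [], set()
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        comp, todo = [], [s]
        while todo:
            u = todo.pop()
            comp.append(u)
            for v in radj.get(u, ()):
                if v not in assigned:
                    assigned.add(v)
                    todo.append(v)
        sccs.append(comp)
    return sccs


# Hypothetical toy graph: a -> b -> c -> a, and c -> d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
components = sharir_scc("abcd", edges)
```

Both passes touch each node and edge a constant number of times, which matches the O(n+e) bound stated on the slide.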

14 Fleischer’s Algorithm. A divide-and-conquer algorithm built on three node sets for a pivot node v: Pred(G, v), the nodes that can reach v; Desc(G, v), the nodes reachable from v; and Rem(G, v), the remaining nodes. SCC(G, v) = Pred(G, v) ∩ Desc(G, v). The algorithm works efficiently on multiprocessors and can use both DFS and BFS.
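A sequential sketch of the divide-and-conquer scheme is given below. It assumes the usual reading of the three sets (Pred = nodes that can reach the pivot, Desc = nodes reachable from it, Rem = everything else) and processes the three leftover parts from a work list instead of handing them to separate processors; the names `reach` and `dcsc` are hypothetical.

```python
def reach(start, adj, allowed):
    """Nodes reachable from `start` via `adj`, staying inside `allowed`."""
    seen, todo = {start}, [start]
    while todo:
        u = todo.pop()
        for v in adj.get(u, ()):
            if v in allowed and v not in seen:
                seen.add(v)
                todo.append(v)
    return seen


def dcsc(nodes, edges):
    """Divide-and-conquer SCC enumeration around a pivot node."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        radj.setdefault(v, []).append(u)

    sccs = []
    work = [set(nodes)]                  # independent sub-problems
    while work:
        part = work.pop()
        if not part:
            continue
        pivot = next(iter(part))
        desc = reach(pivot, adj, part)   # Desc(G, v)
        pred = reach(pivot, radj, part)  # Pred(G, v)
        scc = pred & desc                # SCC(G, v) = Pred ∩ Desc
        sccs.append(scc)
        # No SCC spans two of these parts, so they can be solved separately.
        work.append(pred - scc)
        work.append(desc - scc)
        work.append(part - pred - desc)  # Rem(G, v)
    return sccs


# Hypothetical toy graph: a -> b -> c -> a, and c -> d.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
components = dcsc("abcd", edges)
```

The three parts pushed onto `work` share no nodes, which is what makes the scheme attractive for parallel execution.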

15 Introduction To The Split-Merge Algorithm (1/2). The conventional algorithms are sometimes not applicable to the web graph, which consists of several hundred million nodes and several billion edges.

16 Introduction To The Split-Merge Algorithm (2/2). Although machines with 8 GB of main memory are common in many organizations involved in web graph research, and powerful web graph compression algorithms are available, it is sometimes still impossible to load the entire graph into main memory. Link information then has to be loaded from hard disk into main memory back and forth while traversing the graph, and the time cost of this I/O is unaffordable. So it is infeasible to enumerate SCCs in the web graph in a straightforward way.

17 Basic Idea On Split-Merge Algorithm (1/2)
1. Classify the nodes of graph G into n groups. Build a sub-graph from each group of nodes and the links among them.
2. Decompose each sub-graph into SCCs. If the sub-graph is small enough, use any algorithm for enumerating SCCs. Otherwise, recursively apply the split-merge algorithm.
3. Treat each SCC in a sub-graph as a single node and eliminate the duplicated links between them. We obtain the contracted graph G’, a graph composed of all the SCCs.

18 Basic Idea On Split-Merge Algorithm (2/2)
4. Decompose the contracted graph G’ into SCCs. If G’ is small enough, use any algorithm for enumerating SCCs. Otherwise, recursively apply the split-merge algorithm.
5. Merge the SCCs from the sub-graphs with the help of the decomposition of G’.
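The five steps can be sketched in memory as follows. This is a simplified illustration, not the authors' implementation: the recursive branches of steps 2 and 4 are omitted, the helper `scc` is a deliberately simple (not linear-time) stand-in for "any algorithm for enumerating SCCs", and `group_of` is a hypothetical caller-supplied function that assigns each node to a group.

```python
def scc(nodes, edges):
    """Simple in-memory SCC routine (forward/backward reachability)."""
    adj, radj = {}, {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        radj.setdefault(v, []).append(u)

    def reach(start, graph, allowed):
        seen, todo = {start}, [start]
        while todo:
            u = todo.pop()
            for v in graph.get(u, ()):
                if v in allowed and v not in seen:
                    seen.add(v)
                    todo.append(v)
        return seen

    comps, remaining = [], set(nodes)
    while remaining:
        pivot = next(iter(remaining))
        comp = reach(pivot, adj, remaining) & reach(pivot, radj, remaining)
        comps.append(frozenset(comp))
        remaining -= comp
    return comps


def split_merge(nodes, edges, group_of):
    # Step 1: split the nodes into groups; keep only intra-group links.
    groups = {}
    for u in nodes:
        groups.setdefault(group_of(u), []).append(u)
    intra = {g: [] for g in groups}
    for u, v in edges:
        if group_of(u) == group_of(v):
            intra[group_of(u)].append((u, v))

    # Step 2: decompose each sub-graph (recursion omitted in this sketch).
    super_of, members = {}, []   # node -> SCC id, SCC id -> node set
    for g, g_nodes in groups.items():
        for comp in scc(g_nodes, intra[g]):
            for u in comp:
                super_of[u] = len(members)
            members.append(comp)

    # Step 3: contract. The set drops duplicated links; self-loops are skipped.
    contracted = {(super_of[u], super_of[v])
                  for u, v in edges if super_of[u] != super_of[v]}

    # Step 4: decompose the contracted graph G'.
    # Step 5: merge the sub-graph SCCs that fall into one SCC of G'.
    return [frozenset().union(*(members[i] for i in comp))
            for comp in scc(range(len(members)), contracted)]


# Hypothetical toy graph: cycle a -> b -> c -> d -> a, plus d -> e,
# split into the groups {a, b} and {c, d, e}.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("d", "e")]
result = split_merge("abcde", edges, lambda u: 0 if u in "ab" else 1)
```

In this toy split no cycle lies inside a single group, so the contracted graph is as large as the original one and all the work happens in step 4.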

19 A Directed Graph G. [Figure: directed graph G with nodes A, B, C, D, E, F, G, H, I, J.] The directed graph G consists of 10 nodes and 15 edges. It will be split into three sub-graphs.

20 Three Sub-Graphs. [Figure: graph G partitioned into three sub-graphs labeled 1, 2 and 3.] The largest sub-graph contains only 4 nodes and 5 edges.

21 The Contracted Graph G’ (1/2). After each sub-graph is decomposed, we can contract the graph into G’. G’ contains only 5 nodes and 6 edges and can be decomposed into ((A, B, C), (E, F), (G, H, J), (D)) and (I). [Figure: contracted graph G’ with nodes (A, B, C), (E, F), (G, H, J), (D) and (I).]

22 The Contracted Graph G’ (2/2). By merging the results from the last two diagrams, we can enumerate all the SCCs in the original graph G: (A, B, C, E, F, G, H, J, D) and (I).
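The merge in this example amounts to taking the union of the sub-graph SCCs that fall into the same SCC of G’. A minimal sketch using the groupings stated on the slides follows; the index-based encoding of G’ and the function name `merge` are assumptions for illustration.

```python
# SCCs found inside the sub-graphs; each one is a node of G'.
sub_sccs = [("A", "B", "C"), ("E", "F"), ("G", "H", "J"), ("D",), ("I",)]

# Decomposition of G', written as indices into sub_sccs:
# the first four nodes of G' form one SCC, (I) stays alone.
contracted_sccs = [[0, 1, 2, 3], [4]]


def merge(sub_sccs, contracted_sccs):
    """Union the sub-graph SCCs grouped together by the decomposition of G'."""
    return [sorted(n for i in comp for n in sub_sccs[i])
            for comp in contracted_sccs]


merged = merge(sub_sccs, contracted_sccs)
# merged == [['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J'], ['I']]
```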

23 Pros On Split-Merge Algorithm. Both the sub-graphs and the contracted graph G’ are much smaller in scale than the original graph G. If the web graph is split into sub-graphs, an entire sub-graph can be loaded into main memory while it is decomposed. Thus, the extra cost of splitting and merging seems affordable compared with swapping edges back and forth between hard disk and main memory.

24 Cons On Split-Merge Algorithm (1/3). Another way to split graph G. [Figure: graph G partitioned differently into three sub-graphs labeled 1, 2 and 3.]

25 Cons On Split-Merge Algorithm (2/3). [Figure: contracted graph G’ resulting from this split.] The scale of the contracted graph G’ is only a bit smaller.

26 Cons On Split-Merge Algorithm (3/3). The basic split-merge algorithm does not work well with a poor split: the contracted graph G’ is only slightly smaller than G, so G’ must be split again.

27 What Remains Now? What remains is to find a way to split the web graph appropriately. This seems difficult to do well if only the link information is considered, so we take advantage of special properties of the potential relationship between pages and sites in the web graph.

