


1 CS728 Lecture 16 Web indexes II

2 Last Time
Indexes for answering text queries
–given a term, produce all URLs containing it
–compact representations for postings: used gap (Elias) encodings
–for the dictionary: used pointers into a string of terms
Today’s lecture
–Indexes for connectivity testing
–Distance and transitive closure
–Data structure: 2-hop covers

3 Connectivity Server
Support for fast queries on the web graph
–Which URLs point to a given URL?
–Which URLs does a given URL point to?
Stores mappings in memory from URL to outlinks and from URL to inlinks
Applications
–Crawl control
–Web graph analysis: connectivity, crawl optimization
–Link analysis

4 Adjacency lists
The set of neighbors of a node
Assume each URL is represented by an integer
E.g., for a 4 billion page web, we need 32 bits per node
Naively, this demands 64 bits to represent each hyperlink (a 32-bit source ID plus a 32-bit destination ID)

5 Adjacency list compression
Properties exploited in compression:
–Similarity (between lists)
–Locality (many links from a page go to “nearby” pages)
–Use gap encodings in sorted lists
–Distribution of gap values
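The gap-encoding idea can be illustrated with a small sketch. It assumes node IDs are positive integers stored in strictly increasing order and uses Elias gamma codes for the gaps; the function names are made up for this example, and bit strings stand in for a real packed bit stream.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma encodes positive integers only")
    binary = bin(n)[2:]                       # binary digits without the '0b' prefix
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the digits


def encode_adjacency(sorted_ids):
    """Gap-encode a strictly increasing list of positive IDs with Elias gamma."""
    bits, prev = [], 0
    for node_id in sorted_ids:
        bits.append(elias_gamma(node_id - prev))  # store the gap, not the ID
        prev = node_id
    return "".join(bits)


def decode_adjacency(bits):
    """Invert encode_adjacency: read gamma codes and accumulate the gaps."""
    ids, pos, prev = [], 0, 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":               # count the unary length prefix
            zeros += 1
            pos += 1
        prev += int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        ids.append(prev)
    return ids


neighbors = [1000, 1003, 1004, 1010]          # "nearby" pages give small gaps
encoded = encode_adjacency(neighbors)
print(len(encoded), "bits instead of", 32 * len(neighbors))
```

Locality matters here because small gaps get short codes: a gap of 1 costs a single bit, while the first ID (a gap of 1000 from zero) costs 19 bits.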

6 Storage
A recent paper by Boldi and Vigna reports getting down to an average of ~3 bits/link
–(URL to URL edge)
–For a 118M node web graph
How? Why is this remarkable?

7 Main ideas of Boldi/Vigna
Consider the lexicographically ordered list of all URLs, e.g.,
–www.stanford.edu/alchemy
–www.stanford.edu/biology
–www.stanford.edu/biology/plant
–www.stanford.edu/biology/plant/copyright
–www.stanford.edu/biology/plant/people
–www.stanford.edu/chemistry

8 Boldi/Vigna
Each of these URLs has an adjacency list
Main thesis: because of templates, the adjacency list of a node is similar to that of one of the 7 preceding URLs in the lexicographic ordering
Express the adjacency list in terms of one of these
E.g., consider these adjacency lists
–1, 2, 4, 8, 16, 32, 64
–1, 4, 9, 16, 25, 36, 49, 64
–1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
–1, 4, 8, 16, 25, 36, 49, 64
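In the example lists, the fourth list equals the second with 8 in place of 9. The sketch below shows one simplified way to express a list in terms of an earlier one: a copy mask over the reference list plus a short list of extras. It is an illustration only, not the bit-level format of the Boldi/Vigna paper, and the function names and the "fewest extras" selection rule are assumptions made for this example.

```python
def encode_with_reference(target, reference):
    """Describe `target` as a copy mask over `reference` plus leftover extras."""
    target_set, reference_set = set(target), set(reference)
    copy_mask = [1 if x in target_set else 0 for x in reference]
    extras = [x for x in target if x not in reference_set]
    return copy_mask, extras


def decode_with_reference(copy_mask, extras, reference):
    """Rebuild the sorted target list from the reference, the mask and the extras."""
    copied = [x for x, keep in zip(reference, copy_mask) if keep]
    return sorted(copied + extras)


def best_reference(target, preceding, window=7):
    """Return the offset (1 = immediately preceding list) leaving the fewest extras.

    Returns 0 if no list in the window saves anything. Choosing by extras
    count is a simplification made for this sketch.
    """
    best_offset, best_extras = 0, len(target)
    for offset, ref in enumerate(reversed(preceding[-window:]), start=1):
        _, extras = encode_with_reference(target, ref)
        if len(extras) < best_extras:
            best_offset, best_extras = offset, len(extras)
    return best_offset


preceding = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
]
target = [1, 4, 8, 16, 25, 36, 49, 64]
offset = best_reference(target, preceding)
mask, extras = encode_with_reference(target, preceding[-offset])
print(offset, mask, extras)   # 2 [1, 1, 0, 1, 1, 1, 1, 1] [8]
```

With the second list as reference, the fourth list costs one mask bit per reference entry plus the single extra value 8, instead of eight full node IDs.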

9 Connectivity Queries
Beyond adjacency we’d like to answer
–Transitive closure: is there a path from x to y?
–Distance: what is the length of the shortest path from x to y?
Applications
–Link analysis
–XML path queries with wildcards

10 Naïve Solutions
Given a graph, either precompute:
–Compute and store all-pairs shortest paths (APSP)
–Answer any query in constant time
–Space requirements?
OR answer online:
–Given a query, compute single-source shortest paths (SSSP)
–No additional space
–Time to answer query?
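The two naive options can be put side by side in a few lines. This sketch assumes an unweighted directed graph stored as a dict from node to list of successors, so breadth-first search plays the role of SSSP; the function names are illustrative.

```python
from collections import deque


def bfs_distances(graph, source):
    """Online option: one BFS per query, no stored index."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in dist:                 # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def all_pairs(graph):
    """Precomputed option: store a distance table for every source node."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    return {u: bfs_distances(graph, u) for u in nodes}


graph = {1: [2], 2: [3], 3: [], 4: [1]}
apsp = all_pairs(graph)                       # up to n * n stored entries
print(apsp[4].get(3))                         # lookup: distance from 4 to 3
print(bfs_distances(graph, 4).get(3))         # same answer, recomputed per query
```

The table answers each query with a dictionary lookup but can hold on the order of n² entries, while the online version stores nothing and pays for a traversal of up to n + m nodes and edges on every query.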

11 Encoding Problem
Find a compact representation for the transitive closure
–whose size is comparable to the data's size
–that supports connection tests (almost) as fast as the naive transitive closure lookup
–that can be built efficiently for large data sets

12 Main Idea: 2-Hop Covers and 2-Hop Labeling
A 2-hop cover is a set of hops (x,y) such that every connected pair is covered by 2 hops
For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
For each connection (a,b),
–choose a node c on the path from a to b (center node)
–add c to Lout(a) and to Lin(b)
[Figure: path a → c → b]
Then (a,b) ∈ transitive closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
Reachability and distance queries via 2-hop labels (Cohen et al., SODA 2002)
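The labeling rule and the intersection test translate almost directly into code. The class below is a minimal sketch with hand-chosen centers; the class and method names and the toy graph (two sources a1, a2 feeding a hub c that points to b1, b2) are assumptions for illustration, not part of the lecture.

```python
class TwoHopLabels:
    """Hand-built 2-hop labels: Lout and Lin map each node to a set of centers."""

    def __init__(self):
        self.l_out = {}
        self.l_in = {}

    def cover(self, a, b, center):
        """Cover connection (a, b) with `center`, a node on some path a -> b."""
        self.l_out.setdefault(a, set()).add(center)
        self.l_in.setdefault(b, set()).add(center)

    def reachable(self, a, b):
        """(a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect."""
        return not self.l_out.get(a, set()).isdisjoint(self.l_in.get(b, set()))

    def size(self):
        """Total number of label entries, the quantity a 2-hop cover minimizes."""
        return (sum(len(s) for s in self.l_out.values())
                + sum(len(s) for s in self.l_in.values()))


labels = TwoHopLabels()
# Every connection in the toy graph passes through c, so c is the center for all.
for a in ("a1", "a2", "c"):
    for b in ("c", "b1", "b2"):
        if a != b:
            labels.cover(a, b, "c")

print(labels.reachable("a1", "b2"), labels.reachable("b1", "a1"), labels.size())
```

In this toy graph a single well-placed center covers all 8 connections with 6 label entries, which is the kind of saving the label-size objective rewards.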

13 2-hop Covers
Conjecture: 2-hop covers of size O(n √m) always exist
Goal: minimize the sum of the label sizes
Problem is NP-hard
–=> approximation required
Theorem: There exists a polynomial-time algorithm that approximates the optimum within a factor of O(log n)
–Greedy (set cover) algorithm

14 Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: All connections are uncovered
⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1 to 6, and the bipartite center graph of a candidate center node with input side I and output side O]
Density of densest subgraph: here the same as the initial density
(We can cover 8 connections with 6 cover entries)

15 Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: All connections are uncovered
⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1 to 6, and the center graph of another candidate center node with input side I and output side O]
Density of densest subgraph = initial density (graph is complete)
Cover the connections in the subgraph with the greatest density using the corresponding center node

16 Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Next step: Some connections are already covered
⇒ Consider the center graph of candidates
[Figure: example graph on nodes 1 to 6, and the reduced center graph of a candidate center node after the first step]
Repeat this algorithm until all connections are covered
Theorem: The generated cover is optimal up to a logarithmic factor
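The greedy loop can be sketched as follows. One loud simplification: instead of finding the densest subgraph of each center graph, this sketch scores a candidate by its whole center graph (uncovered connections covered per new label entry), so it illustrates the structure of the algorithm without carrying the logarithmic guarantee. All function names are illustrative, and the quadratic closure computation is only suitable for tiny graphs.

```python
def transitive_closure(graph):
    """Return (nodes, closure); closure holds each pair (a, b), a != b, with a path a -> b."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    closure = set()
    for source in nodes:
        seen, stack = {source}, [source]
        while stack:                              # iterative depth-first search
            u = stack.pop()
            for v in graph.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        closure.update((source, t) for t in seen if t != source)
    return nodes, closure


def greedy_two_hop_cover(graph):
    """Simplified greedy cover: whole center graph instead of densest subgraph."""
    nodes, closure = transitive_closure(graph)
    reach = closure | {(v, v) for v in nodes}     # every node trivially reaches itself
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}
    uncovered = set(closure)                      # initial step: nothing is covered
    while uncovered:
        best_ratio, best_center, best_edges = -1.0, None, None
        for w in sorted(nodes):                   # sorted only to make ties deterministic
            # Center graph of w: uncovered connections (a, b) with a -> w -> b.
            edges = [(a, b) for (a, b) in uncovered
                     if (a, w) in reach and (w, b) in reach]
            if not edges:
                continue
            new_out = {a for a, _ in edges if w not in l_out[a]}
            new_in = {b for _, b in edges if w not in l_in[b]}
            ratio = len(edges) / (len(new_out) + len(new_in))
            if ratio > best_ratio:
                best_ratio, best_center, best_edges = ratio, w, edges
        for a, b in best_edges:                   # cover them with the chosen center
            l_out[a].add(best_center)
            l_in[b].add(best_center)
        uncovered.difference_update(best_edges)
    return l_out, l_in


def reachable(l_out, l_in, a, b):
    """Reachability test for a != b via label intersection."""
    return not l_out[a].isdisjoint(l_in[b])


graph = {1: [3], 2: [3], 3: [4, 5]}
l_out, l_in = greedy_two_hop_cover(graph)
cover_size = sum(len(s) for s in l_out.values()) + sum(len(s) for s in l_in.values())
print(cover_size)
```

On this toy graph node 3 wins the first round, since its center graph holds all 8 connections at a cost of 6 new label entries, and the loop stops after one iteration.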

17 Experimental Results
Small example from the real world: subset of DBLP
–6,210 documents (publications)
–168,991 elements
–25,368 links (citations)
–14 megabytes (uncompressed XML)
Element-level graph has 168,991 nodes and 188,149 edges
Its transitive closure: 344,992,370 connections (2,632.1 MB)

18 Experimental Results
For the example above:
–Transitive closure: 344,992,370 connections
–Two-hop cover: 1,289,930 entries
–Compression factor of ~267
–Queries are still fast (~7.6 entries/node)
But: computation took 45 hours and 80 GB RAM!
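The reported ratios can be rechecked from the raw counts. The 8 bytes per connection used for the closure size is an assumption (two 4-byte node IDs per pair, with MB read as 2^20 bytes); it is not stated in the slides, though it happens to reproduce the 2,632.1 MB figure.

```python
closure_connections = 344_992_370
cover_entries = 1_289_930
graph_nodes = 168_991

compression_factor = closure_connections / cover_entries
entries_per_node = cover_entries / graph_nodes

# Assumption: each stored connection is a pair of 4-byte node IDs.
closure_megabytes = closure_connections * 8 / 2**20

print(round(compression_factor), round(entries_per_node, 1), round(closure_megabytes, 1))
```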

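19 Why Distances are Difficult
Should be simple to add: store the distance to the center with each label entry
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]
Lout(v) = {u, …} and Lin(w) = {u, …} become Lout(v) = {(u,2), …} and Lin(w) = {(u,4), …}
⇒ dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6
Is this correct?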

20 Why Distances are Difficult
[Figure: v → u → w with distances 2 and 4, plus a direct edge from v to w]
dist(v,w) = 1
⇒ Center node u does not reflect the correct distance between v and w
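The pitfall is easy to reproduce with a small distance query over labels. This sketch assumes each label maps a center to its distance and that the query takes the minimum over shared centers; the dictionary layout and function name are illustrative.

```python
import math


def two_hop_distance(l_out, l_in, a, b):
    """Minimum of dist(a, c) + dist(c, b) over centers c shared by Lout(a) and Lin(b)."""
    out_a = l_out.get(a, {})
    in_b = l_in.get(b, {})
    best = math.inf                               # inf means no shared center
    for center, d_ac in out_a.items():
        if center in in_b:
            best = min(best, d_ac + in_b[center])
    return best


# Labels built only from center u, which is not on a shortest path from v to w.
l_out = {"v": {"u": 2}}
l_in = {"w": {"u": 4}}
print(two_hop_distance(l_out, l_in, "v", "w"))    # 6, but the true distance is 1

# Adding a center that lies on a shortest path (here v itself) repairs the answer.
l_out["v"]["v"] = 0
l_in["w"]["v"] = 1
print(two_hop_distance(l_out, l_in, "v", "w"))    # 1
```

The query is only as good as the centers chosen: reachability is satisfied by any shared center, whereas a correct distance needs at least one shared center on a shortest path.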

21 Solution: Distance-aware Centergraph
Add edges to the center graph only if the corresponding connection is a shortest path
Correct, but with problems:
–Expensive to build the center graph (2 additional lookups per connection)
–Approximation bound is no longer tight
[Figure: example graph on nodes 1 to 6 and a center graph with input side I and output side O]
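The shortest-path filter amounts to one comparison per candidate connection. The sketch below assumes a precomputed distance table keyed by node pairs (including zero self-distances); the table layout and the function name are assumptions for illustration.

```python
def distance_aware_center_graph(dist, center, uncovered):
    """Keep (a, b) only if going through `center` is a shortest path from a to b."""
    edges = []
    for a, b in uncovered:
        d_ac = dist.get((a, center))              # additional lookup 1
        d_cb = dist.get((center, b))              # additional lookup 2
        if d_ac is None or d_cb is None:
            continue                              # center is not on any path a -> b
        if d_ac + d_cb == dist[(a, b)]:
            edges.append((a, b))
    return edges


# Distances for the example v -> u -> w (2 and 4) with a direct edge v -> w.
dist = {
    ("v", "v"): 0, ("u", "u"): 0, ("w", "w"): 0,
    ("v", "u"): 2, ("u", "w"): 4, ("v", "w"): 1,
}
uncovered = [("v", "u"), ("u", "w"), ("v", "w")]
print(distance_aware_center_graph(dist, "u", uncovered))
```

For center u the connection (v, w) is filtered out because 2 + 4 exceeds the true distance 1, so u can no longer claim to cover it.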

22 Enhancements
Allow approximate distances in exchange for more compact representations

