CS728 Lecture 16 Web indexes II


Last Time
Indexes for answering text queries
–given a term, produce all URLs containing it
–Compact representations: for postings, used gap (Elias) encodings; for the dictionary, used pointers into a string of terms

Today's lecture
Indexes for connectivity testing
–Distance and transitive closure
–Data structure: 2-hop covers

Connectivity Server
Support for fast queries on the web graph
–Which URLs point to a given URL?
–Which URLs does a given URL point to?
Stores mappings in memory from URL to outlinks, URL to inlinks
Applications
–Crawl control
–Web graph analysis: connectivity, crawl optimization
–Link analysis

Adjacency lists
The set of neighbors of a node
Assume each URL is represented by an integer
E.g., for a 4 billion page web, need 32 bits per node
Naively, this demands 64 bits to represent each hyperlink
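
The arithmetic behind the 32-bit and 64-bit figures can be checked with a short sketch. The helper name `naive_graph_bytes` is hypothetical and only illustrates the naive cost of storing each hyperlink as a (source, target) pair of node ids.

```python
import math

# Figure from the slide: a web of about 4 billion pages.
pages = 4_000_000_000

# Bits needed to give every page a distinct integer id.
bits_per_node = math.ceil(math.log2(pages))

# A hyperlink stored naively is a (source id, target id) pair.
bits_per_link = 2 * bits_per_node


def naive_graph_bytes(num_links):
    """Bytes needed to store num_links hyperlinks as raw id pairs."""
    return num_links * bits_per_link // 8


print(bits_per_node, bits_per_link)
```

The compression techniques on the following slides are measured against this 64 bits per link baseline.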

Adjacency list compression
Properties exploited in compression:
–Similarity (between lists)
–Locality (many links from a page go to "nearby" pages)
–Use gap encodings in sorted lists
–Distribution of gap values
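
As a rough illustration of gap encoding in a sorted adjacency list, the sketch below stores the first id followed by the differences between consecutive ids, and writes each value with an Elias gamma code. The function names and the sample list are assumptions made for this example; the actual codes used by Boldi/Vigna are not specified in these slides.

```python
def to_gaps(sorted_ids):
    """Replace a strictly increasing id list by its first id plus gaps."""
    gaps = [sorted_ids[0]]
    for prev, cur in zip(sorted_ids, sorted_ids[1:]):
        gaps.append(cur - prev)
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def elias_gamma(n):
    """Elias gamma code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma is defined for positive integers only")
    binary = bin(n)[2:]
    # Unary length prefix (zeros) followed by the binary representation.
    return "0" * (len(binary) - 1) + binary


def encode_list(sorted_ids):
    """Concatenate the gamma codes of all gaps (ids are assumed >= 1)."""
    return "".join(elias_gamma(g) for g in to_gaps(sorted_ids))


# Hypothetical adjacency list with good locality: neighbors are close together.
neighbors = [1000, 1003, 1004, 1010]
bits = encode_list(neighbors)
print(to_gaps(neighbors), len(bits))
```

Small gaps get short codes, which is why locality and the distribution of gap values matter: the four links above take 28 bits in this scheme instead of 4 × 32 bits for raw target ids.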

Storage
A recent paper by Boldi/Vigna reports getting down to an average of ~3 bits/link
–(URL to URL edge)
–For a 118M node web graph
How? Why is this remarkable?

Main ideas of Boldi/Vigna
Consider a lexicographically ordered list of all URLs, e.g., – – – – – –

Boldi/Vigna
Each of these URLs has an adjacency list
Main thesis: because of templates, the adjacency list of a node is similar to that of one of the 7 preceding URLs in the lexicographic ordering
Express the adjacency list in terms of one of these
E.g., consider these adjacency lists
–1, 2, 4, 8, 16, 32, 64
–1, 4, 9, 16, 25, 36, 49, 64
–1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
–1, 4, 8, 16, 25, 36, 49, 64
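
In the example lists, the fourth list differs from the second only in having 8 where the second has 9. A minimal sketch of expressing one list in terms of a reference list follows. It assumes a simple scheme (one copy bit per reference entry plus a list of extra entries) and hypothetical function names; it is not the exact Boldi/Vigna format.

```python
def encode_with_reference(ref, target):
    """Describe target as: which entries of ref to copy, plus extra entries."""
    target_set, ref_set = set(target), set(ref)
    copy_mask = [1 if x in target_set else 0 for x in ref]
    extras = [x for x in target if x not in ref_set]
    return copy_mask, extras


def decode_with_reference(ref, copy_mask, extras):
    """Rebuild the sorted target list from the reference description."""
    copied = [x for x, bit in zip(ref, copy_mask) if bit]
    return sorted(copied + extras)


def pick_reference(previous_lists, target):
    """Index of the preceding list sharing the most entries with target."""
    target_set = set(target)
    overlaps = [len(target_set & set(lst)) for lst in previous_lists]
    return overlaps.index(max(overlaps))


# The four adjacency lists from the slide.
lists = [
    [1, 2, 4, 8, 16, 32, 64],
    [1, 4, 9, 16, 25, 36, 49, 64],
    [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144],
    [1, 4, 8, 16, 25, 36, 49, 64],
]
best = pick_reference(lists[:3], lists[3])
mask, extras = encode_with_reference(lists[best], lists[3])
print(best, mask, extras)
```

Here the fourth list is described by a pointer to the second list, an 8-bit copy mask, and the single extra entry 8, instead of eight full node ids.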

Connectivity Queries
Beyond adjacency we'd like to answer
–Transitive closure: is there a path from x to y?
–Distance: what is the length of the shortest path from x to y?
Applications
–Link analysis
–XML path queries with wildcards

Naïve Solutions
Given a graph, precompute:
–Compute and store all-pairs shortest paths (APSP)
–Answer any query in constant time
–Space requirements?
OR online:
–Given a query, compute single-source shortest paths (SSSP)
–No additional space
–Time to answer query?
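
The two naive options can be sketched as follows, assuming an unweighted directed graph stored as a dict of adjacency lists (the graph and the function names are made up for illustration). The precomputed table answers a query with one lookup but can hold up to n² entries; the online variant stores nothing extra but runs a breadth-first search, O(n + m) time, per query.

```python
from collections import deque


def bfs_distances(adj, source):
    """Online option: single-source shortest paths by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def all_pairs(adj):
    """Precomputed option: one BFS per node, stored as a table of tables."""
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    return {u: bfs_distances(adj, u) for u in nodes}


# Hypothetical graph: d -> a -> b -> c
graph = {"a": ["b"], "b": ["c"], "c": [], "d": ["a"]}
table = all_pairs(graph)

print(table["d"]["c"])      # distance lookup in the precomputed table
print("d" in table["c"])    # reachability test: is d reachable from c?
```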

Encoding Problem
Find a compact representation for the transitive closure
–whose size is comparable to the data's size
–that supports connection tests (almost) as fast as the naive transitive closure lookup
–that can be built efficiently for large data sets

Main Idea: 2-Hop Covers and 2-Hop Labeling
A 2-hop cover is a set of hops (x,y) such that every connected pair is covered by 2 hops
For each node a, maintain two sets of labels (which are nodes): Lin(a) and Lout(a)
For each connection (a,b),
–choose a node c on the path from a to b (center node)
–add c to Lout(a) and to Lin(b)
Then (a,b) ∈ Transitive Closure T ⟺ Lout(a) ∩ Lin(b) ≠ ∅
[Figure: path a → c → b]
Reachability and distance queries via 2-hop labels (Cohen et al., SODA 2002)
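
A reachability query against 2-hop labels is a set intersection. The sketch below assumes labels are stored as Python sets and, as a convention chosen for this example only, treats every node as implicitly contained in its own Lin and Lout so that the center may be an endpoint of the connection.

```python
def reachable(l_out, l_in, a, b):
    """True if some center node lies in both Lout(a) and Lin(b).

    Assumption of this sketch: each node is implicitly a member of its own
    label sets, so a connection may use one of its endpoints as the center.
    """
    out_a = l_out.get(a, set()) | {a}
    in_b = l_in.get(b, set()) | {b}
    return bool(out_a & in_b)


# Labels for the path a -> c -> b, with c chosen as the center of (a, b).
l_out = {"a": {"c"}}
l_in = {"b": {"c"}}

print(reachable(l_out, l_in, "a", "b"))  # covered through center c
print(reachable(l_out, l_in, "b", "a"))  # no common center
```

The cost of a query is bounded by the sizes of the two label sets involved, which is why the goal on the next slide is to keep the labels small.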

2-hop Covers
Conjecture: 2-hop covers of size O(n√m) always exist
Goal: Minimize the sum of the label sizes
Problem is NP-hard
–=> approximation required
Theorem: There exists a polynomial-time algorithm that approximates the optimal cover within a factor of O(log n)
–Greedy (set cover) algorithm

Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Initial step: All connections are uncovered
=> Consider the center graph of candidates
[Figure: center graph of a candidate center node, with input side I and output side O; the density of the densest subgraph is here the same as the initial density]
(We can cover 8 connections with 6 cover entries)

Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections
Initial step: All connections are uncovered
=> Consider the center graph of candidates
[Figure: center graph of a candidate center node, with input side I and output side O; density of densest subgraph = initial density (graph is complete)]
Cover connections in the subgraph with greatest density with the corresponding center node

Approximation Algorithm
What are good center nodes? Nodes that can cover many uncovered connections.
Next step: Some connections already covered
=> Consider the center graph of candidates
[Figure: center graph of a candidate center node, with input side I and output side O]
Repeat this algorithm until all connections are covered
Theorem: Generated cover is optimal up to a logarithmic factor
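
The greedy loop can be sketched in simplified form. This is not the algorithm of Cohen et al.: instead of extracting the densest subgraph of each center graph, it scores a candidate center by the whole set of uncovered connections passing through it divided by the label entries that would be added, so the logarithmic guarantee is not claimed for it. The graph, the function names, and the convention that a node is implicitly in its own labels are all assumptions of this example.

```python
def reachable_sets(adj):
    """Transitive closure: for each node, the other nodes reachable from it."""
    nodes = sorted(set(adj) | {v for vs in adj.values() for v in vs})
    reach = {}
    for s in nodes:
        seen, stack = set(), [s]
        while stack:
            u = stack.pop()
            for v in adj.get(u, ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        seen.discard(s)
        reach[s] = seen
    return reach


def greedy_two_hop_cover(adj):
    """Simplified greedy: pick the center with the best covered/cost ratio."""
    reach = reachable_sets(adj)
    nodes = sorted(reach)
    uncovered = {(a, b) for a in nodes for b in reach[a]}
    l_out = {v: set() for v in nodes}
    l_in = {v: set() for v in nodes}
    while uncovered:
        best = None  # (ratio, center, sources, targets, newly covered pairs)
        for c in nodes:
            ancestors = {a for a in nodes if c in reach[a]} | {c}
            descendants = reach[c] | {c}
            newly = {(a, b) for a in ancestors for b in descendants
                     if (a, b) in uncovered}
            if not newly:
                continue
            # The center itself needs no label entry (implicit self label).
            sources = {a for a, _ in newly} - {c}
            targets = {b for _, b in newly} - {c}
            ratio = len(newly) / (len(sources) + len(targets))
            if best is None or ratio > best[0]:
                best = (ratio, c, sources, targets, newly)
        _, c, sources, targets, newly = best
        for a in sources:
            l_out[a].add(c)
        for b in targets:
            l_in[b].add(c)
        uncovered -= newly
    return l_out, l_in


def reachable(l_out, l_in, a, b):
    """2-hop query with implicit self labels."""
    return bool((l_out[a] | {a}) & (l_in[b] | {b}))


# Hypothetical graph: a and b point to c, which points to d and e.
graph = {"a": ["c"], "b": ["c"], "c": ["d", "e"]}
l_out, l_in = greedy_two_hop_cover(graph)
entries = sum(len(s) for s in l_out.values()) + sum(len(s) for s in l_in.values())
print(l_out, l_in, entries)
```

On this small graph the single center c covers all 8 connections, and because the center's own entries are implicit here, the cover has 4 stored entries.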

Experimental Results
Small example from real world: subset of DBLP
–6,210 documents (publications)
–168,991 elements
–25,368 links (citations)
–14 MB (uncompressed XML)
Element-level graph has 168,991 nodes and 188,149 edges
Its transitive closure: 344,992,370 connections, 2,632.1 MB

Experimental Results
For the example above:
–Transitive closure: 344,992,370 connections
–Two-hop cover: 1,289,930 entries
=> compression factor of ~267
=> queries are still fast (~7.6 entries/node)
But: Computation took 45 hours and 80 GB RAM!
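
The quoted ratios follow directly from the counts on these slides, as the short check below shows. The 8 bytes per connection (two 32-bit node ids) is an assumption of this sketch; it happens to reproduce the 2,632.1 MB closure size when MB is read as 2^20 bytes.

```python
closure_connections = 344_992_370   # size of the transitive closure
cover_entries = 1_289_930           # size of the two-hop cover
nodes = 168_991                     # nodes in the element-level graph

compression_factor = closure_connections / cover_entries
entries_per_node = cover_entries / nodes

# Assumption: each connection is stored as two 32-bit ids (8 bytes).
closure_mb = closure_connections * 8 / 2**20

print(round(compression_factor), round(entries_per_node, 1), round(closure_mb, 1))
```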

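Why Distances are Difficult
Should be simple to add:
[Figure: path v → u → w with dist(v,u) = 2 and dist(u,w) = 4]
Reachability labels: Lout(v) = {u, …}, Lin(w) = {u, …}
=> Distance labels: Lout(v) = {(u,2), …}, Lin(w) = {(u,4), …}
Is this correct? dist(v,w) = dist(v,u) + dist(u,w) = 2 + 4 = 6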

Why Distances are Difficult
[Figure: path v → u → w with distances 2 and 4, and a direct connection from v to w with dist(v,w) = 1]
=> Center node u does not reflect the correct distance of v and w

Solution: Distance-aware Center Graph
Add edges to the center graph only if the corresponding connection is a shortest path
Correct, but with problems:
–Expensive to build the center graph (2 additional lookups per connection)
–Approximation bound is no longer tight
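
The effect of distance-aware labels can be sketched with the v, u, w example from the previous slides. The query takes the minimum of dist(v,c) + dist(c,w) over all common centers c. Labels are stored here as dicts from center to distance, with each node implicitly at distance 0 from itself; both choices are assumptions of this example.

```python
import math


def two_hop_distance(l_out, l_in, v, w):
    """Minimum of dist(v, c) + dist(c, w) over centers c common to both labels."""
    out_v = {v: 0, **l_out.get(v, {})}   # implicit self label at distance 0
    in_w = {w: 0, **l_in.get(w, {})}
    common = out_v.keys() & in_w.keys()
    return min((out_v[c] + in_w[c] for c in common), default=math.inf)


# Labels built without regard to shortest paths: (v, w) is covered only by u.
naive_out = {"v": {"u": 2}}
naive_in = {"w": {"u": 4}}
print(two_hop_distance(naive_out, naive_in, "v", "w"))  # overestimates dist(v, w)

# Distance-aware labels: (v, w) is also covered by a center on a shortest
# path, here v itself, recorded in Lin(w) with distance 1.
aware_out = {"v": {"u": 2}}
aware_in = {"w": {"u": 4, "v": 1}}
print(two_hop_distance(aware_out, aware_in, "v", "w"))
```

The query is only as good as the labels: it returns the true distance exactly when every connection has at least one center on a shortest path, which is the condition the distance-aware center graph enforces.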

Enhancements
Allow for approximate distances for more compact representations