CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.

Slides:



Advertisements
Similar presentations
For more information please send to or EFFICIENT QUERY SUBSCRIPTION PROCESSING.
Advertisements

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
CpSc 881: Information Retrieval. 2 How hard can crawling be?  Web search engines must crawl their documents.  Getting the content of the documents is.
Web Crawlers Nutch. Agenda What are web crawlers Main policies in crawling Nutch Nutch architecture.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
Distributed Computations
2/25/2004 The Google Cluster Architecture February 25, 2004.
CS246: Page Selection. Junghoo "John" Cho (UCLA Computer Science) 2 Page Selection Infinite # of pages on the Web – E.g., infinite pages from a calendar.
1 How to Crawl the Web Looksmart.com12/13/2002 Junghoo “John” Cho UCLA.
Web Search – Summer Term 2006 IV. Web Search - Crawling (c) Wolfgang Hürst, Albert-Ludwigs-University.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 20: Crawling 1.
1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
March 26, 2003CS502 Web Information Systems1 Web Crawling and Automatic Discovery Donna Bergmark Cornell Information Systems
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Parallel Crawlers Junghoo Cho, Hector Garcia-Molina Stanford University Presented By: Raffi Margaliot Ori Elkin.
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
1 Crawling the Web Discovery and Maintenance of Large-Scale Web Data Junghoo Cho Stanford University.
Distributed Computations MapReduce
How to Crawl the Web Junghoo Cho Hector Garcia-Molina Stanford University.
Web Search – Summer Term 2006 V. Web Search - Page Repository (c) Wolfgang Hürst, Albert-Ludwigs-University.
Overview of Search Engines
Search engine structure Web Crawler Page archive Page Analizer Control Query resolver ? Ranker text Structure auxiliary Indexer.
Web Crawling David Kauchak cs160 Fall 2009 adapted from:
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland MLG, January, 2014 Jaehwan Lee.
Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.
Crawlers and Spiders The Web Web crawler Indexer Search User Indexes Query Engine 1.
A Web Crawler Design for Data Mining
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
Web Categorization Crawler Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Design & Architecture Dec
Crawling Slides adapted from
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
Search - on the Web and Locally Related directly to Web Search Engines: Part 1 and Part 2. IEEE Computer. June & August 2006.
MapReduce M/R slides adapted from those of Jeff Dean’s.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.
Web Search Algorithms By Matt Richard and Kyle Krueger.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Parallel Crawlers Junghoo Cho (UCLA) Hector Garcia-Molina (Stanford) May 2002 Ke Gong 1.
David Evans CS150: Computer Science University of Virginia Computer Science Class 38: Googling.
1 University of Qom Information Retrieval Course Web Search (Spidering) Based on:
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
1 MSRBot Web Crawler Dennis Fetterly Microsoft Research Silicon Valley Lab © Microsoft Corporation.
What is Web Information retrieval from web Search Engine Web Crawler Web crawler policies Conclusion How does a web crawler work Synchronization Algorithms.
1/16/20161 Introduction to Graphs Advanced Programming Concepts/Data Structures Ananda Gunawardena.
CS 6401 Overlay Networks Outline Overlay networks overview Routing overlays Resilient Overlay Networks Content Distribution Networks.
Crawling Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 20.1, 20.2 and 20.3.
Document Clustering and Collection Selection Diego Puppin Web Mining,
How to Crawl the Web Hector Garcia-Molina Stanford University Joint work with Junghoo Cho.
Web Crawling and Automatic Discovery Donna Bergmark March 14, 2002.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
1 Crawling Slides adapted from – Information Retrieval and Web Search, Stanford University, Christopher Manning and Prabhakar Raghavan.
1 CS 430: Information Discovery Lecture 17 Web Crawlers.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
Search Engine and Optimization 1. Introduction to Web Search Engines 2.
1 Web Search Engines. 2 Search Engine Characteristics  Unedited – anyone can enter content Quality issues; Spam  Varied information types Phone book,
Indexing The World Wide Web: The Journey So Far Abhishek Das, Ankit Jain 2011 Paper Presentation : Abhishek Rangnekar 1.
Web Spam Taxonomy Zoltán Gyöngyi, Hector Garcia-Molina Stanford Digital Library Technologies Project, 2004 presented by Lorenzo Marcon 1/25.
1 Efficient Crawling Through URL Ordering Junghoo Cho Hector Garcia-Molina Lawrence Page Stanford InfoLab.
Statistics Visualizer for Crawler
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
UbiCrawler: a scalable fully distributed Web crawler
Map Reduce.
How to Crawl the Web Peking University 12/24/2003 Junghoo “John” Cho
The Anatomy of a Large-Scale Hypertextual Web Search Engine
湖南大学-信息科学与工程学院-计算机与科学系
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Web Search Engines.
The Search Engine Architecture
Anwar Alhenshiri.
Presentation transcript:

CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

CS 347Notes102 Web Search Engine Crawling Indexing Computing ranking features Serving queries

CS 347Notes103 Crawling Fetch content of web pages web init get next URL get page extract URLs seed URLs URLs to visit visited URLs web pages

CS 347Notes104 Issues Scope and freshness –Not enough space/time to crawl “all” pages –Page importance, quality, and update frequency –Site mirrors and (near) duplicate pages –Dynamic content and crawler traps Load at visited web sites –Rules in robots.txt –Limit number of visits per day –Limit depth of crawl

CS 347Notes105 Issues Load at crawler –Variance of fetch latency/bandwidth –Parallelization and scalability  Multiple agents  Partitioning URL lists  Communication between agents  Recovering from agent failure

CS 347Notes106 Crawl Partitioning Requirements –Each URL assigned to a single agent –Locally computable URL-to-agent mapping –Balanced distribution of URLs across agents –Contravariance

CS 347Notes107 Contravariance Agent A url 1 url 3 url 5 Agent B url 2 url 4 url 6 Agent A url 1 url 2 Agent B url 3 url 4 Agent C url 5 url 6

CS 347Notes108 Contravariance Agent A url 1 url 3 url 5 Agent B url 2 url 4 url 6 Agent A url 1 url 2 Agent B url 3 url 4 Agent C url 5 url 6 Agent A url 1 url 3 Agent B url 2 url 4 Agent C url 5 url 6

CS 347Notes109 Assignment Consistent hashing –Hash function: URL  agent –Each agent “replicated” k times –Each replica mapped randomly on unit circle  Mapping persistent across agent restarts –Lookup: map URL on unit circle; find closest live replica

CS 347Notes1010 Assignment A A B B url 6

CS 347Notes1011 Assignment A A B B url 6 A A B B CC Balancing Contravariance

CS 347Notes1012 Crawl Partitioning Ideas –URL normalization  E.g., relative to absolute URL –Host-based partitioning  Reduces communication between agents  Small vs. large hosts –Geographic distribution

CS 347Notes1013 Fault Tolerance Repartitioning Permanent failure –Recovering list of URLs to visit  Checkpoints  Communication logs Transient failure –Avoiding re-visiting URLs  Before fetch, check with near neighbor agents

CS 347Notes1014 Indexing Build term-document index ●●● ●● ●● ● ●● d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 t4t4 t5t5 ● t6t6 ●● tmtm Posting for t 1 Lexicon Collection

CS 347Notes1015 Architecture Web pages DistributorsIndexers Query servers Intermediate runs Inverted index files Reduce Map

CS 347Notes1016 Issues Index partitioning –Efficient query processing  Query routing  Result retrieval

CS 347Notes1017 Document Partitioning d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d1d1 d2d2 d3d3 t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d4d4 d5d5 d6d6 t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d n-2 d n-1 dndn t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm

CS 347Notes1018 Document Partitioning Split the collection of documents Advantages –Easy to add new documents –Load balanced –High processing throughput Disadvantages –Communication with all query servers

CS 347Notes1019 Term Partitioning d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t4t4 t5t5 t6t6 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t m-2 t m-1 tmtm

CS 347Notes1020 Term Partitioning Split the lexicon Advantages –Reduced communication with query servers Disadvantages –More processing before partitioning –Adding new documents is hard –Load balancing is hard –Processing throughput limited by query length

CS 347Notes1021 Advanced Partitioning Topical partitioning using clustering –Documents clustered by term-similarity –Partitions made up of one or more clusters Usage-induced partitioning –Queries extracted from logs –Documents clustered by query-similarity –Partitions made up of one or more clusters

CS 347Notes1022 Ranking Feature Computation Parallel/distributed computation tasks –Text/language processing –Document classification/clustering –Web graph analysis

CS 347Notes1023 Example: PageRank Link-based global (query-independent) importance metric Random surfer model –Start at a random page –With probability d, navigate to new page by following a random link on current page –With probability (1 – d), restart at a random page PageRank score = expected fraction of time spent at a page

CS 347Notes1024 Formula p(x) = d ∙ Σ p(y) / out(y) + (1 – d) / n yxyx

CS 347Notes1025 Formula p(x) = d ∙ Σ p(y) / out(y) + (1 – d) / n PageRank of page x yxyx PageRank of y, where y links to x Out-degree of page y Probability of random restart at x

CS 347Notes1026 Algorithm i = 0 p [i] (x) = (1 – d) / n repeat i += 1 p [i] (x) = (1 – d) / n for all y  x p [i] (x) += d ∙ p [i–1] (y) / out(y) until | p [i] – p [i–1] | < ε

CS 347Notes1027 Implementation Two vectors, current and next Initialize vectors Iterate over all pages y, distribute PageRank from current(y) to next(x) for all links y  x current = next, re-initialize next Go back to iteration over pages or stop

CS 347Notes1028 Distribution MapReduce for each iteration i Map –Take –For each y  x in edges(y) emit –Also emit Reduce –Take and –Sum (d ∙ val) into next(x), add (1 – d) / n –Emit

CS 347Notes1029 Distribution Map Reduce

CS 347Notes1030 Query Processing Locate, retrieve, process, and serve query results Inverted index files Query coordinator Query servers Cache Query Results

CS 347Notes1031 Architecture Multiple sites connected by WAN –Site = coordinator + servers + cache Partitioning –Parallel processing –Distributed storage of data –E.g., index partitioning Replication –Availability –Throughput –Response time

CS 347Notes1032 Issues Routing the query –To sites  E.g., identical sites + routing by dynamic DNS lookup –Within sites Merging the results Caching

CS 347Notes1033 Issues RoutingMerging Document partition All servers Results selected by servers; ranking by coordinator Term partition Servers containing query terms Selection and ranking by coordinator

CS 347Notes1034 Caching What to cache? –Query answers –Term postings

CS 347Notes1035 Caching What to cache? –Query answers  Faster response –Term postings  More hits Query terms repeated more frequently than whole queries

CS 347Notes1036 Caching Policy Terms most frequent in queries  high hit ratio Terms most frequent in documents  require more cache space (longer postings) Use static caching based on query/document frequency ratio

CS 347Notes1037 Summary Crawling –Partitioning: balancing and contravariance –Consistent hashing Indexing –Document, term, topical, and usage-induced partitioning Computing ranking features –PageRank with MapReduce Serving queries –Routing queries, merging results, and caching postings