Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.

Similar presentations


Presentation on theme: "CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi."— Presentation transcript:

1 CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi

2 CS 347Notes102 Web Search Engine Crawling Indexing Computing ranking features Serving queries

3 CS 347Notes103 Crawling Fetch content of web pages web init get next URL get page extract URLs seed URLs URLs to visit visited URLs web pages

4 CS 347Notes104 Issues Scope and freshness –Not enough space/time to crawl “all” pages –Page importance, quality, and update frequency –Site mirrors and (near) duplicate pages –Dynamic content and crawler traps Load at visited web sites –Rules in robots.txt –Limit number of visits per day –Limit depth of crawl

5 CS 347Notes105 Issues Load at crawler –Variance of fetch latency/bandwidth –Parallelization and scalability  Multiple agents  Partitioning URL lists  Communication between agents  Recovering from agent failure

6 CS 347Notes106 Crawl Partitioning Requirements –Each URL assigned to a single agent –Locally computable URL-to-agent mapping –Balanced distribution of URLs across agents –Contravariance

7 CS 347Notes107 Contravariance Agent A url 1 url 3 url 5 Agent B url 2 url 4 url 6 Agent A url 1 url 2 Agent B url 3 url 4 Agent C url 5 url 6

8 CS 347Notes108 Contravariance Agent A url 1 url 3 url 5 Agent B url 2 url 4 url 6 Agent A url 1 url 2 Agent B url 3 url 4 Agent C url 5 url 6 Agent A url 1 url 3 Agent B url 2 url 4 Agent C url 5 url 6

9 CS 347Notes109 Assignment Consistent hashing –Hash function: URL  agent –Each agent “replicated” k times –Each replica mapped randomly on unit circle  Mapping persistent across agent restarts –Lookup: map URL on unit circle; find closest live replica

10 CS 347Notes1010 Assignment A A B B url 6

11 CS 347Notes1011 Assignment A A B B url 6 A A B B CC Balancing Contravariance

12 CS 347Notes1012 Crawl Partitioning Ideas –URL normalization  E.g., relative to absolute URL –Host-based partitioning  Reduces communication between agents  Small vs. large hosts –Geographic distribution

13 CS 347Notes1013 Fault Tolerance Repartitioning Permanent failure –Recovering list of URLs to visit  Checkpoints  Communication logs Transient failure –Avoiding re-visiting URLs  Before fetch, check with near neighbor agents

14 CS 347Notes1014 Indexing Build term-document index ●●● ●● ●● ● ●● d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 t4t4 t5t5 ● t6t6 ●● tmtm Posting for t 1 Lexicon Collection

15 CS 347Notes1015 Architecture Web pages DistributorsIndexers Query servers Intermediate runs Inverted index files Reduce Map

16 CS 347Notes1016 Issues Index partitioning –Efficient query processing  Query routing  Result retrieval

17 CS 347Notes1017 Document Partitioning d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d1d1 d2d2 d3d3 t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d4d4 d5d5 d6d6 t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d n-2 d n-1 dndn t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm

18 CS 347Notes1018 Document Partitioning Split the collection of documents Advantages –Easy to add new documents –Load balanced –High processing throughput Disadvantages –Communication with all query servers

19 CS 347Notes1019 Term Partitioning d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 t4t4 t5t5 t6t6 tmtm d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t1t1 t2t2 t3t3 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t4t4 t5t5 t6t6 d1d1 d2d2 d3d3 d4d4 d5d5 d6d6 dndn t m-2 t m-1 tmtm

20 CS 347Notes1020 Term Partitioning Split the lexicon Advantages –Reduced communication with query servers Disadvantages –More processing before partitioning –Adding new documents is hard –Load balancing is hard –Processing throughput limited by query length

21 CS 347Notes1021 Advanced Partitioning Topical partitioning using clustering –Documents clustered by term-similarity –Partitions made up of one or more clusters Usage-induced partitioning –Queries extracted from logs –Documents clustered by query-similarity –Partitions made up of one or more clusters

22 CS 347Notes1022 Ranking Feature Computation Parallel/distributed computation tasks –Text/language processing –Document classification/clustering –Web graph analysis

23 CS 347Notes1023 Example: PageRank Link-based global (query-independent) importance metric Random surfer model –Start at a random page –With probability d, navigate to new page by following a random link on current page –With probability (1 – d), restart at a random page PageRank score = expected fraction of time spent at a page

24 CS 347Notes1024 Formula p(x) = d ∙ Σ p(y) / out(y) + (1 – d) / n yxyx

25 CS 347Notes1025 Formula p(x) = d ∙ Σ p(y) / out(y) + (1 – d) / n PageRank of page x yxyx PageRank of y, where y links to x Out-degree of page y Probability of random restart at x

26 CS 347Notes1026 Algorithm i = 0 p [i] (x) = (1 – d) / n repeat i += 1 p [i] (x) = (1 – d) / n for all y  x p [i] (x) += d ∙ p [i–1] (y) / out(y) until | p [i] – p [i–1] | < ε

27 CS 347Notes1027 Implementation Two vectors, current and next Initialize vectors Iterate over all pages y, distribute PageRank from current(y) to next(x) for all links y  x current = next, re-initialize next Go back to iteration over pages or stop

28 CS 347Notes1028 Distribution MapReduce for each iteration i Map –Take –For each y  x in edges(y) emit –Also emit Reduce –Take and –Sum (d ∙ val) into next(x), add (1 – d) / n –Emit

29 CS 347Notes1029 Distribution Map Reduce

30 CS 347Notes1030 Query Processing Locate, retrieve, process, and serve query results Inverted index files Query coordinator Query servers Cache Query Results

31 CS 347Notes1031 Architecture Multiple sites connected by WAN –Site = coordinator + servers + cache Partitioning –Parallel processing –Distributed storage of data –E.g., index partitioning Replication –Availability –Throughput –Response time

32 CS 347Notes1032 Issues Routing the query –To sites  E.g., identical sites + routing by dynamic DNS lookup –Within sites Merging the results Caching

33 CS 347Notes1033 Issues RoutingMerging Document partition All servers Results selected by servers; ranking by coordinator Term partition Servers containing query terms Selection and ranking by coordinator

34 CS 347Notes1034 Caching What to cache? –Query answers –Term postings

35 CS 347Notes1035 Caching What to cache? –Query answers  Faster response –Term postings  More hits Query terms repeated more frequently than whole queries

36 CS 347Notes1036 Caching Policy Terms most frequent in queries  high hit ratio Terms most frequent in documents  require more cache space (longer postings) Use static caching based on query/document frequency ratio

37 CS 347Notes1037 Summary Crawling –Partitioning: balancing and contravariance –Consistent hashing Indexing –Document, term, topical, and usage-induced partitioning Computing ranking features –PageRank with MapReduce Serving queries –Routing queries, merging results, and caching postings


Download ppt "CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi."

Similar presentations


Ads by Google