Web Crawling for Search: What’s hard after ten years? Raymie Stata, Chief Architect, Yahoo! Search and Marketplace.


1 Web Crawling for Search: What’s hard after ten years?
Raymie Stata
Chief Architect, Yahoo! Search and Marketplace

2 Agenda
Introduction
What makes crawling hard for “beginners”
What remains hard for “experts”

3 Introduction
Web “crawling” is the primary means of obtaining data for search engines
–Tens of billions of pages downloaded
–Hundreds of billions of pages “known”
–Average page <10 days old
Web crawling is as old as the Web
–“Large scale” crawling is about ten years old
Lots published, but “secret sauce” still exists
Must support RCF
–Relevance, comprehensiveness, freshness

4 Components of a crawler
[Diagram; component labels: Downloaders, Web DB, Page processing, Page storage, Prioritization, Enrichment, Feeds, Click streams, I’net*]
* Internet = DNS as well as HTTP

5 Baseline challenges: overall scale
100s of machines dedicated to each component
Must be good at logistics (purchasing and deployment), operations, distributed programming (fault tolerance included), …

6 Baseline challenges: downloaders
DNS scaling (multi-threading)
Bandwidth
–Async I/O vs. threads
–Clustering/distribution
Non-conformance
Politeness
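The slide names politeness as a downloader challenge without saying how it is enforced. One common approach is to space out requests to the same host; the sketch below illustrates that idea only. The class name `PoliteFrontier`, the fixed per-host `delay`, and the caller-supplied clock value `now` are all assumptions made for illustration, not details from the talk. It also assumes absolute URLs that include a host name.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Hypothetical URL frontier: each host is contacted at most once per `delay` seconds."""

    def __init__(self, delay=1.0):
        self.delay = delay
        self.pending = {}   # host -> deque of URLs waiting to be fetched
        self.next_ok = {}   # host -> earliest time the host may be contacted again
        self.heap = []      # (earliest allowed time, host), only for hosts with pending URLs

    def add(self, url):
        host = urlsplit(url).hostname
        queue = self.pending.setdefault(host, deque())
        if not queue:
            # Host has nothing pending, so it is not in the heap yet: schedule it.
            heapq.heappush(self.heap, (self.next_ok.get(host, 0.0), host))
        queue.append(url)

    def next_url(self, now):
        """Return a URL whose host may be contacted at time `now`, or None."""
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        queue = self.pending[host]
        url = queue.popleft()
        self.next_ok[host] = now + self.delay
        if queue:
            # More work for this host: reschedule it after the politeness delay.
            heapq.heappush(self.heap, (self.next_ok[host], host))
        return url
```

A downloader loop would call `next_url(time.monotonic())` and sleep briefly whenever it returns None. A production crawler would presumably also honor robots.txt and adapt the delay per site, which this sketch leaves out.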

7 Baseline challenges: page processing
File-cracking
–HTML, Word, PDF, JPG, MPEG, …
Non-conformance
Higher-level processing
–JavaScript, sessions, information extraction, …

8 Baseline challenges: Web DB and enrichment
Scale
–Update rate
–Extraction rate
Duplication detection
Alias detection
Checkpoints

9 Baseline challenges: prioritization
Quality ranking
Spam and crawler traps

10 Evergreen problems
Relevance
–Page quality, spam
  –Page processing, prioritization techniques
Comprehensiveness
–Sheer scale
  –Sheer machine count (expensive)
  –Scaling of the Web DB
–Deep Web, information extraction
  –Page processing
Freshness
–Discovery, frequency, “long tail”

11 Web DB: more details
For each URL, the Web DB contains:
–In- and outlinks
–Anchor text
–Various dates: last downloaded, last changed, …
–“Decorations” from various processors
  –Language, topic, spam scores, term-vectors, fingerprints, “shingleprints,” many more…
Subset of the above stored for several instances
–That is, we keep track of the history of a page
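The slide lists fingerprints and “shingleprints” among the per-URL decorations but does not define them. As a rough illustration of the general idea, assuming the usual meaning of the terms, the sketch below computes a content hash for exact-duplicate detection and a word-shingle resemblance score for near-duplicate detection. The function names, the shingle width `w=3`, and the use of SHA-1 are illustrative assumptions, not the Web DB's actual method.

```python
import hashlib


def fingerprint(text):
    """Hash of case- and whitespace-normalized text; equal for exact duplicates."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def shingles(text, w=3):
    """Set of all runs of w consecutive words (empty if the text is shorter than w words)."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 0))}


def resemblance(a, b, w=3):
    """Jaccard overlap of the two shingle sets: 1.0 for identical, 0.0 for disjoint."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

For example, two nine-word sentences that differ only in the last word share 6 of 8 distinct shingles, so `resemblance` returns 0.75. At Web scale one would store a small sample of shingle hashes per page rather than the full set; that compaction is omitted here.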

12 Web DB: update volume
When a page is downloaded, we need to update inlink and anchor-text info for each page it points to
A page has ~20 outlinks on it
We download 1,000s of pages per second
At peak, need well over 100K updates/sec
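The arithmetic behind the last figure can be made explicit. The snippet below only restates the slide's numbers: updates per second is pages per second times outlinks per page. The slide does not give an exact peak download rate; 5,000 pages/sec is the illustrative rate at which ~20 outlinks per page yields exactly 100K updates/sec, so “well over 100K” implies a peak above that.

```python
OUTLINKS_PER_PAGE = 20  # "~20 outlinks" from the slide


def updates_per_second(pages_per_second, outlinks_per_page=OUTLINKS_PER_PAGE):
    """Each downloaded page triggers one Web DB update per outlink."""
    return pages_per_second * outlinks_per_page


def pages_per_second_for(updates, outlinks_per_page=OUTLINKS_PER_PAGE):
    """Download rate needed to generate a given update rate."""
    return updates / outlinks_per_page


# Illustrative rate, not a figure from the talk.
print(updates_per_second(5_000))
print(pages_per_second_for(100_000))
```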

13 Web DB: scaling techniques
Perform updates in large batches
Solves bandwidth problems…
…but introduces latency problems
–In particular: time to discover new links
Solve latency with “short-circuit” for discovery
–But this bypasses the full prioritization logic, which introduces quality problems that need to be solved with more special solutions, and before long, Oi, it’s all getting very complicated…
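The batching and short-circuit ideas above can be sketched in a few lines. This is a toy model under stated assumptions: an in-memory dict stands in for the Web DB, and `BatchedLinkUpdater`, `record_link`, and the `on_discover` callback are hypothetical names, not the system described in the talk. Link records are buffered and written grouped by target URL, so each target is touched once per batch, while never-before-seen targets are reported immediately instead of waiting for the next flush.

```python
from collections import defaultdict


class BatchedLinkUpdater:
    """Toy model: buffer inlink updates, flush in batches, short-circuit discovery."""

    def __init__(self, batch_size, on_discover):
        self.batch_size = batch_size
        self.on_discover = on_discover  # hypothetical hook for the discovery short-circuit
        self.web_db = {}                # target URL -> list of (source URL, anchor text)
        self.seen = set()               # targets already reported to discovery
        self.buffer = []                # pending (source, target, anchor) records

    def record_link(self, source, target, anchor):
        if target not in self.seen:
            # Short-circuit: announce new URLs right away, bypassing the batch.
            self.seen.add(target)
            self.on_discover(target)
        self.buffer.append((source, target, anchor))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Apply buffered updates grouped by target; return the number of rows written."""
        grouped = defaultdict(list)
        for source, target, anchor in self.buffer:
            grouped[target].append((source, anchor))
        for target, inlinks in grouped.items():
            self.web_db.setdefault(target, []).extend(inlinks)
        self.buffer.clear()
        return len(grouped)
```

Note that the short-circuit hands URLs to `on_discover` with no quality signal attached, which is the trade-off the slide warns about: whatever consumes those URLs has to work without the full prioritization logic.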

14 DHTML: the enemy of crawling
Increasing use of client-side scripting (aka DHTML) is making more of the Web opaque to crawlers
–AJAX: Asynchronous JavaScript and XML (The end of crawling?)
Not (yet) a major barrier to Web search, but is a barrier to shopping and other specialized search, where we also have to deal with:
–Form-filling and sessions
–Information extraction

15 Conclusions
Large-scale Web crawling not trivial
Smart, well-funded people could figure it out from the literature
But secret sauce remains in:
–Prioritization
–Scaling the Web DB
–JavaScript, form-filling, information extraction

16 The future
Will life get easier?
–Ping plus feeds
Will life get harder?
–DHTML -> Ajax -> Avalon
A little bit of both?
–Publishers regain control
–But, net, comprehensiveness improves

