1 Intelligent Crawling Junghoo Cho Hector Garcia-Molina Stanford InfoLab.

Presentation transcript:

1 Intelligent Crawling Junghoo Cho Hector Garcia-Molina Stanford InfoLab

2 What is a crawler? Program that automatically retrieves pages from the Web. Widely used for search engines.

3 Challenges: There are many pages on the Web (major search engines have indexed more than 100M pages). The size of the Web is growing enormously. Most pages are not very interesting. Consequently, in most cases it is too costly or not worthwhile to visit the entire Web space.

4 Good crawling strategy: Make the crawler visit “important pages” first. This saves network bandwidth, saves storage space and management cost, and serves quality pages to the client application.

5 Outline: Importance metrics (what are important pages?); crawling models (how is a crawler evaluated?); experiments; conclusion and future work.

6 Importance metric: The metric for determining whether a page is HOT. Candidates: similarity to a driving query, location metric, backlink count, PageRank.

7 Similarity to a driving query: Importance is measured by the closeness of the page to the topic (e.g. the number of occurrences of the topic word in the page). This suits a personalized crawler. Example topics: “Sports”, “Bill Clinton”; the important pages are those related to the specific topic.
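The word-count example on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `topic_similarity` and the tokenization rule are hypothetical, and the sketch handles a single topic word, not a phrase such as “Bill Clinton”.

```python
import re


def topic_similarity(page_text: str, topic_word: str) -> int:
    """Count case-insensitive occurrences of one topic word in a page.

    Hypothetical helper: the slide only says importance can be "the number
    of the topic word in the page"; the tokenization here is an assumption.
    """
    # Split the page into lowercase alphanumeric tokens.
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    return tokens.count(topic_word.lower())


score = topic_similarity("Sports news: sports scores and more Sports.", "sports")
```

A page with a higher count would be treated as closer to the driving query.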

8 Importance metric: The metric for determining whether a page is HOT. Candidates: similarity to a driving query, location metric, backlink count, PageRank.

9 Backlink-based metric. Backlink count: the number of pages pointing to the page (a citation metric). PageRank: a weighted backlink count, where the weight is iteratively defined.

10 [Figure: example link graph over pages A, B, C, D, E, F] BackLinkCount(F) = 2; PageRank(F) = PageRank(E)/2 + PageRank(C)
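A minimal Python sketch of the two backlink-based metrics follows. The edges in `GRAPH` are hypothetical, since the figure itself is not recoverable from the transcript. They are chosen only so that F has two backlinks (from C and E), E has two outgoing links and C has one, matching the slide's formula. The damping factor is also an assumption: the slide shows the simplified, undamped form.

```python
# Hypothetical link graph (page -> pages it links to). Assumes every page
# has at least one outgoing link and every link target is a key.
GRAPH = {
    "A": ["B", "D"],
    "B": ["C"],
    "C": ["F"],
    "D": ["E"],
    "E": ["F", "A"],
    "F": ["A"],
}


def backlink_counts(graph):
    """Number of pages pointing to each page."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts


def pagerank(graph, damping=0.85, iterations=100):
    """Iteratively defined weighted backlink count (damping is an assumption)."""
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in graph}
        for page, targets in graph.items():
            # Each page splits its current rank evenly among its out-links.
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank


counts = backlink_counts(GRAPH)
ranks = pagerank(GRAPH)
```

At convergence, `ranks["F"]` equals `(1 - damping) / 6 + damping * (ranks["E"] / 2 + ranks["C"])`, which is the slide's formula with the damping term added.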

11 Ordering metric: The metric a crawler uses to “estimate” the importance of a page. The ordering metric can be different from the importance metric.

12 Crawling models. Crawl and stop: keep crawling until the local disk space is full. Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

Crawl and stop model
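The crawl-and-stop model can be sketched as a loop that repeatedly picks the frontier URL with the best ordering-metric value and stops once storage is full. Everything named here is an assumption for illustration: the in-memory `WEB` dictionary stands in for real fetching, `capacity` stands in for "local disk space", and the two ordering functions are simplified versions of breadth-first order and backlink count as known so far.

```python
# Hypothetical in-memory "web": page -> outgoing links.
WEB = {
    "A": ["B", "C", "D"],
    "B": ["D"],
    "C": ["E"],
    "D": ["F"],
    "E": [],
    "F": [],
}


def breadth_first(url, backlinks_seen):
    # Constant score: max() then returns the oldest frontier entry (FIFO).
    return 0


def by_backlink_count(url, backlinks_seen):
    # Estimate importance from the links discovered so far.
    return backlinks_seen.get(url, 0)


def crawl_and_stop(web, seed, capacity, ordering):
    """Visit pages in ordering-metric order until `capacity` pages are stored."""
    visited = []
    frontier = [seed]
    backlinks_seen = {}
    while frontier and len(visited) < capacity:
        url = max(frontier, key=lambda u: ordering(u, backlinks_seen))
        frontier.remove(url)
        visited.append(url)
        for target in web.get(url, []):
            backlinks_seen[target] = backlinks_seen.get(target, 0) + 1
            if target not in visited and target not in frontier:
                frontier.append(target)
    return visited


bfs_order = crawl_and_stop(WEB, "A", 3, breadth_first)
backlink_order = crawl_and_stop(WEB, "A", 3, by_backlink_count)
```

On this toy graph the two orderings store different pages: breadth-first keeps A, B, C, while the backlink ordering reaches D earlier because two visited pages already point to it.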

14 Crawling models. Crawl and stop: keep crawling until the local disk space is full. Limited buffer crawl: keep crawling until the whole Web space is visited, throwing out seemingly unimportant pages.

Limited buffer model
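One way to picture the limited buffer model is a fixed-size store that evicts the least important page whenever a more important one arrives. The sketch below is an assumption-laden illustration, not the presenters' implementation: `limited_buffer_crawl` is a hypothetical name, and the importance scores are supplied directly instead of being estimated by an ordering metric.

```python
import heapq


def limited_buffer_crawl(pages, buffer_size):
    """Keep at most `buffer_size` pages, discarding the least important.

    `pages` is an iterable of (url, importance) pairs in crawl order.
    Returns the retained (importance, url) pairs, most important first.
    """
    buffer = []  # min-heap, so buffer[0] is the least important page kept
    for url, importance in pages:
        if len(buffer) < buffer_size:
            heapq.heappush(buffer, (importance, url))
        elif importance > buffer[0][0]:
            # Throw out the seemingly least important page.
            heapq.heapreplace(buffer, (importance, url))
    return sorted(buffer, reverse=True)


crawl_order = [("a", 3), ("b", 1), ("c", 5), ("d", 2), ("e", 4)]
kept = limited_buffer_crawl(crawl_order, buffer_size=3)
```

Unlike crawl and stop, every page is visited; only the buffer contents are limited.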

16 Architecture [Diagram]. Components: Repository, URL selector, Virtual Crawler, HTML parser, URL pool, Page Info, WebBase Crawler, Stanford WWW. Labeled data flows: crawled page, extracted URL, page info, selected URL.

17 Experiments. Backlink-based importance metrics: backlink count, PageRank. Similarity-based importance metric: similarity to a query word.

18 Ordering metrics in experiments: breadth-first order, backlink count, PageRank.

20 Similarity-based crawling: The content of a page is not available before it is visited, so the crawler essentially has to “guess” the content of the page. This makes it more difficult than backlink-based crawling.

21 Promising page [Figure: three cues that an unvisited page may be about “Sports”: anchor text “Sports!!”, a HOT parent page, and a URL ending in …/sports.html]

22 Virtual crawler for similarity-based crawling. A page is “promising” if the query word appears in its anchor text, the query word appears in its URL, or the page pointing to it is an “important” page. Visit “promising pages” first; then visit “non-promising pages” in ordering-metric order.
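The promising-page rule on this slide might be sketched as follows. The field names (`url`, `anchor_text`, `parent_is_important`, `metric`) and the example URLs are hypothetical. Sorting the promising pages among themselves by the ordering metric is also an assumption, since the slide only specifies the order for non-promising pages.

```python
def is_promising(url, anchor_text, parent_is_important, query):
    """True if any of the slide's three cues holds for an unvisited page."""
    q = query.lower()
    return q in anchor_text.lower() or q in url.lower() or bool(parent_is_important)


def order_frontier(candidates, query):
    """Promising pages first, then the rest, each group by descending metric."""
    def sort_key(c):
        promising = is_promising(
            c["url"], c["anchor_text"], c["parent_is_important"], query
        )
        # False sorts before True, so promising pages come first.
        return (not promising, -c["metric"])

    return sorted(candidates, key=sort_key)


# Hypothetical frontier entries; "metric" is the ordering-metric value.
candidates = [
    {"url": "http://example.com/news.html", "anchor_text": "World news",
     "parent_is_important": False, "metric": 9},
    {"url": "http://example.com/sports.html", "anchor_text": "click here",
     "parent_is_important": False, "metric": 1},
    {"url": "http://example.com/a.html", "anchor_text": "Sports!!",
     "parent_is_important": False, "metric": 3},
    {"url": "http://example.com/b.html", "anchor_text": "more",
     "parent_is_important": True, "metric": 2},
    {"url": "http://example.com/c.html", "anchor_text": "misc",
     "parent_is_important": False, "metric": 4},
]
ordered_urls = [c["url"] for c in order_frontier(candidates, "sports")]
```

Note that `news.html` has the highest ordering metric but is still visited after all three promising pages.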

24 Conclusion PageRank is generally good as an ordering metric. By applying a good ordering metric, it is possible to gather important pages quickly.