An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto.


1 An Application of Graphs: Search Engines (most material adapted from slides by Peter Lee) Slides by Laurie Hiyakumoto

2 Search Engines

3 What Are They?
- Tools for finding information on the Web
  - Problem: “hidden” databases, e.g., the New York Times (i.e., databases hosted by the web site itself; these cannot be accessed by Yahoo, Google, etc.)
- Based on a machine-constructed index of Web contents (usually contains keywords found in the documents)
- Directory of search engines: www.searchenginecolossus.com
- Search engine statistics: www.searchengineshowdown.com, www.searchenginewatch.com

4 What They Do
1. Acquire the document collection, e.g., web documents (off-line)
2. Create and save an inverted index (off-line)
3. Match queries to documents (on-line; the actual retrieval)
4. Present the results to the user (on-line; may include summarization, extraction, translation)

5 Typical Architecture
- Spider
  - Crawls the web to find pages by following hyperlinks
  - Ongoing process; never catches up
- Indexer
  - Produces the data structures for fast searching of all words in the pages (i.e., it updates the lexicon)
- Retrieval System
  - User interface and query language
  - Performs database lookup to find documents likely to be relevant
  - Document “relevance” based on a ranking heuristic
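The spider's "follow the hyperlinks" step is a graph traversal. A minimal sketch, assuming a breadth-first strategy and a tiny in-memory link graph in place of real HTTP fetches; `TOY_WEB`, `crawl`, and the page names are hypothetical and do not come from the slides:

```python
from collections import deque

# Hypothetical in-memory "web": page name -> pages it links to.
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["home", "story"],
    "about": [],
    "story": ["news"],
    "orphan": ["home"],  # no page links here, so a spider starting at "home" never finds it
}

def crawl(web, seed):
    """Return the pages reachable from seed, in breadth-first discovery order."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in web.get(page, []):
            if target not in seen:  # visit each page once, even if many pages link to it
                seen.add(target)
                queue.append(target)
    return order

visited = crawl(TOY_WEB, "home")
```

The `orphan` page illustrates why coverage is hard: a page that nothing links to cannot be reached by link-following alone.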

6 Did you know?
- The concept of a Web spider was developed by Dr. Michael L. (Fuzzy) Mauldin
- Implemented in 1994 on the Web
- Went into the creation of Lycos
- Tangible evidence of commercial success: Newell-Simon Hall

7 Did you know?
- Developed here at CMU by Prof. Raul Valdes-Perez and a group of graduate students in 2000
- Queries other web search engines and clusters documents into categories based on content

8 A look at Google
- 10,000+ Linux servers!
- Supports searches in 104 different languages
- Receives millions of searches per day
- Spiders and indexes over 8 billion documents (updated monthly), encompassing HTML and 12 other file formats (e.g., *.pdf, *.ps, *.doc)
- PageRank algorithm estimates “importance” based on link counts
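The slide says only that PageRank estimates importance from link counts. As a rough illustration, the sketch below uses the commonly published simplified formulation (power iteration with a damping factor); the 0.85 damping value, the iteration count, the handling of pages without outgoing links, and the toy graph are all assumptions, not a description of Google's production system:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. links maps page -> list of pages it links to;
    every link target must also appear as a key."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical four-page graph.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_links)
```

In this toy graph `c` has the most inbound links and ends up with the highest score, while `d`, which nothing links to, ends up with the lowest.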

9 Google’s server farm

10 Why Spider the Web? User Perceptions
- Most annoying: Search engine finds nothing (too small an index; less of an issue since 1997 or so)
- Somewhat annoying: Obsolete links
  - Must regularly identify and delete dead links (Google also caches many pages)
  - Done every 1-2 weeks in best engines
- Mildly annoying: Failure to find new site
  - Re-spider “entire” web
  - Done every 2-4 weeks in best engines

11 Cost of Spidering
- Semi-parallel algorithmic decomposition
  - Spider can (and does) run on hundreds of servers simultaneously
- Very high network connectivity
- Servers can migrate from spidering to query processing depending on time-of-day load
- Running a full web spider takes days even with hundreds of dedicated servers

12 Current Status of Web Spiders
Enhanced Spidering
- Link counts for pages can be established during spidering
Unsolved Problems
- Most spidering re-traverses a stable web graph; how to do on-demand re-spidering when changes occur?
- Achieving complete or near-complete coverage is still a major issue
- Cannot spider information stored in local databases
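Establishing link counts during spidering can be pictured as tallying each outgoing link at the moment its source page is processed. A minimal sketch, assuming a breadth-first spider over a hypothetical in-memory link graph; `crawl_with_link_counts` and the page names are illustrative only:

```python
from collections import Counter, deque

def crawl_with_link_counts(web, seed):
    """Traverse from seed and count, for each page, how many crawled pages link to it."""
    seen = {seed}
    queue = deque([seed])
    inbound = Counter()
    while queue:
        page = queue.popleft()
        for target in set(web.get(page, [])):  # count each linking page at most once
            inbound[target] += 1
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return inbound

# Hypothetical link graph: page -> pages it links to.
toy_web = {
    "home": ["news", "about"],
    "news": ["home", "story", "about"],
    "about": [],
    "story": ["news"],
}
link_counts = crawl_with_link_counts(toy_web, "home")
```

Only links found on pages the spider actually reached are counted, so incomplete coverage also means incomplete link counts.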

13 An Inverted Index
- Data structure to permit fast searching
- [Figure: a LEXICON of TERM entries pointing into an INDEX whose entries hold DOCID, OCCUR, POS 1, POS 2, ...]
- Example: “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56, ...
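The structure on this slide (term, then document ID, occurrence count, and positions) can be sketched with nested dictionaries. This is a minimal illustration, assuming whitespace tokenization and lowercasing; `build_inverted_index` and the three toy documents are hypothetical, and the occurrence count is simply the length of each position list:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs maps doc_id -> text. Returns term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            # Record every position at which the term occurs in this document.
            index[term].setdefault(doc_id, []).append(pos)
    return index

# Hypothetical toy collection.
toy_docs = {
    1: "jezebel met ahab and jezebel ruled",
    2: "ahab ruled",
    3: "jezebel ruled",
}
index = build_inverted_index(toy_docs)
postings = index["jezebel"]  # documents containing the term, with positions
occurrences = {doc_id: len(positions) for doc_id, positions in postings.items()}
```

A query for a term is then a single dictionary lookup rather than a scan of every document, which is the reason for building the index off-line.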

14 Ranking (Scoring) Documents
Must display “hits” in some order... how to choose? e.g., “relevance”, recency, popularity, reliability
Some ranking heuristics
- Presence of search terms in title of document
- Proximity of search terms to start of document
- Search term occurrences within a document and the inverse frequency of a search term in a collection (common terms given less weight)
- Link popularity (how many pages point to this one)
Challenges
- User queries often provide very limited information
- Tradeoff exists between precision and recall
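The third heuristic above (term occurrences weighted by inverse collection frequency) is commonly called TF-IDF. The sketch below assumes one simple variant, raw term count multiplied by log(N / document frequency); the slides do not specify a formula, and `tf_idf_scores` and the toy documents are hypothetical:

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of count * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: number of documents containing the term.
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df == 0:
                continue  # term appears nowhere, so it contributes nothing
            score += counts[term] * math.log(n_docs / df)
        scores[doc_id] = score
    return scores

# Hypothetical toy collection.
toy_docs = {
    "a": "graph search on the web graph",
    "b": "the web is large",
    "c": "the cat sat on the mat",
}
scores = tf_idf_scores("the graph", toy_docs)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Because "the" appears in every document its weight is log(1) = 0, so only the rarer term "graph" affects the ranking, which is the "common terms given less weight" behaviour the slide describes.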

15 Search Engine Sizes
Source: www.searchenginewatch.com
Legend: AV = Altavista, FAST = FAST, GG = Google, INK = Inktomi, NL = Northern Light

