Web Search Engines
Database Management Systems, Chapter 27, Part C (R. Ramakrishnan)
Based on Larson and Hearst's slides at UC Berkeley

Search Engine Characteristics
- Unedited: anyone can enter content, so there are quality issues and spam
- Varied information types: phone books, brochures, catalogs, dissertations, news reports, weather, all in one place!
- Different kinds of users: Lexis-Nexis serves paying, professional searchers; online catalogs serve scholars searching scholarly literature; the Web serves every type of person with every type of goal
- Scale: hundreds of millions of searches per day over billions of documents

Web Search Queries
- Web search queries are short: about 2.4 words on average (Aug 2000), up from about 1.7 words (circa 1997)
- User expectations: many users say "The first item shown should be what I want to see!" This works if the user has the most popular/common interpretation of the query in mind, and not otherwise.

Directories vs. Search Engines
Directories:
- Hand-selected sites
- Search over the contents of the descriptions of the pages
- Organized in advance into categories
Search Engines:
- All pages in all sites
- Search over the contents of the pages themselves
- Organized in response to a query, by relevance rankings or other scores

What about Ranking?
- Lots of variation here; often messy, and the details are proprietary and fluctuating
- Rankings combine subsets of:
  - IR-style relevance: based on term frequencies, proximities, position (e.g., in the title), font, etc.
  - Popularity information
  - Link analysis information
- Most engines use a variant of vector space ranking to combine these. Here's how it might work: make a vector of weights for each feature, then multiply it by the counts for each feature (see the sketch below).
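Below is a minimal sketch of that weighted combination. The feature names, weights, and scores are illustrative assumptions, not values from any real engine:

```python
# Combine per-document feature scores with a fixed weight vector (a dot product).
# All names and numbers here are made up for illustration.
FEATURE_WEIGHTS = {
    "ir_relevance": 1.0,   # e.g., term-frequency-based vector-space score
    "popularity":   0.5,   # e.g., visit counts
    "link_score":   2.0,   # e.g., link-analysis score such as PageRank
}

def combined_score(features: dict) -> float:
    """Multiply the weight vector by the document's feature counts and sum."""
    return sum(w * features.get(name, 0.0) for name, w in FEATURE_WEIGHTS.items())

docs = {
    "d1": {"ir_relevance": 0.8, "popularity": 0.1, "link_score": 0.3},
    "d2": {"ir_relevance": 0.6, "popularity": 0.9, "link_score": 0.7},
}
ranked = sorted(docs, key=lambda d: combined_score(docs[d]), reverse=True)
print(ranked)   # ['d2', 'd1'] -- documents ordered by the combined score
```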

Relevance: Going Beyond IR
- Page popularity (e.g., DirectHit): frequently visited pages in general, and frequently visited pages as a result of a query
- Link co-citation (e.g., Google): which sites are linked to by other sites? Draws upon sociology research on bibliographic citations to identify authoritative sources
- Discussed further in the Google case study

Web Search Architecture

Standard Web Search Engine Architecture
(Diagram: a crawler crawls the web; documents are checked for duplicates and stored; an inverted index is created; search engine servers take the user query, look up DocIds in the inverted index, and show results to the user.)

Inverted Indexes the IR Way

How Inverted Files Are Created
- Periodically rebuilt, static otherwise
- Documents are parsed to extract tokens, which are saved with the document ID
Example documents:
- Doc 1: "Now is the time for all good men to come to the aid of their country"
- Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

How Inverted Files are Created
After all documents have been parsed, the inverted file is sorted alphabetically.

How Inverted Files are Created
Multiple term entries for a single document are merged, and within-document term frequency information is compiled.

How Inverted Files are Created
Finally, the file can be split into a dictionary (lexicon) file and a postings file.

How Inverted Files are Created
(Diagram: the dictionary/lexicon file alongside the postings file.)
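The build process on the last few slides can be sketched end to end in a few lines. The tokenization, the in-memory dictionaries, and the example output are simplifying assumptions rather than the book's exact file formats:

```python
import re
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# 1. Parse documents into (term, doc_id) pairs.
pairs = [(token, doc_id)
         for doc_id, text in docs.items()
         for token in re.findall(r"[a-z]+", text.lower())]

# 2. Sort the inverted file alphabetically (by term, then doc_id).
pairs.sort()

# 3. Merge multiple entries for one document into a within-document term frequency.
postings = defaultdict(dict)            # term -> {doc_id: term frequency}
for term, doc_id in pairs:
    postings[term][doc_id] = postings[term].get(doc_id, 0) + 1

# 4. Split into a dictionary/lexicon (term -> document frequency) and a postings file.
dictionary = {term: len(entry) for term, entry in postings.items()}
print(dictionary["country"])            # 2: the term appears in both documents
print(postings["time"])                 # {1: 1, 2: 1}
```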

Inverted Indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of: document ID, frequency of the term in the doc (optional), and positions of the term in the doc (optional)
- These lists can be used to solve Boolean queries:
  country -> d1, d2
  manor -> d2
  country AND manor -> d2
- Also used for statistical ranking algorithms
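As a small follow-up to the Boolean example above, here is one way to answer an AND query by intersecting posting lists; it assumes the `postings` mapping built in the previous sketch:

```python
def boolean_and(postings, *terms):
    """Doc IDs containing every query term: intersect the terms' posting lists."""
    doc_sets = [set(postings.get(term, {})) for term in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

print(boolean_and(postings, "country"))            # {1, 2}
print(boolean_and(postings, "country", "manor"))   # {2}
```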

Inverted Indexes for Web Search Engines
- Inverted indexes are still used, even though the web is so huge
- Some systems partition the index across different machines, with each machine handling a different part of the data
- Other systems duplicate the data across many machines, and queries are distributed among the machines
- Most do a combination of the two

From a description of the FAST search engine, by Knut Risvik
(Diagram: a grid of machines.) In this example, the data for the pages is partitioned across machines; additionally, each partition is allocated multiple machines to handle the queries.
- Each row can handle 120 queries per second
- Each column can handle 7M pages
- To handle more queries, add another row
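A back-of-the-envelope sketch of how that grid scales, using only the per-row and per-column figures from the slide; the particular row and column counts below are hypothetical:

```python
QPS_PER_ROW = 120                 # each replica row handles 120 queries/second (from the slide)
PAGES_PER_COLUMN = 7_000_000      # each partition column holds 7M pages (from the slide)

def grid_capacity(rows: int, columns: int):
    """Total query throughput and total indexed pages for a rows x columns grid."""
    return rows * QPS_PER_ROW, columns * PAGES_PER_COLUMN

qps, pages = grid_capacity(rows=4, columns=10)    # made-up grid dimensions
print(qps, pages)   # 480 queries/sec over 70,000,000 pages; add a row to handle more queries
```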

Cascading Allocation of CPUs
A variation on this that produces a cost savings:
- Put high-quality/common pages on many machines
- Put lower-quality/less common pages on fewer machines
- A query goes to the high-quality machines first
- If no hits are found there, it goes to the other machines
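A minimal sketch of that two-tier fallback, with plain dictionaries standing in for the high- and low-quality machine pools:

```python
# Hypothetical tiers: term -> list of matching doc IDs.
high_tier = {"toyota": ["official-site", "fan-club"]}
low_tier = {"toyota": ["obscure-page"], "rare-term": ["niche-page"]}

def search_tier(index, query):
    """Stand-in for querying one pool of machines."""
    return index.get(query, [])

def tiered_search(query):
    """Ask the machines holding high-quality/common pages first;
    fall through to the cheaper pool only if there are no hits."""
    return search_tier(high_tier, query) or search_tier(low_tier, query)

print(tiered_search("toyota"))      # served entirely by the high-quality machines
print(tiered_search("rare-term"))   # no hits up top, so the query cascades down
```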

Web Crawling

Web Crawlers
How do the web search engines get all of the items they index? Main idea:
- Start with known sites
- Record information for these sites
- Follow the links from each site
- Record information found at new sites
- Repeat

Web Crawling Algorithm
More precisely:
- Put a set of known sites on a queue
- Repeat the following until the queue is empty:
  - Take the first page off the queue
  - If this page has not yet been processed:
    - Record the information found on this page (positions of words, links going out, etc.)
    - Add each link on the current page to the queue
    - Record that this page has been processed
Rule of thumb: 1 doc per minute per crawling server
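A single-threaded sketch of that queue-based algorithm; `fetch_page`, `extract_links`, and `record` are hypothetical stand-ins for the real fetching, parsing, and indexing machinery:

```python
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, record):
    """Process each page once: record its contents and enqueue its outgoing links."""
    queue = deque(seed_urls)          # start with a set of known sites
    processed = set()
    while queue:
        url = queue.popleft()         # take the first page off the queue
        if url in processed:
            continue
        processed.add(url)            # record that this page has been processed
        page = fetch_page(url)        # may fail: server down, broken HTML, ...
        if page is None:
            continue
        record(url, page)             # word positions, outgoing links, etc.
        queue.extend(extract_links(page))
```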

Web Crawling Issues
- "Keep out" signs: a file called robots.txt lists off-limits directories
- Freshness: figure out which pages change often, and recrawl those often
- Duplicates, virtual hosts, etc.: convert page contents with a hash function and compare new pages against the hash table
- Lots of problems: server unavailable; incorrect HTML; missing links; attempts to fool the search engine by giving the crawler a version of the page with lots of spurious terms added...
- Web crawling is difficult to do robustly!
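Two of these issues are easy to illustrate in code: honoring robots.txt (here via Python's standard urllib.robotparser) and detecting duplicate content with a hash table. This is a sketch, not production crawler code:

```python
import hashlib
from urllib import robotparser

def allowed_by_robots(url, robots_url, user_agent="example-crawler"):
    """Respect the site's 'keep out' signs before fetching a page."""
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()                          # fetch and parse robots.txt
    return rp.can_fetch(user_agent, url)

seen_hashes = set()

def is_duplicate(page_contents: str) -> bool:
    """Hash page contents and compare against pages already seen."""
    digest = hashlib.sha1(page_contents.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```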

Google: A Case Study

Google's Indexing
- The Indexer converts each doc into a collection of hit lists and puts these into barrels, sorted by docID. It also creates a database of links.
- Hit type: plain or fancy. A fancy hit occurs in a URL, title, anchor text, or metatag.
- Hits have an optimized representation (2 bytes each).
- The Sorter sorts each barrel by wordID to create the inverted index. It also creates a lexicon file.
- Lexicon: the lexicon is mostly cached in memory.
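A rough sketch of what a 2-byte packed hit might look like. The field widths used below (1 capitalization bit, 3 bits of font size, 12 bits of word position for a plain hit) follow the description in the Brin and Page paper, but take them as an illustration of the idea rather than a spec:

```python
def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: 1 capitalization bit, 3 font-size bits,
    12 position bits (larger positions would need to be clamped or flagged)."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(capitalized=True, font_size=3, position=42)
print(hit.to_bytes(2, "big"), unpack_plain_hit(hit))   # two bytes per hit
```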

Google's Inverted Index
(Diagram.) The lexicon is kept in memory; each entry holds a wordID and the number of matching docs, and points into the inverted barrels on disk. Each barrel contains postings for a range of wordIDs, sorted by wordID; within a word's posting list, entries are sorted by docID and consist of the docID, the number of hits, and the hit list itself.

Google
- Sorted barrels = inverted index
- PageRank is computed from the link structure and combined with the IR rank
- The IR rank depends on TF, type of hit, hit proximity, etc.
- Scale: on the order of a billion documents and a hundred million queries a day
- Queries are AND queries

Link Analysis for Ranking Pages
- Assumption: if the pages pointing to this page are good, then this is also a good page.
- References: Kleinberg 1998; Page et al. 1998
- Draws upon earlier research in sociology and bibliometrics.
- Kleinberg's model includes authorities (highly referenced pages) and hubs (pages containing good reference lists).
- Google's model is a version with no hubs, and is closely related to work on influence weights by Pinski and Narin (1976).

Link Analysis for Ranking Pages
Why does this work?
- The official Toyota site will be linked to by lots of other official (or high-quality) sites
- The best Toyota fan-club site probably also has many links pointing to it
- Lower-quality sites do not have as many high-quality sites linking to them

PageRank
Let A1, A2, ..., An be the pages that point to page A, and let C(P) be the number of links out of page P. The PageRank (PR) of page A is defined as:

PR(A) = (1 - d) + d * ( PR(A1)/C(A1) + ... + PR(An)/C(An) )

PageRank is the principal eigenvector of the link matrix of the web, and can be computed as the fixpoint of the above equation.
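A small power-iteration sketch of the fixpoint computation above, on a toy three-page web. It uses the per-page form on this slide, so the ranks sum to the number of pages; dividing by the page count gives the probability view on the next slide. The damping value d = 0.85 is the commonly used setting, an assumption rather than something stated here:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(Ai) / C(Ai)) to its fixpoint.
    `links` maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}                       # uniform starting point
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[q] / len(links[q])
                                   for q in pages if p in links[q])
              for p in pages}
    return pr

# Toy web: A and B link to each other, C links only to A.
toy_links = {"A": ["B"], "B": ["A"], "C": ["A"]}
print(pagerank(toy_links))   # A gets the highest rank; C, with no in-links, the lowest
```

PR(C) converges to exactly 1 - d = 0.15 because no page links to C, which matches the "bored surfer" term of the formula.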

PageRank: User Model
- PageRanks form a probability distribution over web pages: the sum of all pages' ranks is one.
- User model: a random surfer selects a page and keeps clicking links (never going back) until bored, then randomly selects another page and continues.
- PageRank(A) is the probability that such a user visits A; (1 - d) is the probability of getting bored at a page and jumping to a random one.
- Google computes the relevance of a page for a given search by first computing an IR relevance score and then modifying it to take PageRank into account for the top pages.

Web Search Statistics

Searches per Day

Web Search Engine Visits

Percentage of web users who visit the site shown

Search Engine Size (July 2000)

Does size matter? You can't access many hits anyhow.

Increasing numbers of indexed pages, self-reported

Web Coverage

From a description of the FAST search engine, by Knut Risvik

Directory sizes