Web Search Engines

IR System Overview
[Diagram: a query string and a document corpus are fed into the IR system, which returns a ranked list of documents: 1. Doc1, 2. Doc2, 3. Doc3, …]

Search Engine Characteristics
- Unedited: anyone can enter content, leading to quality issues and spam
- Varied information types: phone books, brochures, catalogs, dissertations, news reports, weather, all in one place!
- Different kinds of users: online catalogs serve scholars searching scholarly literature; the Web serves every type of person with every type of goal
- Scale: hundreds of millions of searches per day; billions of documents

Web Search Queries
- Web search queries are SHORT: ~2.4 words on average (Aug 2000), up from ~1.7 (~1997)
- User expectations: many say "the first item shown should be what I want to see!" This works if the user has the most popular/common notion in mind

Directories vs. Search Engines
Directories:
- Hand-selected sites
- Search over the contents of the descriptions of the pages
- Organized in advance into categories
Search engines:
- All pages in all sites
- Search over the contents of the pages themselves
- Organized after the query by relevance rankings or other scores

What about Ranking?
- Lots of variation here; often messy, with details proprietary and fluctuating
- Combining subsets of:
  - IR-style relevance: based on term frequencies, proximities, position (e.g., in title), font, etc.
  - Popularity information
  - Link analysis information
- Most use a variant of vector space ranking to combine these. Here's how it might work (see the sketch below): make a vector of weights for each feature, then multiply this by the counts for each feature
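A minimal sketch of this weighted-feature scoring in Python; the feature names, weights, and counts are illustrative assumptions, not any real engine's values:

```python
# Hypothetical per-feature weights; a real engine tunes these constantly.
FEATURE_WEIGHTS = {
    "tf": 1.0,          # term frequency in the document
    "in_title": 3.0,    # query term appears in the title
    "proximity": 2.0,   # query terms occur near each other
    "popularity": 1.5,  # popularity / link-analysis signal
}

def score(feature_counts: dict) -> float:
    """Dot product of the weight vector with a document's feature counts."""
    return sum(w * feature_counts.get(f, 0.0) for f, w in FEATURE_WEIGHTS.items())

# Example: a document with 4 term occurrences, one of them in the title.
print(score({"tf": 4, "in_title": 1, "proximity": 2, "popularity": 0.3}))
```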

What is Really Being Used?
Today's search engines combine these methods in various ways:
- Integration of directories: most web search engines integrate categories into the results listings (Lycos, MSN, Google)
- Link "co-citation": which sites are linked to by other sites? Words on the links seem to be especially useful. Google uses it; others are using it or will soon
- Page popularity: frequently visited pages (in general) and frequently visited pages as a result of a query. Many use DirectHit's popularity rankings

Web Spam
What are the types of Web spam?
- Add extra terms to get a higher ranking (e.g., repeat "cars" thousands of times)
- Add irrelevant terms to get more hits (put a dictionary in the comments field, or put extra terms in the same color as the background of the web page)
- Add irrelevant terms to get different types of hits (put "Madonna" in the title field on sites that are selling cars)
- Add irrelevant links to boost your link analysis ranking
There is a constant "arms race" between web search companies and spammers.

Web Search Architecture

Standard Web Search Engine Architecture
[Diagram: crawlers crawl the web; documents are checked for duplicates and stored under DocIds; an inverted index is created from them; search engine servers take the user query, consult the inverted index, and show results to the user.]

How Are Inverted Files Created?

Inverted Indexes
- Permit fast search for individual terms
- For each term, you get a list consisting of: document ID, frequency of term in doc (optional), position of term in doc (optional)
- These lists can be used to solve Boolean queries (see the sketch below):
  country -> d1, d2
  manor -> d2
  country AND manor -> d2
- Also used for statistical ranking algorithms
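A minimal sketch of such an index in Python, using the country/manor example above; the document contents are made up for illustration:

```python
from collections import defaultdict

docs = {"d1": "a house in the country", "d2": "a country manor"}

# Build the inverted index: term -> set of document IDs.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

print(index["country"])                   # {'d1', 'd2'} (set order may vary)
print(index["manor"])                     # {'d2'}
print(index["country"] & index["manor"])  # Boolean AND = set intersection: {'d2'}
```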

Inverted Indexes for Web Search Engines
- Inverted indexes are still used, even though the web is so huge
- Some systems partition the indexes across different machines; each machine handles different parts of the data
- Other systems duplicate the data across many machines; queries are distributed among the machines
- Most do a combination of these (see the sketch below)
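A minimal sketch of the partitioning idea, routing each term's posting list to a machine by hashing the term; the machine count and hash choice are illustrative assumptions:

```python
import hashlib

NUM_MACHINES = 4

def machine_for_term(term: str) -> int:
    """Deterministically assign a term's posting list to one machine."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return digest[0] % NUM_MACHINES

# A query only needs to contact the machines that own its terms.
query = ["country", "manor"]
print({term: machine_for_term(term) for term in query})
```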

Web Crawlers
How do web search engines get all of the items they index? Main idea:
- Start with known sites
- Record information for these sites
- Follow the links from each site
- Record information found at new sites
- Repeat

Web Crawling Algorithm
More precisely (see the sketch below):
- Put a set of known sites on a queue
- Repeat the following until the queue is empty:
  - Take the first page off of the queue
  - If this page has not yet been processed:
    - Record the information found on this page (positions of words, links going out, etc.)
    - Add each link on the current page to the queue
    - Record that this page has been processed
Rule of thumb: 1 doc per minute per crawling server
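A minimal sketch of this loop in Python; `fetch` and `extract_links` stand in for real downloading and HTML parsing, which are outside the scope of the sketch:

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links):
    """Breadth-first crawl: seed_urls is the initial queue of known sites."""
    queue = deque(seed_urls)
    processed = set()
    while queue:
        url = queue.popleft()              # take the first page off the queue
        if url in processed:               # skip pages already processed
            continue
        page = fetch(url)                  # record information found on this page
        for link in extract_links(page):
            queue.append(link)             # add each outgoing link to the queue
        processed.add(url)                 # record that this page was processed
```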

Web Crawling Issues
- "Keep out" signs: a file called robots.txt tells the crawler which directories are off limits
- Freshness: figure out which pages change often, and recrawl these often
- Duplicates, virtual hosts, etc.: convert page contents with a hash function, then compare new pages to the hash table (see the sketch below)
- Lots of problems: server unavailable, incorrect HTML, missing links, infinite loops
Web crawling is difficult to do robustly!
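A minimal sketch of hash-based duplicate detection; the choice of SHA-256 is an illustrative assumption, as any stable hash of the page contents works the same way:

```python
import hashlib

seen_hashes = set()

def is_duplicate(page_content: bytes) -> bool:
    """Hash the page contents and compare against pages already seen."""
    digest = hashlib.sha256(page_content).hexdigest()
    if digest in seen_hashes:
        return True                # same content reached via another URL
    seen_hashes.add(digest)
    return False

print(is_duplicate(b"<html>hello</html>"))  # False: first time seen
print(is_duplicate(b"<html>hello</html>"))  # True: exact duplicate
```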

Google Search Engine Features
Two main features to increase result precision:
- Uses the link structure of the web (PageRank)
- Uses the text surrounding hyperlinks to improve accurate document retrieval
Other features include:
- Takes into account word proximity in documents
- Uses font size, word position, etc. to weight words
- Storage of full raw HTML pages

Google
- Sorted barrels = inverted index
- PageRank computed from link structure; combined with IR rank
- IR rank depends on TF, type of "hit", hit proximity, etc.
- Scale: on the order of a billion documents; a hundred million queries a day
- Queries are treated as AND queries

Google's Indexing
- The indexer converts each doc into a collection of "hit lists" and puts these into "barrels", sorted by docID. It also creates a database of "links".
  - Hit: <wordID, position in doc, font info, hit type>
  - Hit type: plain or fancy. Fancy hits occur in a URL, title, anchor text, or metatag.
  - Optimized representation of hits (2 bytes each)
- The sorter sorts each barrel by wordID to create the inverted index. It also creates a lexicon file.
  - Lexicon: <wordID, offset into inverted index>
  - Lexicon is mostly cached in memory
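A small illustration of packing a plain hit into 2 bytes, following the layout described in the Brin and Page paper (1 capitalization bit, 3 bits of font size, 12 bits of word position); the helper names here are hypothetical:

```python
def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: [cap:1][font:3][position:12]."""
    assert 0 <= font_size < 8 and 0 <= position < 4096
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_plain_hit(hit: int):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(capitalized=True, font_size=3, position=41)
print(hex(hit), unpack_plain_hit(hit))  # fits in 2 bytes
```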

Google's Inverted Index
Each "barrel" contains postings for a range of wordIDs.

Link Analysis for Ranking Pages
Assumption: if the pages pointing to this page are good, then this is also a good page.
Why does this work?
- The official Toyota site will be linked to by lots of other official (or high-quality) sites
- The best Toyota fan-club site probably also has many links pointing to it
- Lower-quality sites do not have as many high-quality sites linking to them

PageRank
- Let A1, A2, …, An be the pages that point to page A
- Let C(P) be the number of links out of page P, and let d be a damping factor
- The PageRank (PR) of page A is defined as:
  PR(A) = (1 - d) + d * (PR(A1)/C(A1) + PR(A2)/C(A2) + … + PR(An)/C(An))
- PageRanks form a probability distribution over web pages: the sum of all pages' ranks is one

PageRank: User Model
- User model ("random surfer"): a user selects a page and keeps clicking links (never "back") until bored, then randomly selects another page and continues
- PageRank(A) is the probability that such a user visits A
- d is the damping factor: the probability of continuing to click links; (1 - d) is the probability of getting bored at a page and jumping to a random one
- Google computes the relevance of a page for a given search by first computing an IR relevance and then modifying it to take into account the PageRank of the top pages
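A minimal power-iteration sketch of PageRank on a toy three-page graph. It uses the normalized (1 - d)/N jump term so that the ranks sum to one, matching the probability-distribution statement above (the paper's formula writes (1 - d) without the 1/N factor); the graph, damping value, and iteration count are illustrative:

```python
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # toy link graph
d = 0.85                                           # damping factor
pages = list(links)
pr = {p: 1.0 / len(pages) for p in pages}          # start from a uniform guess

for _ in range(50):                                # iterate until (roughly) stable
    pr = {
        p: (1 - d) / len(pages)
           + d * sum(pr[q] / len(links[q]) for q in pages if p in links[q])
        for p in pages
    }

print(pr)  # the three ranks sum to (approximately) one
```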