IST 516 Fall 2011 Dongwon Lee, Ph.D.


Search Engines

Search Engine Overview A search engine typically consists of:
- Crawler: crawls the web to identify interesting URLs to fetch
- Fetcher: fetches web documents from the stored URLs
- Indexer: builds local indexes (e.g., an inverted index) from the fetched web documents
- Query Handler: processes users' queries using the indexes and prepares answers accordingly
- Presenter: user-interface component that presents answers to users

1. Crawler (more later) Also called a robot or spider. Views the Web as a graph and employs different graph-search strategies:
- Depth-first
- Breadth-first
- Frontier: the set of URLs discovered but not yet visited
Objectives:
- Completeness
- Freshness
- Resource maximization
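As a sketch of breadth-first crawling, the frontier can be kept as a FIFO queue. The link graph below is a hypothetical stand-in for the Web; a real crawler would fetch each URL and parse its outgoing links.

```python
from collections import deque

# Hypothetical link graph standing in for the Web.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html", "d.html"],
    "d.html": [],
}

def crawl_bfs(seed):
    """Breadth-first crawl: the frontier is a FIFO queue of discovered URLs."""
    frontier = deque([seed])
    visited = []
    while frontier:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.append(url)
        # A real crawler would fetch and parse the page here.
        for link in LINKS.get(url, []):
            if link not in visited:
                frontier.append(link)
    return visited
```

Swapping the deque for a stack (popping from the same end that links are pushed) turns this into a depth-first crawl.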

2. Fetcher The crawler generates a stack of URLs to visit; the fetcher retrieves the web document for each URL. Typically multi-threaded.

3. Indexer To handle the large-scale data of the Web, one needs to build an index structure. An index is a small (memory-resident) data structure that helps locate data fast, at the cost of extra space: trading space for time. In search engines and IR, the most popular form of index is the Inverted Index.

Inverted Index A list for every word (index term). The list of term t holds the locations (documents + offsets) where t appears.
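A minimal sketch of building an inverted index over a hypothetical mini-collection (doc id → tokenized text), mapping each term to its list of (doc id, offset) postings:

```python
from collections import defaultdict

# Hypothetical mini-collection: doc id -> tokenized text.
docs = {
    1: ["penn", "state", "football"],
    2: ["state", "gov"],
    3: ["psu", "football"],
}

# Build the inverted index: term -> list of (doc id, offset) postings.
index = defaultdict(list)
for doc_id, tokens in docs.items():
    for offset, term in enumerate(tokens):
        index[term].append((doc_id, offset))
```

Looking up `index["football"]` then returns the postings [(1, 2), (3, 1)]: the term appears in doc 1 at offset 2 and in doc 3 at offset 1.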

4. Query Handler Given a query keyword Q, which web documents are the right answers?
- Boolean-matching model: return all documents that contain Q
- Vector space model: return a ranked list of documents that have the largest cosine similarity to Q
- PageRank model: return a ranked list of documents that have the highest PageRank values, combined with another matching model

Boolean-Matching

Vector Space Model Cosine similarity: the similarity of two same-length vectors, measured by the cosine of the angle between them:

cos(V1, V2) = (V1 . V2) / (||V1|| ||V2||)

- Range: -1 to +1
- Dot product: V1 . V2
- Magnitude: ||V1||

Vector Space Model Given N web documents gathered, extract the set of all significant tokens (i.e., words), say T; |T| becomes the dimension of the vectors.
- Convert each web document w_i to a |T|-length Boolean vector, say V_w_i
- Given a query string Q, convert it to a |T|-length Boolean vector, say V_q
- Compute the cosine similarity between V_q and each V_w_i
- Sort the similarity scores in descending order

Vector Space Model Example Three documents:
- D1 = {penn, state, football}
- D2 = {state, gov}
- D3 = {psu, football}
Vector space representation over V = {football, gov, penn, psu, state}:
- D1 = [1, 0, 1, 0, 1]
- D2 = [0, 1, 0, 0, 1]
- D3 = [1, 0, 0, 1, 0]
Query Q = {state, football}:
- Q = [1, 0, 0, 0, 1]
Which doc is the closest to Q?
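The example above can be worked out with a few lines of Python; the vectors and vocabulary are taken directly from the slide.

```python
import math

def cosine(v1, v2):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(a * b for a, b in zip(v1, v2))
    mag = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / mag

# Vocabulary V = [football, gov, penn, psu, state]
D1 = [1, 0, 1, 0, 1]
D2 = [0, 1, 0, 0, 1]
D3 = [1, 0, 0, 1, 0]
Q  = [1, 0, 0, 0, 1]

scores = {name: cosine(d, Q) for name, d in [("D1", D1), ("D2", D2), ("D3", D3)]}
# D1 scores 2 / (sqrt(3) * sqrt(2)) ~ 0.816 while D2 and D3 score 0.5,
# so D1 is the closest document to Q.
```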

Term-Weighting Instead of the Boolean vector space model, each dimension carries an importance weight for the corresponding token:

[1, 0, 0, 1, 0] → [0.9, 0.12, 0.14, 0.89, 0.13]

Term-Weighting E.g., in tf-idf, the importance of a term t:
- increases proportionally to the number of times t appears in the document → tf
- is offset by how widely t appears across the corpus → idf
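A sketch of one common tf-idf variant, w(t, d) = tf(t, d) × log(N / df(t)), over a hypothetical toy corpus (many weighting variants exist; this is for illustration only):

```python
import math

# Hypothetical toy corpus of tokenized documents.
docs = [
    ["penn", "state", "football", "football"],
    ["state", "gov"],
    ["psu", "football"],
]
N = len(docs)

def tf_idf(term, doc):
    """tf-idf weight: raw term frequency times log inverse document frequency."""
    tf = doc.count(term)                          # occurrences in this document
    df = sum(1 for d in docs if term in d)        # documents containing the term
    return tf * math.log(N / df)

# "football" occurs twice in docs[0] but also appears in docs[2], so its
# idf dampens the weight; "penn" occurs once but only in docs[0], so it
# gets the full idf of log(3).
```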

PageRank Uses the link graph of the web to prioritize the results of a search.
- d: damping factor (e.g., 0.85)
- C(Ti): number of outgoing links of page Ti
High PageRank when:
- many pages point to it, or
- some pages pointing to it have high PageRank

PageRank E.g., PR(A) = (1 - d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
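The formula above can be iterated to a fixed point. Below is a sketch over a hypothetical three-page graph (A → B, B → A, C → A), using the slide's non-normalized formulation with d = 0.85:

```python
# Hypothetical three-page link graph: A -> B, B -> A, C -> A.
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
d = 0.85  # damping factor

pr = {page: 1.0 for page in links}
for _ in range(100):  # power iteration until (approximate) convergence
    pr = {
        page: (1 - d) + d * sum(pr[p] / len(links[p])
                                for p in links if page in links[p])
        for page in links
    }
# C has no in-links, so PR(C) = 1 - d = 0.15, while A accumulates rank
# from both B and C and ends up with the highest value.
```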

Effect of Link Structure (1) Example of simple PR: the initial PRs of pages A, B, and C are all 0.15. [Figure: several three-page link structures over pages A, B, and C, with the PR values each structure converges to, e.g. Page A = 0.15, 1, 1.85, or 1.4594 depending on the link structure.]

Effect of Link Structure (2) Example of practical PR. A rank sink is a page with no outbound links; its PR goes to 0. [Figure: a five-page graph (A–E) where D and E form a sink. Simple PR: A = 0.6277, B = 0.6836, C = 0.4405, D = 1.5735, E = 1.6747. Practical PR: A = 1.1922, B = 1.1634, C = 0.6444, D = 0.1500, E = 0.1500.]

5. Presenter Different presentation models Simple keyword vs. Advanced interface

5. Presenter Different presentation models: ranked list (Google, Bing) vs. clustered list (Yippy)

Evaluation of Results The deciding factor for a search engine is its effectiveness. Two factors:
- Precision: the percentage of the returned documents that are actually relevant to the query
- Recall: the percentage of all relevant documents in the collection that were actually returned for the query

Evaluation of Results (cont.)
- T: set of true-relevant documents
- R: set of retrieved documents
- Precision = |T ∩ R| / |R|
- Recall = |T ∩ R| / |T|
- F-measure = (2 × Precision × Recall) / (Precision + Recall)
[Figure: a precision–recall (P-R) graph, with precision P on the vertical axis and recall R on the horizontal axis.]
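The three measures above are a few lines of set arithmetic; the document sets below are hypothetical.

```python
# Hypothetical evaluation: T = true-relevant docs, R = retrieved docs.
T = {1, 2, 3, 4}
R = {2, 3, 5}

precision = len(T & R) / len(R)   # 2 of the 3 retrieved docs are relevant
recall = len(T & R) / len(T)      # 2 of the 4 relevant docs were retrieved
f_measure = 2 * precision * recall / (precision + recall)
```

Here precision = 2/3, recall = 1/2, and the F-measure (their harmonic mean) is 4/7 ≈ 0.571.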

A Lot More Details Later Through the second half of the semester, we will review each component of search engines and the principles behind it in more detail. We will use materials from the IIR textbook; its contents are freely available at: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html