1 Search Engines
IST 516 Fall 2011 Dongwon Lee, Ph.D.

2 Search Engine Overview
A search engine typically consists of:
Crawler: crawls the web to identify interesting URLs to fetch
Fetcher: fetches web documents from the stored URLs
Indexer: builds local indexes (e.g., an inverted index) from the fetched web documents
Query Handler: processes users' queries using the indexes and prepares answers accordingly
Presenter: user interface component that presents answers to users
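A minimal sketch of how these five components might fit together; all class names and method signatures here are illustrative assumptions, not from the slides:

```python
# Illustrative skeleton of the five-component pipeline described above.
class SearchEngine:
    def __init__(self, crawler, fetcher, indexer, query_handler, presenter):
        self.crawler = crawler
        self.fetcher = fetcher
        self.indexer = indexer
        self.query_handler = query_handler
        self.presenter = presenter

    def build(self, seed_urls):
        urls = self.crawler.discover(seed_urls)   # 1. identify URLs to fetch
        docs = self.fetcher.fetch_all(urls)       # 2. download the documents
        self.indexer.build_index(docs)            # 3. build local indexes

    def search(self, query):
        answers = self.query_handler.answer(query, self.indexer)  # 4. match + rank
        return self.presenter.render(answers)                     # 5. present to user
```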

3 1. Crawler (more later)
Also called a robot or spider
Views the Web as a graph
Employs different graph-search algorithms: depth-first, breadth-first
Maintains a frontier of URLs to visit
Objectives: completeness, freshness, resource maximization
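A sketch of the crawler as graph search, assuming a stub `extract_links(url)` that stands in for fetching and parsing a page; switching the frontier between a queue and a stack toggles breadth-first vs. depth-first traversal:

```python
from collections import deque

def crawl(seed_urls, extract_links, max_pages=1000, breadth_first=True):
    """Graph-search sketch of a crawler. extract_links(url) is an
    illustrative stub standing in for fetching and parsing a page."""
    frontier = deque(seed_urls)  # the frontier: URLs discovered but not yet visited
    visited = set()
    while frontier and len(visited) < max_pages:
        # Breadth-first pops the oldest URL (queue); depth-first the newest (stack).
        url = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        for link in extract_links(url):
            if link not in visited:
                frontier.append(link)
    return visited
```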

4 2. Fetcher
The crawler generates a stack of URLs to visit
The fetcher retrieves the web document for each specific URL
Typically multi-threaded
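A minimal multi-threaded fetcher sketch using Python's standard library; the thread count and timeout are illustrative defaults:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Fetch one document; return (url, body) or (url, None) on failure."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return url, resp.read()
    except Exception:
        return url, None

def fetch_all(urls, n_threads=8):
    """Fetch many URLs concurrently, as a multi-threaded fetcher would."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return dict(pool.map(fetch, urls))
```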

5 3. Indexer
To handle the large-scale data of the Web, one needs to build an index structure
An index is a small (memory-resident) data structure that helps locate data fast, at the cost of extra space: trading space for time
In search engines and IR, the popular form of index is called the Inverted Index

6 Inverted Index
A list for every word (index term)
The list of term t holds the locations (documents + offsets) where t appeared
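A toy inverted-index builder along these lines; whitespace tokenization is a simplifying assumption, as a real indexer would also normalize and stem terms:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {term: [(doc_id, offset), ...]},
    where offset is the term's word position within the document."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term].append((doc_id, offset))
    return index

docs = {1: "penn state football", 2: "state gov", 3: "psu football"}
index = build_inverted_index(docs)
# index["state"] -> [(1, 1), (2, 0)]: "state" appears in doc 1 (position 1)
# and in doc 2 (position 0)
```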

7 4. Query Handler
Given a query keyword Q, which web documents are the right answers?
E.g., Boolean-matching model: return all documents that contain Q
Vector space model: return a ranked list of documents that have the largest cosine() similarity to Q
PageRank model: return a ranked list of documents that have the highest PageRank values, used in addition to another matching model

8 Boolean-Matching
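A minimal sketch of the Boolean-matching model: return every document whose posting list contains all query terms (AND semantics). The small `postings` map is an illustrative stand-in for a full inverted index:

```python
def boolean_and_match(query, postings):
    """postings: {term: set of doc_ids containing the term}.
    Returns the doc_ids that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(postings.get(terms[0], set()))
    for term in terms[1:]:
        result &= postings.get(term, set())  # intersect posting lists
    return result

postings = {"penn": {1}, "state": {1, 2}, "football": {1, 3},
            "gov": {2}, "psu": {3}}
print(boolean_and_match("state football", postings))  # {1}
```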

9 Vector Space Model
Cosine similarity: similarity of two same-length vectors, measured by the cosine of the angle between them
cosine(V1, V2) = (V1 . V2) / (||V1|| ||V2||)
Range: [0, 1] for non-negative (e.g., boolean) vectors
Dot product: V1 . V2
Magnitude: ||V1||

10 Vector Space Model
Given N web documents gathered, extract the set of all significant tokens (i.e., words), say T; |T| becomes the dimension of the vectors
Convert each web document w_i to a |T|-length boolean vector, say V_w_i
Given a query string Q, convert it to a |T|-length boolean vector, say V_q
Compute the cosine similarity between V_q and each V_w_i
Sort the similarity scores in descending order

11 Vector Space Model Example
3 documents:
D1 = {penn, state, football}
D2 = {state, gov}
D3 = {psu, football}
Vector space representation, with V = {football, gov, penn, psu, state}:
D1 = [1, 0, 1, 0, 1]
D2 = [0, 1, 0, 0, 1]
D3 = [1, 0, 0, 1, 0]
Query Q = {state, football}:
Q1 = [1, 0, 0, 0, 1]
Which doc is the closest to Q? (Worked out in the sketch below.)
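A short sketch that works out this example; the cosine function follows the formula on slide 9:

```python
import math

def cosine(v1, v2):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(a * b for a, b in zip(v1, v2))
    mag = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / mag if mag else 0.0

# V = {football, gov, penn, psu, state}
D1 = [1, 0, 1, 0, 1]
D2 = [0, 1, 0, 0, 1]
D3 = [1, 0, 0, 1, 0]
Q  = [1, 0, 0, 0, 1]

for name, d in [("D1", D1), ("D2", D2), ("D3", D3)]:
    print(name, round(cosine(Q, d), 3))
# D1 0.816, D2 0.5, D3 0.5  -> D1 is the closest to Q
```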

12 Term-Weighting
Instead of the boolean vector-space model, each dimension carries an importance weight for the corresponding token:
[1, 0, 0, 1, 0] → [0.9, 0.12, 0.14, 0.89, 0.13]

13 Term-Weighting
E.g., in tf-idf, the importance of a term t:
Increases proportionally to the # of times t appears in the document → tf
Is offset by the # of times t appears in the corpus → idf
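A toy tf-idf sketch along these lines, using raw term frequency and idf = log(N / document frequency); real systems typically use smoothed or normalized variants:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: text}. Returns {doc_id: {term: tf-idf weight}} using
    raw term frequency and idf = log(N / document frequency)."""
    N = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()                       # document frequency per term
    for tokens in tokenized.values():
        df.update(set(tokens))
    vectors = {}
    for d, tokens in tokenized.items():
        tf = Counter(tokens)             # term frequency within this document
        vectors[d] = {t: tf[t] * math.log(N / df[t]) for t in tf}
    return vectors
```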

14 PageRank
Uses the link graph of the web
Prioritizes the results of a search
d: damping factor (e.g., 0.85)
C(Ti): # of outgoing links of the page Ti
High PageRank when:
Many pages point to it, or
Some pages pointing to it have high PageRank

15 PageRank
E.g., PR(A) = (1-d) + d(PR(T1)/C(T1) + … + PR(Tn)/C(Tn))
[Figure: example link graph; the per-page PR values (8, 4, 1, 2, and one marked "?") are not fully recoverable from the transcript]
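A sketch of iterating this formula to a fixed point; the three-page cycle is an illustrative graph, and pages without out-links (the rank sinks of slide 17) would need special handling that this sketch omits:

```python
def pagerank(links, d=0.85, iterations=50):
    """links: {page: [pages it links to]}. Iterates the slide's formula
    PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) over pages Ti linking to A.
    Assumes every page has at least one outgoing link (no rank sinks)."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = (pr[q] / len(links[q]) for q in links if p in links[q])
            new_pr[p] = (1 - d) + d * sum(incoming)
        pr = new_pr
    return pr

# Three pages where A links to B, B links to C, C links back to A:
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A"]}))
# Symmetric cycle -> every page converges to PR = 1.0
```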

16 Effect of Link Structure (1)
Example of simple PR
Initial PRs of pages A, B, and C are all 0.15
[Figure: three-page link graph (Pages A, B, C) with PR values over four iterations; the per-iteration values are garbled in the transcript, ending with Page B = 1 and Page C = 0.575]

17 Effect of Link Structure (2)
Example of practical PR
Rank sink: a page with no outbound links, whose PR tends toward 0
[Figure: five-page link graphs (Pages A-E) comparing simple PR vs. practical PR with a sink; the per-page values are garbled in the transcript]

18 5. Presenter Different presentation models
Simple keyword vs. Advanced interface

19 5. Presenter
Different presentation models
Ranked list (Google, Bing) vs. clustered list (Yippy)

20 Evaluation of Results
The deciding factor for a search engine is its effectiveness. Two factors:
Precision: the percentage of the returned documents that are actually relevant to the particular query
Recall: the percentage of all relevant documents in the available set that were actually returned for the particular query

21 Evaluation of Results (cont.)
T: set of true-relevant documents
R: set of retrieved documents
Precision = |T ∩ R| / |R|
Recall = |T ∩ R| / |T|
F-measure = 2 × Precision × Recall / (Precision + Recall)
[Figure: precision-recall (P-R) graph, precision on one axis and recall on the other]
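A small sketch computing these three measures from the sets T and R; the example sets are illustrative:

```python
def evaluate(relevant, retrieved):
    """relevant: set of true-relevant doc ids (T);
    retrieved: set of doc ids returned by the engine (R)."""
    hit = len(relevant & retrieved)                      # |T intersect R|
    precision = hit / len(retrieved) if retrieved else 0.0
    recall = hit / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

# T = {1, 2, 3, 4}, R = {2, 3, 5}: 2 of the 3 retrieved are relevant,
# and 2 of the 4 relevant documents were retrieved.
print(evaluate({1, 2, 3, 4}, {2, 3, 5}))  # ~(0.667, 0.5, 0.571)
```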

22 A Lot More Details Later
Through the second half of the semester, we will review:
Each component of search engines, and
The principles behind it, in more detail
We will use materials from the IIR (Introduction to Information Retrieval) textbook
Contents are freely available online

