Quality of a search engine

Presentation transcript:

Quality of a search engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval". Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Is it good? How fast does it index: number of documents/hour (average document size). How fast does it search: latency as a function of index size. Expressiveness of the query language.

Measures for a search engine: All of the preceding criteria are measurable. The key measure is user happiness: useless answers won't make a user happy. User groups for testing!

General scenario [Figure: the Retrieved and Relevant sets of documents, drawn as subsets of the collection]

Precision vs. Recall. Precision: % of retrieved docs that are relevant [issue: "junk" found]. Recall: % of relevant docs that are retrieved [issue: "info" found]. [Figure: the Retrieved and Relevant sets within the collection]

How to compute them. Precision: fraction of retrieved docs that are relevant. Recall: fraction of relevant docs that are retrieved.

Precision P = tp/(tp + fp)
Recall R = tp/(tp + fn)

                 Relevant               Not Relevant
Retrieved        tp (true positive)     fp (false positive)
Not Retrieved    fn (false negative)    tn (true negative)
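The two formulas above can be sketched in a few lines. This is only an illustration: the document identifiers and the helper name `precision_recall` are made up for the example and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (P, R) with P = tp/(tp+fp) and R = tp/(tp+fn)."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant ("junk")
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical query result: 4 docs retrieved, 3 docs are actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(p, r)  # tp=2, fp=2, fn=1
```

Note that tn (the non-relevant docs that were not retrieved) plays no role in either measure.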

Precision-Recall curve: Measure Precision at various levels of Recall. [Figure: plot of precision (vertical axis) against recall (horizontal axis), with a few sample points]

A common picture [Figure: plot of precision (vertical axis) against recall (horizontal axis), with a few sample points]

F measure. Combined measure (weighted harmonic mean): 1/F = α (1/P) + (1 − α) (1/R). People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R).
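A minimal sketch of the weighted harmonic mean, assuming the weight is a parameter `alpha` in [0, 1] applied to 1/P (so `alpha = 0.5` gives the balanced F1). The function name and the sample values are illustrative only.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean: 1/F = alpha/P + (1 - alpha)/R."""
    if precision == 0 or recall == 0:
        return 0.0  # the harmonic mean is 0 as soon as one of the two is 0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


f1 = f_measure(0.5, 2 / 3)  # balanced F1 for P = 1/2, R = 2/3
print(f1)
```

With α = ½ this is the same as 2PR/(P + R). Unlike the arithmetic mean, the harmonic mean stays low when either P or R is low.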

Ranking: Link-based Ranking (2nd generation). Reading 21

The Web as a Directed Graph (Sec. 21.1) [Figure: Page A points to Page B through a hyperlink with its anchor]. Assumption 1: A hyperlink between pages denotes author-perceived relevance (quality signal). Assumption 2: The text in the anchor of the hyperlink describes the target page (textual context).

Anchor Text (Sec. 21.1.1) For the query ibm, how to distinguish between: IBM’s home page (mostly graphical); IBM’s copyright page (high term freq. for ‘ibm’); a rival’s spam page (arbitrarily high term freq.)? [Figure: anchor texts “ibm”, “ibm.com” and “IBM home page” pointing to www.ibm.com] A million pieces of anchor text with “ibm” send a strong signal.

Indexing anchor text (Sec. 21.1.1) When indexing a document D, include anchor text from links pointing to D. [Figure: three pages linking to www.ibm.com, with surrounding text “Armonk, NY-based computer giant IBM announced today”, “Joe’s computer hardware links: Sun, HP, IBM” and “Big Blue today announced record profits for the quarter”]

Indexing anchor text (Sec. 21.1.1) Can sometimes have unexpected side effects, e.g., evil empire. Can score anchor text with a weight depending on the authority of the anchor page’s website.
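As a rough sketch of "include anchor text from links pointing to D": the toy inverted index below adds the anchor terms of every link to the posting list of the link's target. The URLs other than www.ibm.com, the page texts and the function name `build_index` are hypothetical, and tokenization is reduced to lowercase whitespace splitting. The authority-dependent weighting mentioned above is not modeled.

```python
from collections import defaultdict


def build_index(pages, links):
    """pages: {url: body text}; links: list of (source_url, target_url, anchor_text).

    Returns a term -> set of urls index in which each page is indexed by its
    own words plus the anchor text of the links that point to it.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    for _source, target, anchor in links:
        for term in anchor.lower().split():
            index[term].add(target)  # anchor terms describe the *target* page
    return index


pages = {
    "www.ibm.com": "welcome",  # mostly graphical page: the term 'ibm' never occurs in its text
    "news.example/story": "armonk computer giant announced today",
    "joes.example/links": "computer hardware links",
}
links = [
    ("news.example/story", "www.ibm.com", "IBM"),
    ("joes.example/links", "www.ibm.com", "IBM"),
]
index = build_index(pages, links)
print(index["ibm"])
```

Here the home page is found for the query ibm only thanks to the anchor text of its in-links.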

Random Walks. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

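Definitions: adjacency matrix A, transition matrix P. [Figure: a 3-node example graph (nodes 1, 2, 3) with its adjacency matrix A and its transition matrix P; edges carry probabilities 1 and 1/2] Any edge weighting is possible: any proposals?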
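One simple weighting, sketched here as an assumption rather than as the slide's own choice, is to split the probability uniformly over the out-links, i.e. to normalize each row of A. The 3-node graph below is a made-up example, not necessarily the one in the figure.

```python
def transition_matrix(adj):
    """Row-normalize an adjacency matrix: P[j][i] = A[j][i] / out-degree(j)."""
    P = []
    for row in adj:
        out_degree = sum(row)
        if out_degree == 0:
            raise ValueError("node without out-links: its row cannot be normalized")
        P.append([a / out_degree for a in row])
    return P


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1 (rows/columns are nodes 1, 2, 3).
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
P = transition_matrix(A)
for row in P:
    print(row)
```

Every row of P sums to 1, so row j is the probability distribution of the next node when the walker is at node j.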

What is a random walk [Figure: animation on the 3-node example graph showing the position of the walker at times t=0, t=1, t=2 and t=3; edges carry transition probabilities 1 and 1/2]

Probability Distributions: x_t(i) = probability that the surfer is at node i at time t. x_{t+1}(i) = ∑_j (probability of being at node j) * Pr(j→i) = ∑_j x_t(j) * P(j,i), i.e. in vector form x_{t+1} = x_t P. [Figure: the 3-node example at t=2, with the transition matrix P and the vectors x_t and x_{t+1}]

Probability Distributions: x_t(i) = probability that the surfer is at node i at time t. x_{t+1}(i) = ∑_j (probability of being at node j) * Pr(j→i) = ∑_j x_t(j) * P(j,i), i.e. x_{t+1} = x_t P. Hence x_{t+1} = x_t P = x_{t-1} P^2 = x_{t-2} P^3 = … = x_0 P^{t+1}. What happens when the surfer keeps walking for a long time? This is the so-called stationary distribution.
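The update x_{t+1} = x_t P is a row vector times a matrix. A small sketch, using a hypothetical 3-node transition matrix and a walker that starts at the first node:

```python
def step(x, P):
    """One step of the walk: x_next(i) = sum_j x(j) * P[j][i]."""
    n = len(P)
    return [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]


# Hypothetical transition matrix (each row sums to 1).
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

x = [1.0, 0.0, 0.0]      # x_0: the surfer starts at the first node
history = [x]
for _ in range(3):       # compute x_1, x_2, x_3
    x = step(x, P)
    history.append(x)

for t, dist in enumerate(history):
    print(t, dist)
```

Applying `step` t+1 times to x_0 is exactly x_0 P^{t+1}.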

Stationary Distribution: The stationary distribution at a node is related to the proportion of time a random walker spends visiting that node. It is reached when the distribution does not change anymore: x_{T+1} = x_T, i.e. x_T P = 1 * x_T (x_T is a left eigenvector of P with eigenvalue 1). For “well-behaved” graphs this does not depend on the start distribution x_0 in x_{t+1} = x_0 P^{t+1}.
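A sketch of how the fixed point x P = x can be approximated by simply iterating the walk until the distribution stops changing. The transition matrix is a hypothetical 3-node example that is irreducible and aperiodic; the tolerance and the iteration cap are arbitrary choices.

```python
def step(x, P):
    """x_next(i) = sum_j x(j) * P[j][i]."""
    n = len(P)
    return [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]


def stationary(P, x0, tol=1e-12, max_iter=10_000):
    """Iterate x <- x P until two consecutive distributions differ by less than tol."""
    x = list(x0)
    for _ in range(max_iter):
        nxt = step(x, P)
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    raise RuntimeError("no convergence: the chain may be periodic or reducible")


P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

pi_a = stationary(P, [1.0, 0.0, 0.0])   # walker starts at the first node
pi_b = stationary(P, [0.0, 1.0, 0.0])   # walker starts at the second node
print(pi_a)
print(pi_b)
```

For this matrix both starts approach (0.4, 0.2, 0.4), which can be checked by hand from x P = x together with the constraint that the entries sum to 1.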

Interesting questions: Does a stationary distribution always exist? Is it unique? Yes, if the graph is “well-behaved”, namely the Markov chain is irreducible and aperiodic. How fast will the random surfer approach this stationary distribution? Mixing time!

Well-behaved graphs. Irreducible: there is a path from every node to every other node (it is an SCC). [Figure: an irreducible graph and a graph that is not irreducible]

Well-behaved graphs. Aperiodic: the GCD of all cycle lengths is 1. The GCD is also called the period. [Figure: an aperiodic graph and a graph with period 3]

About undirected graphs: A connected undirected graph is irreducible. A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution! This makes sense, since the larger the degree of a node, the more likely a random walk is to come back to it.
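The degree claim can be checked numerically on a small example. The graph below (a triangle with one extra pendant node, hence connected and non-bipartite) is a made-up instance. The sketch sets π(i) = degree(i) / (2 · number of edges) and verifies that one step of the uniform random walk leaves π unchanged.

```python
# Hypothetical undirected graph: triangle 0-1-2 plus the pendant edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
n = 4

neighbors = [[] for _ in range(n)]
for u, v in edges:
    neighbors[u].append(v)
    neighbors[v].append(u)

degree = [len(nb) for nb in neighbors]
pi = [d / sum(degree) for d in degree]   # candidate stationary distribution

# One step of the walk: from node j move to each neighbor with probability 1/degree(j).
nxt = [0.0] * n
for j in range(n):
    for i in neighbors[j]:
        nxt[i] += pi[j] / degree[j]

print(pi)
print(nxt)
```

Each neighbor j contributes π(j)/degree(j) = 1/(2m) to node i, so node i receives degree(i)/(2m) in total, which is π(i) again.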

PageRank and HITS. Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Query-independent ordering. First generation: using link counts as simple measures of popularity. Undirected popularity: each page gets a score given by the number of in-links plus the number of out-links (e.g. 3+2=5). Directed popularity: score of a page = number of its in-links (e.g. 3). Easy to spam.
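Both first-generation scores are plain link counts. A sketch on a hypothetical edge list in which page "p" has 3 in-links and 2 out-links, mirroring the 3+2=5 example above (page names and the function name are illustrative):

```python
from collections import Counter


def link_count_scores(edges):
    """edges: list of (source, target). Returns (undirected, directed) popularity dicts."""
    in_links = Counter(dst for _src, dst in edges)
    out_links = Counter(src for src, _dst in edges)
    pages = set(in_links) | set(out_links)
    directed = {p: in_links[p] for p in pages}                   # in-links only
    undirected = {p: in_links[p] + out_links[p] for p in pages}  # in-links + out-links
    return undirected, directed


edges = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "x"), ("p", "y")]
undirected, directed = link_count_scores(edges)
print(undirected["p"], directed["p"])
```

The spam problem is visible in the code: anyone can raise a page's score by adding edges that point to it, since every link counts the same regardless of where it comes from.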

Second generation: PageRank. Exploit the graph structure: each link has its own importance! PageRank is independent of the query. Many interpretations: linear algebra (eigenvectors, eigenvalues); Markov chains (steady-state probability distribution); social interpretation (a sort of voting scheme). Paolo Ferragina, Web Algorithmics

Basic Intuition… [Figure: at each step the random surfer either jumps to a neighbor or jumps to any node; the two moves are labeled with probabilities d and 1-d]

Google’s PageRank [Formula: the PageRank equation, with labels marking the random-jump term and the principal-eigenvector form]. B(i): set of pages linking to i. #out(j): number of outgoing links from j. e: vector of components 1/sqrt{N}. goo.gl/oNkAWJ
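Using the symbols defined above, the usual textbook form of PageRank is r(i) = d · ∑_{j ∈ B(i)} r(j)/#out(j) + (1 − d)/N, where d is assumed to be the probability of following a link and 1 − d the probability of a random jump. This agrees with a vector e of components 1/sqrt{N}, since e e^t has all entries equal to 1/N. The sketch below iterates that equation. The value d = 0.85, the uniform handling of pages without out-links and the toy graph are assumptions, not taken from the slides.

```python
def pagerank(out_links, d=0.85, tol=1e-12, max_iter=1000):
    """out_links: {page: list of pages it links to}; every target must also be a key."""
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by pages without out-links is spread uniformly.
        dangling = sum(r[v] for v in nodes if not out_links[v])
        nxt = {v: (1.0 - d) / n + d * dangling / n for v in nodes}
        for j in nodes:
            for i in out_links[j]:
                nxt[i] += d * r[j] / len(out_links[j])   # j gives r(j)/#out(j) to i
        if max(abs(nxt[v] - r[v]) for v in nodes) < tol:
            return nxt
        r = nxt
    return r


# Hypothetical 3-page web: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Only the relative order of the scores matters for ranking; here C comes first because it collects rank from both A and B.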

PageRank: use in Search Engines. Preprocessing: given the graph of links, build the matrix P; compute its principal eigenvector r; r[i] is the PageRank of page i; we are interested in the relative order. At query time: retrieve the pages containing the query terms; rank them by their PageRank. The final order is query-independent. Note: Apache Giraph is an iterative graph processing system built for high scalability. It is currently used at Facebook. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google (2010). Both systems are inspired by the Bulk Synchronous Parallel model of distributed computation.

(Personalized) PageRank. Fast (approximate) computation: given the Web graph, build the matrix P; compute r = e * P^t for t=0, 1, …; r[i] is the PageRank of page i. Bias the random jump: substitute e = [1 1 … 1] with a preference vector which jumps to preferred pages (topics). If e = e_i, then r[j] = relatedness between node j and node i.
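A sketch of the biased random jump, under the same assumptions as the common textbook formulation: with probability d the surfer follows a link, with probability 1 − d it jumps according to a preference distribution e instead of the uniform one. The graph, d = 0.85, the fixed number of iterations and the function name are illustrative choices.

```python
def personalized_pagerank(out_links, preference, d=0.85, iters=100):
    """preference: {page: non-negative weight}; it is normalized to sum to 1."""
    nodes = sorted(out_links)
    total = sum(preference.get(v, 0.0) for v in nodes)
    e = {v: preference.get(v, 0.0) / total for v in nodes}
    r = dict(e)
    for _ in range(iters):
        nxt = {v: (1.0 - d) * e[v] for v in nodes}     # biased random jump
        for j in nodes:
            targets = out_links[j]
            if targets:
                for i in targets:
                    nxt[i] += d * r[j] / len(targets)  # follow one of j's links
            else:
                for v in nodes:                        # assumption: dead ends jump via e
                    nxt[v] += d * r[j] * e[v]
        r = nxt
    return r


# Hypothetical graph: A -> B -> C -> A, plus D -> A (nothing links to D).
graph = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
r = personalized_pagerank(graph, {"A": 1.0})   # e = e_A: all jumps return to A
print({page: round(score, 4) for page, score in r.items()})
```

With e = e_A the scores read as relatedness to A: D gets score 0 because a walk restarted at A can never reach it, while B and C score lower the farther they are from A.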

CoSimRank to «compare» nodes. Personalized PageRank: A is the adjacency matrix with normalized rows; set d=1. You can normalize it by multiplying by (1-c).

HITS: Hypertext Induced Topic Search

Calculating HITS: It is query-dependent. It produces two scores per page. Authority score: a good authority page for a topic is pointed to by many good hubs for that topic. Hub score: a good hub page for a topic points to many authoritative pages for that topic.

Authority and Hub scores [Figure: nodes 2, 3 and 4 link to node 1; node 1 links to nodes 5, 6 and 7]. a(1) = h(2) + h(3) + h(4); h(1) = a(5) + a(6) + a(7)

HITS: Link Analysis Computation: a = A^t h and h = A a, where a is the vector of authority scores, h is the vector of hub scores, and A is the adjacency matrix in which a_{i,j} = 1 if i → j. Substituting one equation into the other gives h = A A^t h and a = A^t A a, where A A^t and A^t A are symmetric matrices. Thus, h is an eigenvector of A A^t and a is an eigenvector of A^t A.
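The two updates a = A^t h and h = A a can be iterated directly, rescaling after each round so the scores stay bounded. The sketch below uses the 7-node example of the previous slide (2, 3, 4 → 1 and 1 → 5, 6, 7), stored 0-indexed so that index k stands for node k+1. The number of iterations and the Euclidean normalization are assumptions.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def hits(adj, iters=50):
    """adj[i][j] = 1 if i links to j. Returns (authority, hub) score vectors."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a = [sum(adj[j][i] * h[j] for j in range(n)) for i in range(n)]  # a = A^t h
        h = [sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
        a, h = normalize(a), normalize(h)
    return a, h


n = 7
adj = [[0] * n for _ in range(n)]
for src in (1, 2, 3):        # nodes 2, 3, 4 link to node 1
    adj[src][0] = 1
for dst in (4, 5, 6):        # node 1 links to nodes 5, 6, 7
    adj[0][dst] = 1

a, h = hits(adj)
print([round(x, 3) for x in a])
print([round(x, 3) for x in h])
```

Node 1 ends up with the highest authority score, since three hubs point to it, while nodes 5, 6 and 7 have hub score 0 because they link to nothing.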

Weighting links: Weight a link more if the query occurs in the neighborhood of the link (e.g. anchor text).