Quality of a search engine

Presentation on theme: "Quality of a search engine"— Presentation transcript:

1 Quality of a search engine
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

2 Is it good?
How fast does it index: number of documents per hour (and average document size).
How fast does it search: latency as a function of index size.
Expressiveness of the query language.

3 Measures for a search engine
All of the preceding criteria are measurable.
The key measure is user happiness: useless answers won't make a user happy.
User groups are needed for testing!

4 General scenario
[Figure: Venn diagram of the whole collection, with the set of Retrieved documents and the set of Relevant documents overlapping.]

5 Precision vs. Recall
Precision: % of retrieved docs that are relevant [addresses the issue of "junk" found].
Recall: % of relevant docs that are retrieved [addresses the issue of all the "info" found].

6 How to compute them
Precision: fraction of retrieved docs that are relevant.
Recall: fraction of relevant docs that are retrieved.

Precision P = tp / (tp + fp)
Recall    R = tp / (tp + fn)

                 Relevant              Not relevant
Retrieved        tp (true positive)    fp (false positive)
Not retrieved    fn (false negative)   tn (true negative)
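The two formulas above can be sketched directly from the confusion-matrix counts. The counts below are a hypothetical example, not from the slides:

```python
# Precision and recall from confusion-matrix counts.
# Hypothetical query: 8 docs retrieved, 5 of them relevant;
# the collection contains 10 relevant docs in total.
tp = 5   # relevant docs that were retrieved
fp = 3   # retrieved docs that are not relevant
fn = 5   # relevant docs that were missed

precision = tp / (tp + fp)   # fraction of retrieved docs that are relevant
recall = tp / (tp + fn)      # fraction of relevant docs that were retrieved

print(precision)  # 0.625
print(recall)     # 0.5
```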

7 Precision-Recall curve
Measure precision at various levels of recall.
[Figure: precision (y-axis) plotted against recall (x-axis), one measured point per recall level.]

8 A common picture
[Figure: a typical precision-recall curve, with precision decreasing as recall increases.]

9 F measure
Combined measure (weighted harmonic mean):
1/F = α (1/P) + (1 − α)(1/R)
People usually use the balanced F1 measure, i.e. α = ½, thus 1/F = ½ (1/P + 1/R), that is F1 = 2PR / (P + R).
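The weighted harmonic mean above can be sketched as a small function; the function name and the sample values are illustrative, not from the slides:

```python
# Weighted harmonic mean of precision P and recall R:
#   1/F = alpha * (1/P) + (1 - alpha) * (1/R)
# alpha = 1/2 gives the balanced F1 = 2PR / (P + R).
def f_measure(p, r, alpha=0.5):
    if p == 0 or r == 0:
        return 0.0
    return 1.0 / (alpha / p + (1 - alpha) / r)

p, r = 0.625, 0.5
f1 = f_measure(p, r)   # 2*0.625*0.5 / (0.625 + 0.5) = 5/9
```

Note that the harmonic mean punishes imbalance: a system with P = 1 and R near 0 gets an F1 near 0, unlike the arithmetic mean.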

10 Ranking
Link-based Ranking (2nd generation)
Reading: Chapter 21

11 The Web as a Directed Graph (Sec. 21.1)
[Figure: Page A links to Page B via a hyperlink carrying anchor text.]
Assumption 1: a hyperlink between pages denotes author-perceived relevance (quality signal).
Assumption 2: the text in the anchor of the hyperlink describes the target page (textual context).

12 Anchor Text
For the query "ibm", how to distinguish between:
IBM's home page (mostly graphical)
IBM's copyright page (high term frequency for 'ibm')
A rival's spam page (arbitrarily high term frequency)
A million pieces of anchor text containing "ibm" (e.g. "IBM home page", "ibm.com", plain "ibm") send a strong signal.

13 Indexing anchor text
When indexing a document D, include the anchor text of links pointing to D.
[Figure: anchor texts such as "Armonk, NY-based computer giant IBM announced today", "Big Blue today announced record profits for the quarter", and the "IBM" entry of "Joe's computer hardware links" (alongside Sun and HP) all point to IBM's page.]

14 Indexing anchor text
Can sometimes have unexpected side effects (e.g. the query "evil empire").
Can score anchor text with a weight depending on the authority of the anchor page's website.
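The idea of the two slides above can be sketched as a toy inverted index that credits anchor terms to the *target* document, with a tunable weight. All names, documents, and the weight value here are hypothetical:

```python
# Sketch of anchor-text indexing: when indexing document D, also index
# the anchor text of links pointing to D, with a configurable weight.
from collections import defaultdict

def build_index(docs, links, anchor_weight=2.0):
    """docs: {doc_id: text}; links: list of (anchor_text, target_doc_id)."""
    index = defaultdict(lambda: defaultdict(float))  # term -> doc_id -> score
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] += 1.0               # ordinary body-text term
    for anchor, target in links:
        for term in anchor.lower().split():
            index[term][target] += anchor_weight     # anchor term credits target
    return index

docs = {"ibm.com": "welcome", "joes": "computer hardware links"}
links = [("IBM home page", "ibm.com"), ("ibm", "ibm.com")]
index = build_index(docs, links)
# "ibm" now retrieves ibm.com even though its page text never contains it
```

Raising `anchor_weight` based on the authority of the linking site is exactly the scoring refinement the slide suggests.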

15 Random Walks

16 Definitions
Adjacency matrix A: A(i,j) = 1 if there is an edge i → j.
Transition matrix P: each row of A divided by the node's out-degree, so P(i,j) is the probability of moving from i to j (e.g. a node with two out-links gives each of them probability 1/2).
Any edge weighting is possible: any proposals?
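The adjacency-to-transition construction can be sketched as follows, using uniform edge weighting as on the slide; the 3-node graph is a hypothetical example:

```python
# Build the row-stochastic transition matrix P from adjacency matrix A
# by dividing each row by its out-degree (uniform edge weighting;
# other weightings are possible, as the slide asks).
def transition_matrix(A):
    P = []
    for row in A:
        deg = sum(row)                       # out-degree of this node
        P.append([a / deg if deg else 0.0 for a in row])
    return P

# Hypothetical 3-node graph: 1 -> 2, 2 -> 1, 2 -> 3, 3 -> 1
A = [[0, 1, 0],
     [1, 0, 1],
     [1, 0, 0]]
P = transition_matrix(A)
# Node 2 has out-degree 2, so it splits its probability: P[1] == [0.5, 0.0, 0.5]
```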

17–20 What is a random walk
[Figures: a surfer starts at a single node at t = 0; at t = 1, 2, 3 its probability mass spreads along the out-edges according to the transition probabilities (1 and 1/2).]

21 Probability Distributions
x_t(i) = probability that the surfer is at node i at time t.
x_{t+1}(i) = Σ_j (probability of being at node j at time t) · Pr(j → i) = Σ_j x_t(j) · P(j,i)
In vector form: x_{t+1} = x_t P.

22 Probability Distributions
Unrolling the recurrence: x_{t+1} = x_t P = x_{t-1} P² = x_{t-2} P³ = … = x_0 P^{t+1}
What happens when the surfer keeps walking for a long time? The distribution converges to the so-called stationary distribution.
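The repeated multiplication x_{t+1} = x_t P can be sketched as a power iteration; the 3-node chain below is a hypothetical irreducible, aperiodic example, not one from the slides:

```python
# Iterate x_{t+1} = x_t P until the distribution stops changing:
# the fixed point is the stationary distribution (a left eigenvector
# of P with eigenvalue 1).
def stationary(P, x0, iters=1000, tol=1e-12):
    x = x0[:]
    for _ in range(iters):
        nxt = [sum(x[j] * P[j][i] for j in range(len(P)))
               for i in range(len(P))]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x

# Hypothetical chain: 1 -> 2; 2 -> 1 or 3 (prob 1/2 each); 3 -> 1.
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.0, 0.0]]
x = stationary(P, [1.0, 0.0, 0.0])
# converges to (0.4, 0.4, 0.2) regardless of the start distribution
```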

23 Stationary Distribution
The stationary distribution at a node is the proportion of time a random walker spends visiting that node.
It is reached when the distribution no longer changes: x_{T+1} = x_T, i.e. x_T P = 1 · x_T (x_T is a left eigenvector of P with eigenvalue 1).
For "well-behaved" graphs this does not depend on the start distribution, even though x_{t+1} = x_0 P^{t+1}.

24 Interesting questions
Does a stationary distribution always exist? Is it unique? Yes, if the graph is "well-behaved", namely the Markov chain is irreducible and aperiodic.
How fast will the random surfer approach this stationary distribution? This is the mixing time!

25 Well-behaved graphs
Irreducible: there is a path from every node to every other node (i.e. the graph is a single strongly connected component).
[Figure: an irreducible graph vs. a non-irreducible one.]

26 Well-behaved graphs
Aperiodic: the GCD of all cycle lengths is 1 (this GCD is called the period).
[Figure: an aperiodic graph vs. a graph whose period is 3.]

27 About undirected graphs
A connected undirected graph is irreducible.
A connected non-bipartite undirected graph has a stationary distribution proportional to the degree distribution: π(i) = deg(i) / 2m, where m is the number of edges.
This makes sense: the larger the degree of a node, the more likely a random walk is to come back to it.
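The degree-proportional claim can be checked numerically by running the walk on a small graph; the triangle-plus-pendant graph below is a hypothetical example (connected and non-bipartite, as required):

```python
# Check that on a connected, non-bipartite undirected graph the walk
# converges to deg(i) / (2m), where m is the number of edges.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]   # triangle 0-1-2 plus pendant 3
n, m = 4, len(edges)
adj = [[0] * n for _ in range(n)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1              # undirected: symmetric adjacency
deg = [sum(row) for row in adj]
P = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]

x = [1.0, 0.0, 0.0, 0.0]                   # start anywhere; the limit is the same
for _ in range(10000):
    x = [sum(x[j] * P[j][i] for j in range(n)) for i in range(n)]

expected = [d / (2 * m) for d in deg]      # [0.25, 0.25, 0.375, 0.125]
```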

28 PageRank and HITS

29 Query-independent ordering
First generation: using link counts as simple measures of popularity.
Undirected popularity: each page gets a score equal to its number of in-links plus its number of out-links (e.g. 3 + 2 = 5).
Directed popularity: score of a page = number of its in-links (e.g. 3).
Both are easy to spam.

30 Second generation: PageRank
Exploit the graph structure: each link has its own importance!
PageRank is independent of the query and has many interpretations:
Linear algebra: eigenvectors and eigenvalues.
Markov chains: steady-state probability distribution.
Social interpretation: a sort of voting scheme.

31 Basic Intuition…
With probability d the surfer jumps to a random neighbor; with probability 1 − d it jumps to a random node of the graph.

32 Google's PageRank
r(i) = d · Σ_{j ∈ B(i)} r(j) / #out(j) + (1 − d) / N
r is the principal eigenvector of the resulting matrix.
B(i): set of pages linking to i.
#out(j): number of outgoing links from j.
e: vector of components 1/sqrt{N}.
goo.gl/oNkAWJ

33 PageRank: use in Search Engines
Preprocessing: given the graph of links, build the matrix P and compute its principal eigenvector r; r[i] is the PageRank of page i. We are interested in the relative order.
At query time: retrieve the pages containing the query terms and rank them by their PageRank. The final order is query-independent.
Note: Apache Giraph is an iterative graph processing system built for high scalability, currently used at Facebook. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google (2010). Both systems are inspired by the Bulk Synchronous Parallel model of distributed computation.
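The preprocessing step above (building P and computing its principal eigenvector) can be sketched as a power iteration with damping. The 3-page graph and the function name are hypothetical; the damping factor d = 0.85 is the commonly cited default:

```python
# PageRank power iteration: with probability d follow a random out-link,
# with probability 1-d jump to a uniformly random page.
#   r[i] = (1-d)/N + d * sum_{j in B(i)} r[j] / out(j)
def pagerank(links, n, d=0.85, iters=100):
    r = [1.0 / n] * n
    out = [len(links.get(j, [])) for j in range(n)]
    for _ in range(iters):
        nxt = [(1 - d) / n] * n              # teleport mass, spread uniformly
        for j, targets in links.items():
            for i in targets:
                nxt[i] += d * r[j] / out[j]  # page j splits its rank evenly
        r = nxt
    return r

# Hypothetical 3-page web forming a cycle: 0 -> 1 -> 2 -> 0
r = pagerank({0: [1], 1: [2], 2: [0]}, 3)
# by symmetry every page gets rank 1/3
```

This sketch assumes every page has at least one out-link; dangling pages need extra handling (e.g. redistributing their rank uniformly).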

34 (Personalized) PageRank
Fast (approximate) computation: given the Web graph, build the matrix P and compute r = e · P^t for t = 0, 1, …; r[i] is the PageRank of page i.
Bias the random jump: substitute e = [1 1 … 1] with a preference vector which jumps to preferred pages (topics).
If e = e_i, then r[j] = relatedness between node j and node i.
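Biasing the jump is a one-line change to the power iteration: the teleport mass lands on the preference vector instead of being spread uniformly. The graph and names below are hypothetical:

```python
# Personalized PageRank: the random jump lands on preferred pages
# (here always on node 0, i.e. the preference vector is e_0).
# r[j] then measures the relatedness between node j and node 0.
def personalized_pagerank(links, n, pref, d=0.85, iters=200):
    r = pref[:]
    out = {j: len(t) for j, t in links.items()}
    for _ in range(iters):
        nxt = [(1 - d) * pref[i] for i in range(n)]  # biased teleport mass
        for j, targets in links.items():
            for i in targets:
                nxt[i] += d * r[j] / out[j]
        r = nxt
    return r

links = {0: [1], 1: [2], 2: [0]}   # hypothetical 3-node cycle
r = personalized_pagerank(links, 3, [1.0, 0.0, 0.0])
# node 0 (the preferred node) gets the largest score, and scores
# decay with distance from it along the cycle
```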

35 CoSimRank to "compare" nodes
Built on Personalized PageRank: A is the adjacency matrix with normalized rows; set d = 1.
You can normalize the result by multiplying by (1 − c).

36 HITS: Hypertext Induced Topic Search

37 Calculating HITS
It is query-dependent and produces two scores per page:
Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.
Hub score: a good hub page for a topic points to many authoritative pages for that topic.

38 Authority and Hub scores
[Figure: node 1 receives links from nodes 2, 3, 4 and points to nodes 5, 6, 7.]
a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

39 HITS: Link Analysis Computation
a = A^t h and h = A a, where:
a: vector of authority scores.
h: vector of hub scores.
A: adjacency matrix in which A_{i,j} = 1 if i → j.
Substituting one update into the other gives h = A A^t h and a = A^t A a; since A A^t and A^t A are symmetric matrices, h is an eigenvector of A A^t and a is an eigenvector of A^t A.
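The two mutually recursive updates can be sketched as an iteration with per-step normalization; the 3-node graph is a hypothetical example (node 0 acts as a pure hub):

```python
# HITS iteration: a = A^T h, h = A a, normalized each step so the
# scores stay bounded. The fixed points are the principal eigenvectors
# of A^T A (authorities) and A A^T (hubs).
def hits(A, iters=100):
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a = [sum(A[j][i] * h[j] for j in range(n)) for i in range(n)]  # a = A^T h
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
        na = max(a) or 1.0
        nh = max(h) or 1.0
        a = [v / na for v in a]   # normalize so the top score is 1
        h = [v / nh for v in h]
    return a, h

# Hypothetical graph: node 0 points to nodes 1 and 2, which have no out-links.
A = [[0, 1, 1],
     [0, 0, 0],
     [0, 0, 0]]
a, h = hits(A)
# node 0 is the top hub; nodes 1 and 2 are the top authorities
```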

40 Weighting links
Weight a link more if the query occurs in the neighborhood of the link (e.g. in the anchor text).

