Efficient Search Engine Measurements
Maxim Gurevich (Technion), Ziv Bar-Yossef (Technion and Google)

Search Engine Benchmarks
- State of the art:
  - No objective benchmarks for search engines
  - Need to rely on "anecdotal" studies or on subjective search engine reports
  - Users, advertisers, partners cannot compare search engines
- Our goal: design search engine benchmarking techniques that are
  - Accurate
  - Efficient
  - Objective
  - Transparent

Search Engine Corpus Evaluation
- Corpus size: How many pages are indexed?
- Search engine overlap: What fraction of the pages indexed by search engine A are also indexed by search engine B?
- Freshness: How old are the pages in the index?
- Spam resilience: What fraction of the pages in the index are spam?
- Duplicates: How many unique pages are there in the index?

Search Engine Corpus Metrics
- [Diagram: the Web; a Search Engine with its Index and Public Interface; D = Indexed Documents]
- Metrics, each defined by a target function:
  - Corpus size
  - Number of unique pages
  - Overlap
  - Average age of a page
- [Diagram label: Focus of this talk]

Search Engine Estimators
- [Diagram: the Estimator sends Queries to the Search Engine's Public Interface, receives the Top k results, and outputs an Estimate of |D|]
- D = Indexed Documents

Success Criteria
- Estimation accuracy:
  - Bias: E(Estimate - |D|)
- Amortized cost (cost times variance):
  - Amortized query cost
  - Amortized fetch cost
  - Amortized function cost

Previous Work
- Average metrics:
  - Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
  - Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
  - Random queries [BharatBroder98, CheneyPerry05, GulliSignorini05, BarYossefGurevich06, Broder et al 06]
  - Random sampling from the web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
- Sum metrics:
  - Random queries [Broder et al 06]

Our Contributions
- A new search engine estimator
  - Applicable to both sum metrics and average metrics
  - Arbitrary target functions
  - Arbitrary target distributions (measures)
- Less bias than the Broder et al estimator
  - In one experiment, empirical relative bias was reduced from 75% to 0.01%
- More efficient than the BarYossefGurevich06 estimator
  - In one experiment, query cost was reduced by a factor of 375
- Techniques
  - Approximate ratio importance sampling
  - Rao-Blackwellization

Roadmap
- Recast the Broder et al corpus size estimator as an importance sampling estimator
- Describe the "degree mismatch problem" (DMP)
- Show how to overcome DMP using approximate ratio importance sampling
- Discuss Rao-Blackwellization
- Gloss over some experimental results

Query Pools
- Pre-processing step: create a query pool P (queries q1, q2, …) from a training corpus C of web documents
- Working example: P = all length-3 phrases that occur in C
  - If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
- Choose P that "covers" most documents in D
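The working example can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization and lowercasing; `phrase_pool` is a hypothetical helper name, not something from the talk.

```python
def phrase_pool(corpus, n=3):
    """Collect every length-n word phrase that occurs in any corpus document.

    Assumption: documents are plain strings and words are split on whitespace.
    """
    pool = set()
    for text in corpus:
        words = text.lower().split()
        # Slide a window of n consecutive words over the document.
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool


pool = phrase_pool(["to be or not to be"])
print(sorted(pool))
```

A real pool would be built from a large training corpus so that it "covers" most documents in D.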

The Search Engine Graph
- P = query pool
- neighbors(q) = { documents returned on query q }
- deg(q) = |neighbors(q)|
- neighbors(x) = { queries that return x as a result }
- deg(x) = |neighbors(x)|
- [Figure: bipartite graph with queries "news", "bbc", "google", "maps" and documents www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.bbc.co.uk, en.wikipedia.org/wiki/BBC, www.google.com, maps.google.com, maps.yahoo.com, www.mapquest.com]
- deg("news") = 4, deg("bbc") = 3
- deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2
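A small sketch of the queries-documents graph follows. The edge assignment in `RESULTS` is an assumption inferred from the node labels and the four degrees quoted on the slide; the original figure's edges are not recoverable from the transcript.

```python
from collections import defaultdict

# Assumed adjacency: query -> documents returned (consistent with the quoted degrees).
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "maps.yahoo.com", "www.mapquest.com"],
}


def document_neighbors(results):
    """Invert the query -> documents map into document -> set of queries."""
    inverse = defaultdict(set)
    for query, docs in results.items():
        for doc in docs:
            inverse[doc].add(query)
    return dict(inverse)


doc_nb = document_neighbors(RESULTS)
deg_q = {q: len(docs) for q, docs in RESULTS.items()}   # deg(q) = |neighbors(q)|
deg_x = {x: len(qs) for x, qs in doc_nb.items()}        # deg(x) = |neighbors(x)|
print(deg_q["news"], deg_q["bbc"], deg_x["www.cnn.com"], deg_x["news.bbc.co.uk"])
```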

Corpus Size as an Integral
- E = edges in the queries-documents graph
- Lemma: $\sum_{(q,x) \in E} \frac{1}{\deg(x)} = |D|$
- Proof:
  - Contribution of edge (q,x) to the sum: 1/deg(x)
  - Total contribution of the deg(x) edges incident to x: 1
  - Total contribution of all edges: |D|
- [Figure: the graph restricted to queries "news", "bbc", "google" and their result documents]
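The lemma is easy to check numerically on a toy graph. The sketch below uses exact fractions; the adjacency in `RESULTS` is an illustrative assumption, and the check only holds for documents that have at least one incident edge.

```python
from fractions import Fraction

# Assumed toy adjacency: query -> documents returned.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
}

# deg(x) = number of queries that return x.
deg_x = {}
for docs in RESULTS.values():
    for x in docs:
        deg_x[x] = deg_x.get(x, 0) + 1

# Sum 1/deg(x) over every edge (q, x); each document contributes exactly 1 in total.
edge_sum = sum(Fraction(1, deg_x[x]) for docs in RESULTS.values() for x in docs)
num_documents = len(deg_x)
print(edge_sum, num_documents)
```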

Corpus Size as an Integral
- Express corpus size as an integral: $|D| = \sum_{(q,x) \in E} \pi(q,x)\, f(q,x)$
- Target measure: π(q,x) = 1/deg(x)
- Target function: f(q,x) = 1

Monte Carlo Estimation
- Monte Carlo estimation of the integral:
  - Sample (Q,X) according to π
  - Output f(Q,X)
- Works only if:
  - π is a proper distribution
  - We can easily sample from π
- BUT, in our case:
  - π is not a distribution
  - Even if it were, sampling from π(q,x) = 1/deg(x) may not be easy
- So instead, we sample (Q,X) from an easy "trial distribution" p

Sampling Edges, Easily
- [Diagram: the Sampler picks a random query Q from P, sends it to the Search Engine, and receives the Top k results]
- X = a random result of Q
- Output the edge (Q,X)
- This samples an edge (q,x) with probability p(q,x) = 1/(|P| · deg(q))

Importance Sampling (IS) [Marshal56]
- We have: a sample (Q,X) from p
- We need: an estimate of the integral
- So we cannot use simple Monte Carlo estimation
- Importance sampling comes to the rescue…
- Compute an "importance weight" for (Q,X): $w(Q,X) = \frac{\pi(Q,X)}{p(Q,X)}$
- Importance sampling estimator: $IS(Q,X) = f(Q,X) \cdot w(Q,X)$
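As a sketch of how the estimator behaves when all degrees are known exactly, the code below samples edges from the trial distribution p(q,x) = 1/(|P| · deg(q)) and weights them by π/p with π(q,x) = 1/deg(x) and f = 1, which gives the weight |P| · deg(Q)/deg(X). The toy adjacency and the helper names are assumptions made for illustration, and every query is assumed to return at least one result.

```python
import random
from fractions import Fraction

# Assumed toy adjacency: query -> documents returned.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "maps.yahoo.com", "www.mapquest.com"],
}


def doc_degrees(results):
    """deg(x) = number of queries in the pool that return x."""
    deg = {}
    for docs in results.values():
        for x in docs:
            deg[x] = deg.get(x, 0) + 1
    return deg


def is_sample(results, deg_x, rng):
    """Draw (Q, X) from p and return the weight pi/p = |P| * deg(Q) / deg(X)."""
    q = rng.choice(sorted(results))        # uniform query from the pool
    x = rng.choice(results[q])             # uniform result of that query
    return Fraction(len(results) * len(results[q]), deg_x[x])


def exact_expectation(results, deg_x):
    """Enumerate all edges to compute E_p[IS] exactly."""
    pool_size = len(results)
    total = Fraction(0)
    for q, docs in results.items():
        for x in docs:
            p = Fraction(1, pool_size * len(docs))
            total += p * Fraction(pool_size * len(docs), deg_x[x])
    return total


deg_x = doc_degrees(RESULTS)
expected = exact_expectation(RESULTS, deg_x)
rng = random.Random(0)
samples = [is_sample(RESULTS, deg_x, rng) for _ in range(20000)]
monte_carlo = float(sum(samples) / len(samples))
print(expected, len(deg_x), monte_carlo)
```

The exact expectation equals the number of documents in the toy graph; the Monte Carlo average only approaches it as the number of samples grows.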

IS: Bias Analysis
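A short derivation of the unbiasedness claim, reconstructed from the definitions on the preceding slides rather than taken from the original slide. It uses the target measure π(q,x) = 1/deg(x), the target function f = 1, and the trial distribution p(q,x) = 1/(|P| · deg(q)), and it assumes every query in P returns at least one result so that p is a proper distribution over E:

```latex
\begin{align*}
\mathbb{E}_p\big[\mathrm{IS}(Q,X)\big]
  &= \sum_{(q,x) \in E} p(q,x)\,\frac{\pi(q,x)}{p(q,x)}\,f(q,x) \\
  &= \sum_{(q,x) \in E} \pi(q,x)\,f(q,x)
   = \sum_{(q,x) \in E} \frac{1}{\deg(x)}
   = |D|.
\end{align*}
```

So with exact degrees the estimator has zero bias; any bias comes from errors in the degrees used to compute the weight.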

Computing the Importance Sampling Estimator
- We need to compute IS(Q,X) = |P| · deg(Q) / deg(X)
- Computing |P| is easy: we know P
- How to compute deg(Q) = |neighbors(Q)|?
  - Since Q was submitted to the search engine, we know deg(Q)
- How to compute deg(X) = |neighbors(X)|?
  - Fetch the content of X from the web
  - pdeg(X) = number of distinct queries from P that X contains
  - Use pdeg(X) as an estimate for deg(X)
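A minimal sketch of the predicted degree pdeg(X) for a pool of length-3 phrases, assuming the fetched page has already been reduced to plain text and that whitespace tokenization matches the one used to build the pool. The function name and the sample pool are illustrative assumptions.

```python
def pdeg(text, pool, n=3):
    """Count the distinct pool queries (length-n phrases) that occur in the text."""
    words = text.lower().split()
    phrases = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return len(phrases & pool)


# Hypothetical pool: four phrases from the working example plus one unrelated phrase.
POOL = {"to be or", "be or not", "or not to", "not to be", "all the world"}
print(pdeg("To be or not to be", POOL))
```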

The Degree Mismatch Problem (DMP)
- In reality, pdeg(X) may be different from deg(X)
- Neighbor recall problem: there may be q ∈ neighbors(x) that do not occur in x
  - q occurs as "anchor text" in a page linking to x
  - q occurs in x, but our parser failed to find it
- Neighbor precision problem: there may be q that occur in x, but q ∉ neighbors(x)
  - q "overflows" (it matches more documents than the search engine returns)
  - q occurs in x, but the search engine's parser failed to find it

Implications of DMP
- Can only approximate document degrees
- Bias of the importance sampling estimator may become significant
  - In one of our experiments, relative bias was 75%

Eliminating the Neighbor Recall Problem
- The predicted search engine graph:
  - pneighbors(x) = queries that occur in x
  - pneighbors(q) = documents in whose text q occurs
- An edge (q,x) is "valid" if it occurs both in the search engine graph and in the predicted search engine graph
- The valid search engine graph:
  - vneighbors(x) = neighbors(x) ∩ pneighbors(x)
  - vneighbors(q) = neighbors(q) ∩ pneighbors(q)
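The valid graph is a plain set intersection, as the sketch below shows for the document side. The two small neighbor maps are invented for illustration: "news" reaches news.bbc.co.uk without occurring in its text (a recall case), and "radio" occurs in the text without the engine returning the page (a precision case).

```python
def valid_neighbors(neighbors, pneighbors):
    """vneighbors(node) = neighbors(node) ∩ pneighbors(node) for every node."""
    nodes = neighbors.keys() | pneighbors.keys()
    return {node: neighbors.get(node, set()) & pneighbors.get(node, set())
            for node in nodes}


# Hypothetical document-side maps: document -> set of queries.
neighbors_x = {"news.bbc.co.uk": {"news", "bbc"}, "www.cnn.com": {"news"}}
pneighbors_x = {"news.bbc.co.uk": {"bbc", "radio"}, "www.cnn.com": {"news"}}

vneighbors_x = valid_neighbors(neighbors_x, pneighbors_x)
vdeg_x = {x: len(qs) for x, qs in vneighbors_x.items()}
print(vneighbors_x["news.bbc.co.uk"], vdeg_x)
```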

Eliminating the Neighbor Recall Problem (cont.)
- We use the valid search engine graph rather than the real search engine graph:
  - vdeg(q) = |vneighbors(q)|
  - vdeg(x) = |vneighbors(x)|
  - P+ = queries q in P with vdeg(q) > 0
  - D+ = documents x in D with vdeg(x) > 0
- Assuming D+ = D, then E(IS(Q,X)) = |D|

Approximate Importance Sampling (AIS)
- We need to compute:
  - vdeg(Q): easy
  - vdeg(X): hard
  - |P+|: hard
- We therefore approximate |P+| and vdeg(X):
  - IVD(X) = unbiased probabilistic estimator for pdeg(X)/vdeg(X)
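One reading of the approximation that is consistent with the surrounding slides, offered as a reconstruction rather than the authors' exact formula: the exact importance weight on the valid graph would be |P+| · vdeg(Q)/vdeg(X); replacing |P+| by the known |P| and 1/vdeg(X) by IVD(X)/pdeg(X) gives

```latex
\[
\mathrm{AIS}(Q,X) \;=\; |P| \cdot \mathrm{vdeg}(Q) \cdot \frac{\mathrm{IVD}(X)}{\mathrm{pdeg}(X)},
\qquad
\mathbb{E}\big[\mathrm{IVD}(X) \,\big|\, X\big] \;=\; \frac{\mathrm{pdeg}(X)}{\mathrm{vdeg}(X)} .
\]
```

Under this reading the only systematic error left is the factor |P|/|P+| introduced by using |P| in place of |P+|.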

Estimating pdeg(x)/vdeg(x)
- Given: a document x
- Want: an estimate of pdeg(x)/vdeg(x)
- Geometric estimation:
    n = 1
    forever do
      Choose a random phrase Q that occurs in content(x)
      Send Q to the search engine
      If x ∈ neighbors(Q), return n
      n ← n + 1
- Probability to hit a "valid" query: vdeg(x)/pdeg(x)
- So, expected number of iterations: pdeg(x)/vdeg(x)
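The loop can be simulated without a real search engine. In this sketch the engine is replaced by a hypothetical predicate `returns_x(q)` that says whether query q returns the document; the phrase list and the valid subset are made up. The loop assumes vdeg(x) > 0, otherwise it never terminates.

```python
import random


def geometric_estimate(phrases_in_x, returns_x, rng):
    """Count trials until a random phrase of x is a query that returns x."""
    n = 1
    while True:
        q = rng.choice(phrases_in_x)   # random phrase occurring in content(x)
        if returns_x(q):               # stands in for: x ∈ neighbors(Q)
            return n
        n += 1


# Hypothetical document with pdeg(x) = 10 phrases, of which vdeg(x) = 4 are valid.
phrases = [f"phrase {i}" for i in range(10)]
valid = set(phrases[:4])

rng = random.Random(1)
trials = [geometric_estimate(phrases, valid.__contains__, rng) for _ in range(20000)]
average = sum(trials) / len(trials)
print(average)  # should be close to pdeg(x)/vdeg(x) = 2.5
```

Each call costs as many search engine queries as the value it returns, which is why documents with few valid queries are expensive to process.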

Approximate Importance Sampling: Bias Analysis
- Lemma: the multiplicative bias of AIS(Q,X) is |P|/|P+|

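Approximate Importance Sampling: Bias Elimination
- How to eliminate the bias in AIS?
  - Estimate the bias |P|/|P+|
  - Divide AIS by this estimate
- Well, this doesn't quite work:
  - Expected ratio ≠ ratio of expectations
- So, use a standard trick in estimation of ratio statistics, dividing an average of n AIS samples by an average of n bias estimates:
  - $SizeEstimator = \frac{\frac{1}{n}\sum_{i=1}^{n} AIS(Q_i, X_i)}{\frac{1}{n}\sum_{i=1}^{n} BE_i}$
  - BE = estimator of |P|/|P+|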

Bias Analysis Theorem:

Estimating |P|/|P+|
- Also by geometric estimation:
    n = 1
    forever do
      Choose a random query Q from P
      Send Q to the search engine
      If vdeg(Q) > 0, return n
      n ← n + 1
- Probability to hit a "valid" query: |P+|/|P|
- So, expected number of iterations: |P|/|P+|

Recap
1. Sample valid edges (Q_1,X_1),…,(Q_n,X_n) from p
2. Compute vdeg(Q_i) for each query Q_i
3. Compute pdeg(X_i) for each document X_i
4. Estimate IVD(X_i) = pdeg(X_i)/vdeg(X_i) for each X_i
5. Compute AIS(Q_i,X_i) for each sample
6. Estimate the expected bias: BE_i, an estimate of |P|/|P+|
7. Output the ratio of the average AIS to the average BE
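The steps above can be sketched end to end on a toy instance. Everything concrete here is an assumption made for illustration: the five-query pool, the document texts, the engine's answers, the reading AIS = |P| · vdeg(Q) · IVD(X)/pdeg(X), the use of the rejection count as the bias estimate BE, and the output as a ratio of sums. It is a simulation of the idea, not the authors' implementation.

```python
import random

POOL = ["a", "b", "c", "d", "e"]
# Queries that occur in each document's text (the predicted graph).
DOC_TEXT = {"d1": ["a", "b"], "d2": ["a"], "d3": ["b", "c"], "d4": ["c", "d"]}
# What the toy engine returns; "b" misses d3, "d" returns d1 via anchor text.
ENGINE = {"a": ["d1", "d2"], "b": ["d1"], "c": ["d3", "d4"], "d": ["d4", "d1"], "e": []}


def search(q):
    return ENGINE[q]


def valid_results(q):
    """vneighbors(q): returned documents whose text actually contains q."""
    return [x for x in search(q) if q in DOC_TEXT[x]]


def sample_valid_edge(rng):
    """Steps 1, 2, 6: rejection-sample a valid edge; the attempt count estimates |P|/|P+|."""
    attempts = 0
    while True:
        attempts += 1
        q = rng.choice(POOL)
        vres = valid_results(q)
        if vres:
            return rng.choice(vres), len(vres), attempts


def ivd(x, rng):
    """Step 4: geometric estimate of pdeg(x)/vdeg(x)."""
    n = 1
    while True:
        if x in search(rng.choice(DOC_TEXT[x])):
            return n
        n += 1


def size_estimator(n, rng):
    ais_sum = be_sum = 0.0
    for _ in range(n):
        x, vdeg_q, attempts = sample_valid_edge(rng)
        pdeg_x = len(DOC_TEXT[x])                              # step 3
        ais_sum += len(POOL) * vdeg_q * ivd(x, rng) / pdeg_x   # step 5
        be_sum += attempts
    return ais_sum / be_sum                                    # step 7


estimate = size_estimator(20000, random.Random(0))
print(estimate)  # the toy corpus has 4 documents
```

In this toy instance every document has at least one valid query, so D+ = D and the estimate should settle near the true corpus size.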

Rao-Blackwellization
- Question: we currently use only one (random) result for each query submitted to the search engine. Can we also use the rest?
- Rao & Blackwell: Sure! Use them as additional samples. It can only help!
- The Rao-Blackwellized AIS estimator averages AIS over all valid results of Q:
  - $AIS_{RB}(Q) = \frac{1}{vdeg(Q)} \sum_{x \in vneighbors(Q)} AIS(Q, x)$

RB-AIS: Analysis
- The Rao-Blackwell Theorem:
  - AIS_RB has exactly the same bias as AIS
  - The variance of AIS_RB can only be lower
  - Variance is reduced if query results are sufficiently "variable"
- Now, use AIS_RB instead of AIS in SizeEstimator
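The effect can be checked exactly on a toy graph. For simplicity this sketch applies Rao-Blackwellization to the plain importance sampling weight |P| · deg(Q)/deg(X) with known degrees, not to AIS itself, and it assumes the Rao-Blackwellized value for a query is the average weight over all of its results. The adjacency is an illustrative assumption.

```python
from fractions import Fraction

# Assumed toy adjacency: query -> documents returned.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com", "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "maps.yahoo.com", "www.mapquest.com"],
}


def doc_degrees(results):
    deg = {}
    for docs in results.values():
        for x in docs:
            deg[x] = deg.get(x, 0) + 1
    return deg


def weight(results, deg_x, q, x):
    """Importance weight |P| * deg(q) / deg(x)."""
    return Fraction(len(results) * len(results[q]), deg_x[x])


def moments(outcomes):
    """Exact mean and variance of a list of (value, probability) pairs."""
    mean = sum(p * v for v, p in outcomes)
    var = sum(p * (v - mean) ** 2 for v, p in outcomes)
    return mean, var


deg_x = doc_degrees(RESULTS)
pool_size = len(RESULTS)

# Plain estimator: one random result per query.
plain = [(weight(RESULTS, deg_x, q, x), Fraction(1, pool_size * len(docs)))
         for q, docs in RESULTS.items() for x in docs]
# Rao-Blackwellized estimator: average the weight over all results of the query.
rb = [(sum(weight(RESULTS, deg_x, q, x) for x in docs) / len(docs), Fraction(1, pool_size))
      for q, docs in RESULTS.items()]

plain_mean, plain_var = moments(plain)
rb_mean, rb_var = moments(rb)
print(plain_mean, plain_var, rb_mean, rb_var)
```

Both estimators have the same mean, and the averaged one has the smaller variance because the within-query spread of the weights is removed.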

Corpus Size, Bias Comparison

Corpus Size, Query Cost Comparison

Corpus Size Estimations for 3 Major Search Engines

Thank You

Average Metric, Bias Comparison

Average Metric, Query Cost Comparison
