Efficient Search Engine Measurements
Maxim Gurevich (Technion), Ziv Bar-Yossef (Technion and Google)

Search Engine Benchmarks
State of the art:
- No objective benchmarks for search engines
- Need to rely on "anecdotal" studies or on subjective search engine reports
- Users, advertisers, and partners cannot compare search engines
Our goal: design search engine benchmarking techniques that are
- Accurate
- Efficient
- Objective
- Transparent

Search Engine Corpus Evaluation
- Corpus size: how many pages are indexed?
- Search engine overlap: what fraction of the pages indexed by search engine A are also indexed by search engine B?
- Freshness: how old are the pages in the index?
- Spam resilience: what fraction of the pages in the index are spam?
- Duplicates: how many unique pages are there in the index?

Search Engine Corpus Metrics
[Diagram: a search engine exposes only a public interface to its index D of indexed documents. Example target functions over D: corpus size, number of unique pages, overlap, average age of a page. Corpus size is the focus of this talk.]

Search Engine Estimators
[Diagram: the estimator interacts with the search engine only through its public interface, submitting queries and receiving the top k results, and outputs an estimate of |D|, the number of indexed documents.]

Success Criteria
Estimation accuracy:
- Bias: E(estimate − |D|)
Amortized cost (cost × variance):
- Amortized query cost
- Amortized fetch cost
- Amortized function cost

Previous Work
Average metrics:
- Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
- Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
- Random queries [BharatBroder98, CheneyPerry05, GulliSignorini05, BarYossefGurevich06, Broder et al 06]
- Random sampling from the web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01]
Sum metrics:
- Random queries [Broder et al 06]

Our Contributions
A new search engine estimator:
- Applicable to both sum metrics and average metrics
- Arbitrary target functions
- Arbitrary target distributions (measures)
- Less bias than the Broder et al estimator: in one experiment, the empirical relative bias was reduced from 75% to 0.01%
- More efficient than the BarYossefGurevich06 estimator: in one experiment, the query cost was reduced by a factor of 375
Techniques:
- Approximate ratio importance sampling
- Rao-Blackwellization

Roadmap
- Recast the Broder et al corpus size estimator as an importance sampling estimator
- Describe the "degree mismatch problem" (DMP)
- Show how to overcome DMP using approximate ratio importance sampling
- Discuss Rao-Blackwellization
- Gloss over some experimental results

Query Pools
Pre-processing step: create a query pool P from a training corpus C of web documents.
- Working example: P = all length-3 phrases that occur in C
- If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be"
- Choose P so that it "covers" most documents in D
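To make the working example concrete, here is a minimal sketch (not from the talk) of building such a pool of length-3 phrases from an in-memory training corpus; the regex word tokenizer is a deliberately naive assumption:

```python
import re

def phrase_pool(corpus_docs, phrase_len=3):
    """Build a query pool: all length-`phrase_len` word phrases in the corpus."""
    pool = set()
    for doc in corpus_docs:
        words = re.findall(r"[a-z0-9]+", doc.lower())   # naive tokenizer
        for i in range(len(words) - phrase_len + 1):
            pool.add(" ".join(words[i:i + phrase_len]))
    return pool

# The working example from the slide:
print(sorted(phrase_pool(["to be or not to be"])))
# ['be or not', 'not to be', 'or not to', 'to be or']
```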

The Search Engine Graph
- P = query pool
- neighbors(q) = { documents returned on query q }; deg(q) = |neighbors(q)|
- neighbors(x) = { queries that return x as a result }; deg(x) = |neighbors(x)|
[Diagram: a bipartite graph between queries ("news", "bbc", "google", "maps") and documents (news.google.com, news.bbc.co.uk, maps.google.com, maps.yahoo.com, en.wikipedia.org/wiki/BBC), with example degrees such as deg("news") = 4, deg("bbc") = 3, and deg(news.bbc.co.uk) = 2.]
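A hypothetical in-memory version of such a graph, just to fix notation (toy data, not the diagram's exact edge set):

```python
from collections import defaultdict

# Query -> documents the engine returns for it (toy data, illustration only).
neighbors_q = {
    "news": ["news.google.com", "news.bbc.co.uk", "maps.google.com", "maps.yahoo.com"],
    "bbc":  ["news.bbc.co.uk", "news.google.com", "en.wikipedia.org/wiki/BBC"],
    "maps": ["maps.google.com", "maps.yahoo.com"],
}

# Invert it: for each document, the queries that return it.
neighbors_x = defaultdict(list)
for q, docs in neighbors_q.items():
    for x in docs:
        neighbors_x[x].append(q)

deg_q = {q: len(docs) for q, docs in neighbors_q.items()}
deg_x = {x: len(qs) for x, qs in neighbors_x.items()}
print(deg_q["news"], deg_x["news.bbc.co.uk"])   # 4 2
```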

Corpus Size as an Integral
E = edges in the queries-documents graph
Lemma (assuming every document in D is matched by at least one pool query): Σ_{(q,x) ∈ E} 1/deg(x) = |D|
Proof:
- Contribution of edge (q,x) to the sum: 1/deg(x)
- Total contribution of the edges incident to x: 1
- Total contribution of all edges: |D|
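A quick self-contained numeric check of the lemma on a toy edge set (purely illustrative):

```python
from collections import Counter

# Toy bipartite graph as a list of (query, document) edges.
edges = [("news", "a.com"), ("news", "b.com"), ("bbc", "b.com"), ("maps", "c.com")]

deg_x = Counter(x for _, x in edges)               # deg(x) = # queries returning x
corpus_size = sum(1 / deg_x[x] for _, x in edges)  # each covered document counts once
print(corpus_size)                                 # 3.0 == |{a.com, b.com, c.com}|
```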

Corpus Size as an Integral (cont.)
Express the corpus size as an "integral" (a weighted sum over edges): |D| = Σ_{(q,x) ∈ E} f(q,x) · π(q,x)
- Target measure: π(q,x) = 1/deg(x)
- Target function: f(q,x) = 1

Monte Carlo Estimation
Monte Carlo estimation of the integral:
- Sample (Q,X) according to π
- Output f(Q,X)
This works only if:
- π is a proper distribution
- We can easily sample from π
BUT, in our case π is not a distribution, and even if it were, sampling from π ∝ 1/deg(x) may not be easy.
So instead, we sample (Q,X) from an easy "trial distribution" p.

Sampling Edges, Easily
- Pick a uniformly random query Q from the pool P and submit it to the search engine
- Take the top k results and let X be a uniformly random one of them
- This samples an edge (q,x) with probability p(q,x) = 1/(|P| · deg(q))
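A sketch of this trial-distribution sampler; `search(q)` is a placeholder for the engine's public interface, not a real API, and queries with no results are simply rejected and retried:

```python
import random

def sample_edge(pool, search, rng=random):
    """Sample an edge (Q, X) with probability 1 / (|P| * deg(Q))."""
    while True:
        q = rng.choice(pool)                  # uniform query from P
        results = search(q)                   # neighbors(q): the returned results
        if results:
            return q, rng.choice(results)     # uniform result of Q
```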

Importance Sampling (IS) [Marshall56]
We have: a sample (Q,X) from p. We need: an estimate of the integral.
So we cannot use simple Monte Carlo estimation; importance sampling comes to the rescue…
Compute an "importance weight" for (Q,X): w(Q,X) = π(Q,X) / p(Q,X) = (|P| · deg(Q)) / deg(X)
Importance sampling estimator: IS(Q,X) = f(Q,X) · w(Q,X)
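Combining the sampler and the weight gives a hypothetical end-to-end estimator of |D|; it assumes document degrees are known exactly (the assumption the next slides remove), and `search` / `deg_doc` are placeholder callables:

```python
import random

def is_corpus_size(pool, search, deg_doc, n_samples=1000, rng=random):
    """Importance-sampling estimate of |D|: average of |P| * deg(Q) / deg(X)."""
    total = 0.0
    for _ in range(n_samples):
        while True:                           # sample an edge from p
            q = rng.choice(pool)
            results = search(q)
            if results:
                break
        x = rng.choice(results)
        # f(Q,X) = 1, importance weight = pi/p = (|P| * deg(Q)) / deg(X)
        total += len(pool) * len(results) / deg_doc(x)
    return total / n_samples
```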

IS: Bias Analysis

Computing the Importance Sampling Estimator
We need to compute IS(Q,X) = (|P| · deg(Q)) / deg(X).
- Computing |P| is easy – we know P
- How to compute deg(Q) = |neighbors(Q)|? Since Q was submitted to the search engine, we know deg(Q)
- How to compute deg(X) = |neighbors(X)|? Fetch the content of X from the web; let pdeg(X) = the number of distinct queries from P that X contains; use pdeg(X) as an estimate for deg(X)
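A small sketch of computing pdeg(x) for a fetched page, reusing the same naive length-3-phrase tokenization as in the query-pool sketch above (illustrative only):

```python
import re

def pdeg(page_text, pool, phrase_len=3):
    """Predicted degree: distinct pool phrases that occur in the page's text."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    phrases = {" ".join(words[i:i + phrase_len])
               for i in range(len(words) - phrase_len + 1)}
    return len(phrases & set(pool))
```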

The Degree Mismatch Problem (DMP)
In reality, pdeg(X) may differ from deg(X).
Neighbor recall problem: there may be q ∈ neighbors(x) that does not occur in x
- q occurs as "anchor text" in a page linking to x
- q occurs in x, but our parser failed to find it
Neighbor precision problem: there may be q that occurs in x, but q ∉ neighbors(x)
- q "overflows" (returns too many results, so x is not among them)
- q occurs in x, but the search engine's parser failed to find it

Implications of DMP
- We can only approximate document degrees
- The bias of the importance sampling estimator may become significant
- In one of our experiments, the relative bias was 75%

Eliminating the Neighbor Recall Problem
The predicted search engine graph:
- pneighbors(x) = queries that occur in x
- pneighbors(q) = documents in whose text q occurs
An edge (q,x) is "valid" if it occurs both in the search engine graph and in the predicted search engine graph.
The valid search engine graph:
- vneighbors(x) = neighbors(x) ∩ pneighbors(x)
- vneighbors(q) = neighbors(q) ∩ pneighbors(q)
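For a query that has just been submitted, its valid neighbors are easy to compute: keep only the returned results whose fetched text actually contains the query. A hedged sketch with hypothetical `fetch` and `contains` helpers:

```python
def vneighbors_query(q, results, fetch, contains):
    """Valid neighbors of q: returned results whose fetched text contains q.

    results  -- neighbors(q), the documents the engine returned for q
    fetch    -- document -> page text (hypothetical helper)
    contains -- (query, text) -> bool, using the same parser as pdeg
    """
    return [x for x in results if contains(q, fetch(x))]
```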

Eliminating the Neighbor Recall Problem (cont.)
We use the valid search engine graph rather than the real search engine graph:
- vdeg(q) = |vneighbors(q)|
- vdeg(x) = |vneighbors(x)|
- P+ = queries q in P with vdeg(q) > 0
- D+ = documents x in D with vdeg(x) > 0
Assuming D+ = D, then E(IS(Q,X)) = |D|

Approximate Importance Sampling (AIS)
We need to compute:
- vdeg(Q) – easy
- vdeg(X) – hard
- |P+| – hard
We therefore approximate |P+| by |P| and vdeg(X) via pdeg(X):
AIS(Q,X) = (|P| · vdeg(Q) / pdeg(X)) · IVD(X), where
IVD(X) = an unbiased probabilistic estimator for pdeg(X)/vdeg(X)

Estimating pdeg(x)/vdeg(x)
Given: a document x. Want: an estimate of pdeg(x) / vdeg(x).
Geometric estimation:
  n ← 1
  forever do
    choose a random phrase Q that occurs in content(x)
    send Q to the search engine
    if x ∈ neighbors(Q), return n
    n ← n + 1
The probability of hitting a "valid" query is vdeg(x) / pdeg(x), so the expected number of iterations (and hence the expected return value) is pdeg(x) / vdeg(x).
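A direct, runnable rendering of this geometric estimator; `pool_phrases_in` and `search` are assumed helpers standing in for the parser and the engine's interface:

```python
import random

def ivd(x, pool_phrases_in, search, rng=random):
    """Geometric estimator IVD(x): unbiased for pdeg(x) / vdeg(x)."""
    phrases = pool_phrases_in(x)        # pneighbors(x); assumed non-empty
    n = 1
    while True:
        q = rng.choice(phrases)         # uniform predicted neighbor of x
        if x in search(q):              # valid edge: the engine returns x for q
            return n
        n += 1
```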

Approximate Importance Sampling: Bias Analysis
Lemma: the multiplicative bias of AIS(Q,X) is |P| / |P+|.

Approximate Importance Sampling: Bias Elimination
How do we eliminate the bias in AIS?
- Estimate the bias |P|/|P+|
- Divide AIS by this estimate
Well, this doesn't quite work: the expected ratio ≠ the ratio of expectations.
So we use a standard trick from the estimation of ratio statistics and divide the sample mean of the AIS values by the sample mean of BE, where BE = an estimator of |P|/|P+|.

Bias Analysis Theorem:

Estimating |P|/|P+|
Also by geometric estimation:
  n ← 1
  forever do
    choose a random query Q from P
    send Q to the search engine
    if vdeg(Q) > 0, return n
    n ← n + 1
The probability of hitting a "valid" query is |P+| / |P|, so the expected number of iterations is |P| / |P+|.
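The matching sketch for BE, with the same hypothetical `search` and valid-degree helpers as before:

```python
import random

def be(pool, search, vdeg_query, rng=random):
    """Geometric estimator BE: unbiased for |P| / |P+|."""
    n = 1
    while True:
        q = rng.choice(pool)                        # uniform query from P
        results = search(q)
        if results and vdeg_query(q, results) > 0:  # Q is in P+
            return n
        n += 1
```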

Recap
1. Sample valid edges (Q1,X1), …, (Qn,Xn) from p
2. Compute vdeg(Qi) for each query Qi
3. Compute pdeg(Xi) for each document Xi
4. Estimate IVD(Xi) = pdeg(Xi)/vdeg(Xi) for each Xi
5. Compute AIS(Qi,Xi) = (|P| · vdeg(Qi) / pdeg(Xi)) · IVD(Xi) for each sample
6. Estimate the expected bias: BEi ≈ |P|/|P+|
7. Output SizeEstimator = (Σi AIS(Qi,Xi)) / (Σi BEi)
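A compact sketch wiring the previous pieces into these seven steps; every callable (`sample_valid_edge`, `vdeg_query`, `pdeg_doc`, `ivd_doc`, `be_sample`) is an assumed helper in the spirit of the earlier sketches, not code from the talk:

```python
def size_estimator(pool, sample_valid_edge, vdeg_query, pdeg_doc, ivd_doc, be_sample, n=100):
    """Ratio estimator for |D|: sum of AIS samples over sum of BE samples."""
    ais_total, be_total = 0.0, 0.0
    for _ in range(n):
        q, x, results = sample_valid_edge()                                 # step 1
        ais_total += len(pool) * vdeg_query(q, results) / pdeg_doc(x) * ivd_doc(x)  # steps 2-5
        be_total += be_sample()                                             # step 6
    return ais_total / be_total                                             # step 7
```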

Rao-Blackwellization
Question: we currently use only one (random) result of each query submitted to the search engine. Can we also use the rest?
Rao & Blackwell: sure! Use them as additional samples; it can only help.
The Rao-Blackwellized AIS estimator averages AIS over all valid results of the query:
AIS_RB(Q) = (1/vdeg(Q)) · Σ_{x ∈ vneighbors(Q)} AIS(Q,x)
(Recall: AIS(Q,x) = (|P| · vdeg(Q) / pdeg(x)) · IVD(x).)
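A sketch of the Rao-Blackwellized per-query estimate under the same assumed helpers as above:

```python
def ais_rb(q, valid_results, pool_size, pdeg_doc, ivd_doc):
    """Average the per-edge AIS estimate over all valid results of the query Q."""
    vdeg_q = len(valid_results)
    total = sum(pool_size * vdeg_q / pdeg_doc(x) * ivd_doc(x) for x in valid_results)
    return total / vdeg_q          # conditional expectation of AIS given Q
```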

RB-AIS: Analysis
The Rao-Blackwell Theorem:
- AIS_RB has exactly the same bias as AIS
- The variance of AIS_RB can only be lower
- The variance reduction is significant when query results are sufficiently "variable"
Now, use AIS_RB instead of AIS in SizeEstimator.

Corpus Size, Bias Comparison

Corpus Size, Query Cost Comparison

Corpus Size Estimations for 3 Major Search Engines

Thank You

Average Metric, Bias Comparison

Average Metric, Query Cost Comparison