1 Random Sampling from a Search Engine's Index Ziv Bar-Yossef, Maxim Gurevich Department of Electrical Engineering, Technion

2 Search Engine Samplers
[Diagram: the sampler interacts with the search engine only through its public interface, issuing queries and receiving the top-k results; the engine indexes a set D of documents from the web, and the sampler's goal is to output a random document x ∈ D.]

3 Motivation
Useful tool for search engine evaluation:
- Freshness: fraction of up-to-date pages in the index
- Topical bias: identification of overrepresented/underrepresented topics
- Spam: fraction of spam pages in the index
- Security: fraction of pages in the index infected by viruses/worms/trojans
- Relative size: number of documents indexed compared with other search engines

4 Size Wars
August 2005: "We index 20 billion documents."
September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's."
So, who's right?

5 Why Does Size Matter, Anyway?
- Comprehensiveness: a good crawler covers as many documents as possible
- Narrow-topic queries: e.g., get the homepage of John Doe
- Prestige: a marketing advantage

6 Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorini05]
Sample pages uniformly at random from the search engine's index.
Two alternatives:
- Absolute size estimation: sample until a collision occurs; a collision is expected after k ~ N^(1/2) random samples (birthday paradox), so return k² as the estimate.
- Relative size estimation: check how many samples from search engine A are present in search engine B, and vice versa.
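
Below is a minimal Python sketch of the two estimators, assuming a hypothetical uniform_sample() helper that returns one near-uniform document identifier from an engine's index, and hypothetical contains_a / contains_b membership tests (e.g., implemented by querying each engine for a sampled document's URL):

    def absolute_size_estimate(uniform_sample):
        """Estimate index size via the birthday paradox: a collision is
        expected after k ~ sqrt(N) uniform samples, so return k^2."""
        seen = set()
        k = 0
        while True:
            doc = uniform_sample()
            k += 1
            if doc in seen:
                return k * k
            seen.add(doc)

    def relative_size_estimate(samples_a, contains_b, samples_b, contains_a):
        """Estimate |A| / |B| from uniform samples of each index and
        membership tests against the other index."""
        frac_a_in_b = sum(contains_b(x) for x in samples_a) / len(samples_a)  # ~ |A∩B| / |A|
        frac_b_in_a = sum(contains_a(x) for x in samples_b) / len(samples_b)  # ~ |A∩B| / |B|
        return frac_b_in_a / frac_a_in_b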

7 Other Approaches
- Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]
- Queries from user query logs [LawrenceGiles98, DobraFeinberg04]
- Random sampling from the whole web [Henzinger et al. 00, Bar-Yossef et al. 00, Rusmevichientong et al. 01]

8 The Bharat-Broder Sampler: Preprocessing Step
[Diagram: a large corpus C is scanned to build a lexicon L of terms with their corpus frequencies: (t1, freq(t1, C)), (t2, freq(t2, C)), ...]

9 The Bharat-Broder Sampler
[Diagram: the BB sampler draws two random terms t1, t2 from the lexicon L, sends the conjunctive query "t1 AND t2" to the search engine, and returns a random document from the top-k results.]
Samples are uniform only if all queries return the same number of results (≤ k) and all documents are of the same length.
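
A sketch of the Bharat-Broder procedure, assuming a lexicon of (term, frequency) pairs from the preprocessing step and a hypothetical search(query, k) function wrapping the engine's public interface:

    import random

    def bharat_broder_sample(lexicon, search, k=100):
        """Pick two random lexicon terms, query their conjunction, and
        return a random document from the (at most k) results."""
        terms = [t for t, _freq in lexicon]
        while True:
            t1, t2 = random.sample(terms, 2)
            results = search(f"{t1} AND {t2}", k)
            if results:                       # retry if the query matches nothing
                return random.choice(results)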

10 The Bharat-Broder Sampler: Drawbacks
- Documents have varying lengths: bias towards long documents.
- Some queries have more than k matches: bias towards documents with high static rank.

11 Our Contributions
- A pool-based sampler, guaranteed to produce near-uniform samples (the focus of this talk).
- A random walk sampler: after sufficiently many steps, guaranteed to produce near-uniform samples; does not need an explicit lexicon/pool at all.

12 Search Engines as Hypergraphs
results(q) = { documents returned on query q }
queries(x) = { queries that return x as a result }
P = query pool = a set of queries
Query pool hypergraph:
- Vertices: indexed documents
- Hyperedges: { results(q) | q ∈ P }
[Example: documents news.google.com, news.bbc.co.uk, maps.google.com, maps.yahoo.com, en.wikipedia.org/wiki/BBC and queries "news", "bbc", "google", "maps".]

13 Query Cardinalities and Document Degrees
Query cardinality: card(q) = |results(q)|; in the example, card("news") = 4, card("bbc") = 3.
Document degree: deg(x) = |queries(x)|; in the example, deg(…) = 1, deg(news.bbc.co.uk) = 2.
Cardinality and degree are easily computable.
[Same example hypergraph as on the previous slide.]
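
For concreteness, a sketch of how the two quantities can be obtained in practice, assuming a hypothetical result_count(q) wrapper around the engine's reported hit count and that the document's text has been fetched (for a phrase pool, matching a query against a fetched page is just a substring test):

    def cardinality(q, result_count):
        """card(q): the number of results the engine reports for q."""
        return result_count(q)

    def degree(doc_text, pool):
        """deg(x): the number of pool queries the document matches;
        for a phrase pool this is a simple substring test."""
        return sum(phrase in doc_text for phrase in pool)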

14 Sampling documents uniformly
Sampling documents from D uniformly: hard.
Sampling documents from D non-uniformly: easier.
Will show later: we can sample documents proportionally to their degrees, i.e., p(x) = deg(x) / Σ_{x' ∈ D} deg(x').

15 Sampling documents by degree
Each document is sampled with probability proportional to its degree; in the example (total degree 13), p(news.bbc.co.uk) = 2/13 and a degree-1 document has probability 1/13.
[Same example hypergraph as before.]

16 Monte Carlo Simulation
We need: samples from the uniform distribution.
We have: samples from the degree distribution.
Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
Yes! Monte Carlo simulation methods: rejection sampling, importance sampling, Metropolis-Hastings, maximum-degree.

17 Monte Carlo Simulation
π: target distribution; in our case, π = uniform on D.
p: trial distribution; in our case, p = degree distribution.
Bias weight of p(x) relative to π(x): w(x) = π(x) / p(x); in our case, w(x) ∝ 1/deg(x).
[Diagram: a p-sampler feeds weighted samples (x1, w(x1)), (x2, w(x2)), ... into a Monte Carlo simulator, which outputs a sample x distributed according to π.]

18 Bias Weights
Unnormalized forms of π and p: π(x) = π̂(x) / Zπ and p(x) = p̂(x) / Zp, where Zπ, Zp are (unknown) normalization constants.
Examples:
- π = uniform: π̂(x) = 1
- p = degree distribution: p̂(x) = deg(x)
Bias weight: w(x) = π̂(x) / p̂(x) = 1/deg(x).

19 Rejection Sampling [von Neumann]
C: envelope constant, C ≥ w(x) for all x.
The algorithm:
- accept := false
- while (not accept):
  - generate a sample x from p
  - toss a coin whose heads probability is w(x)/C
  - if the coin comes up heads, accept := true
- return x
In our case: C = 1 and the acceptance probability is 1/deg(x).
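
A minimal sketch of this loop specialized to our setting, assuming a hypothetical sample_by_degree() helper that returns a document together with its degree; with C = 1 the coin's heads probability is w(x)/C = 1/deg(x):

    import random

    def rejection_sample_uniform(sample_by_degree):
        """Turn degree-distributed samples into (near-)uniform ones by
        accepting a document x with probability 1/deg(x)."""
        while True:
            x, deg_x = sample_by_degree()
            if random.random() < 1.0 / deg_x:   # coin with heads probability w(x)/C
                return x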

20 Pool-Based Sampler
[Diagram: a degree-distribution sampler issues queries q1, q2, ... to the search engine and receives results(q1), results(q2), ...; it outputs documents sampled from the degree distribution together with their weights (x1, 1/deg(x1)), (x2, 1/deg(x2)), ...; a rejection-sampling stage then turns these into a uniform sample x.]
Degree distribution: p(x) = deg(x) / Σ_{x'} deg(x').

21 Sampling documents by degree
- Select a random query q.
- Select a random x ∈ results(q).
Documents with high degree are more likely to be sampled, but if we sample q uniformly we "oversample" documents that belong to narrow queries. We need to sample q proportionally to its cardinality.
[Same example hypergraph as before.]

22 Sampling documents by degree (2)
- Select a query q proportionally to its cardinality.
- Select a random x ∈ results(q).
Analysis: Pr[x is sampled] = Σ_{q ∈ queries(x)} (card(q) / Σ_{q'} card(q')) · (1 / card(q)) = deg(x) / Σ_{q'} card(q'), i.e., documents are sampled proportionally to their degrees.
[Same example hypergraph as before.]
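
A compact sketch of this two-step selection, in the idealized setting where card(q) and results(q) are known for every pool query (the next slides show how to avoid computing all cardinalities up front):

    import random

    def sample_doc_by_degree(pool, card, results):
        """Select a query proportionally to card(q), then a uniform result;
        the returned document is distributed proportionally to deg(x)."""
        queries = list(pool)
        weights = [card[q] for q in queries]
        q = random.choices(queries, weights=weights, k=1)[0]
        return random.choice(results(q))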

23 Degree Distribution Sampler
[Diagram: a cardinality-distribution sampler produces a query q sampled from the cardinality distribution; q is sent to the search engine to obtain results(q), and x is then sampled uniformly from results(q); the resulting document is distributed according to the degree distribution.]

24 Sampling queries by cardinality
Sampling queries from the pool uniformly: easy.
Sampling queries from the pool by cardinality: hard; it requires knowing the cardinalities of all queries in the search engine.
Use Monte Carlo methods to simulate biased sampling via uniform sampling:
- Target distribution: the cardinality distribution.
- Trial distribution: the uniform distribution on the query pool.

25 Sampling queries by cardinality
Bias weight of the cardinality distribution relative to the uniform distribution: w(q) ∝ card(q), which can be computed using a single search engine query.
Use rejection sampling:
- Queries are sampled uniformly from the pool.
- Envelope constant: any C with C ≥ card(q) for all (non-overflowing) pool queries; C = k suffices once overflowing queries are skipped.
- Each query q is accepted with probability card(q) / C.
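
A sketch of this query sampler, assuming a hypothetical result_count(q) wrapper for card(q), that overflowing queries (card(q) > k) are simply skipped, and that C = k is used as the envelope constant:

    import random

    def sample_query_by_cardinality(pool, result_count, k):
        """Rejection sampling: draw q uniformly from the pool and accept
        with probability card(q)/k; empty and overflowing queries are rejected."""
        pool = list(pool)
        while True:
            q = random.choice(pool)
            c = result_count(q)          # one engine query gives card(q)
            if c == 0 or c > k:          # empty or overflowing: reject and retry
                continue
            if random.random() < c / k:
                return q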

26 Complete Pool-Based Sampler
[Diagram: a uniform query sampler feeds pairs (q, card(q)) into a rejection-sampling stage, yielding queries sampled from the cardinality distribution; these drive the degree-distribution sampler, which returns (q, results(q)) and outputs documents with weights (x, 1/deg(x)); a final rejection-sampling stage produces a uniform document sample x.]

27 Dealing with Overflowing Queries
Problem: some queries may overflow (card(q) > k), causing a bias towards highly ranked documents.
Solutions:
- Select a pool P in which overflowing queries are rare (e.g., phrase queries).
- Skip overflowing queries.
- Adapt rejection sampling to deal with approximate weights.
Theorem: samples of the pool-based sampler are at most ε-away from uniform, where ε is the overflow probability of P.

28 Creating the query pool
[Diagram: a large corpus C is scanned to extract queries q1, q2, ..., which form the query pool P.]
Example: P = all 3-word phrases that occur in C. If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be".
Choose a P that "covers" most documents in D.
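
A sketch of building such a phrase pool from a training corpus, assuming corpus is an iterable of document strings (whitespace tokenization is a simplification):

    def build_phrase_pool(corpus, phrase_len=3):
        """Collect all consecutive phrase_len-word phrases in the corpus;
        "to be or not to be" contributes "to be or", "be or not",
        "or not to", "not to be"."""
        pool = set()
        for doc in corpus:
            words = doc.lower().split()
            for i in range(len(words) - phrase_len + 1):
                pool.add(" ".join(words[i:i + phrase_len]))
        return pool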

29 A random walk sampler
Define a graph G over the indexed documents: (x, y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅.
Run a random walk on G; its limit distribution is the degree distribution.
Use MCMC methods (Metropolis-Hastings, maximum-degree) to make the limit distribution uniform.
- Does not need a preprocessing step.
- Less efficient than the pool-based sampler.
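
A textbook Metropolis-Hastings correction of such a walk, assuming a hypothetical neighbors(x) helper that returns the documents sharing at least one query with x (in practice realized by issuing queries from queries(x) to the engine); accepting a move x → y with probability min(1, deg(x)/deg(y)), where deg is the number of neighbors in G, makes the limit distribution uniform:

    import random

    def mh_uniform_walk(start, neighbors, steps):
        """Random walk on G with a Metropolis-Hastings filter so that the
        limit distribution is uniform over documents rather than the
        degree distribution."""
        x = start
        for _ in range(steps):
            nbrs_x = neighbors(x)
            if not nbrs_x:                       # isolated document: stay put
                continue
            y = random.choice(nbrs_x)
            deg_x, deg_y = len(nbrs_x), len(neighbors(y))
            if random.random() < min(1.0, deg_x / deg_y):
                x = y
        return x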

30 Bias towards Long Documents

31 Relative Sizes of Google, MSN and Yahoo!
Google = 1, Yahoo! = 1.28, MSN Search = 0.73

32 Top-Level Domains in Google, MSN and Yahoo!

33 Conclusions
Two new search engine samplers:
- Pool-based sampler
- Random walk sampler
The samplers are guaranteed to produce near-uniform samples under plausible assumptions, and they show little or no bias in experiments.

34 Thank You