1 Random Sampling from a Search Engine's Index. Ziv Bar-Yossef, Maxim Gurevich. Department of Electrical Engineering, Technion.


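2 Search Engine Samplers. [Diagram: the search engine indexes a set D of documents from the Web. The sampler talks to the search engine only through its public interface: it submits queries, receives the top k results, and outputs a random document $x \in D$.]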

3 Motivation. Useful tool for search engine evaluation: Freshness: fraction of up-to-date pages in the index. Topical bias: identification of overrepresented/underrepresented topics. Spam: fraction of spam pages in the index. Security: fraction of pages in the index infected by viruses/worms/trojans. Relative size: number of documents indexed compared with other search engines.

4 Size Wars. August 2005: "We index 20 billion documents." September 2005: "We index 8 billion documents, but our index is 3 times larger than our competition's." So, who's right?

5 Why Does Size Matter, Anyway? Comprehensiveness: a good crawler covers the most documents possible. Narrow-topic queries: e.g., get the homepage of John Doe. Prestige: a marketing advantage.

6 Measuring size using random samples [BharatBroder98, CheneyPerry05, GulliSignorni05]. Sample pages uniformly at random from the search engine's index. Two alternatives: (1) Absolute size estimation: sample until a collision occurs; for an index of size N, a collision is expected after $k \sim N^{1/2}$ random samples (birthday paradox), so return $k^2$. (2) Relative size estimation: check how many samples from search engine A are present in search engine B and vice versa.
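The two estimators on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the cited authors' code: `sample_uniform`, `in_a` and `in_b` are hypothetical callables standing in for a uniform sampler and for membership checks against each engine, and the relative estimate is only meaningful when the two indexes overlap.

```python
def estimate_absolute_size(sample_uniform, max_draws=10_000_000):
    """Draw uniform samples until one repeats, then return k^2.

    k is the number of draws up to and including the first collision.
    This is a rough, order-of-magnitude estimate (birthday paradox).
    """
    seen = set()
    for k in range(1, max_draws + 1):
        x = sample_uniform()
        if x in seen:
            return k * k
        seen.add(x)
    raise RuntimeError("no collision within max_draws")


def estimate_relative_size(samples_a, samples_b, in_a, in_b):
    """Estimate |A| / |B| from uniform samples of both indexes.

    The fraction of B's samples found in A approximates |A and B| / |B|;
    the fraction of A's samples found in B approximates |A and B| / |A|.
    Their ratio is |A| / |B|. Assumes at least one sample of A is in B.
    """
    frac_b_in_a = sum(1 for x in samples_b if in_a(x)) / len(samples_b)
    frac_a_in_b = sum(1 for x in samples_a if in_b(x)) / len(samples_a)
    return frac_b_in_a / frac_a_in_b
```

Both functions rely on the samples being uniform, which is exactly the property the rest of the talk is about.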

7 Other Approaches. Anecdotal queries [SearchEngineWatch, Google, BradlowSchmittlein00]. Queries from user query logs [LawrenceGiles98, DobraFeinberg04]. Random sampling from the whole web [Henzinger et al 00, Bar-Yossef et al 00, Rusmevichientong et al 01].

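8 The Bharat-Broder Sampler: Preprocessing Step. A large corpus C is processed into a lexicon L that lists each term together with its frequency in C: $(t_1, \mathrm{freq}(t_1, C)), (t_2, \mathrm{freq}(t_2, C)), \ldots$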

9 The Bharat-Broder Sampler. The BB sampler picks two random terms $t_1, t_2$ from the lexicon L, submits the query $t_1$ AND $t_2$ to the search engine, and returns a random document from the top k results. Samples are uniform only if all queries return the same number of results (at most k) and all documents are of the same length.
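A rough sketch of the procedure described on slides 8 and 9, under several assumptions: `search` is a hypothetical callable that returns the ranked matches of a conjunctive query, freq(t, C) is taken to be the raw occurrence count, and the two terms are drawn uniformly from the lexicon (the slide only says "random"). The toy in-memory engine exists purely so the example runs.

```python
import random
from collections import Counter


def build_lexicon(corpus):
    """Count how often each term occurs across the corpus texts."""
    return Counter(term for text in corpus for term in text.split())


def make_toy_search(index):
    """Hypothetical stand-in for a search engine over {doc_id: text}."""
    def search(terms):
        return [doc for doc, text in index.items()
                if all(t in text.split() for t in terms)]
    return search


def bb_sample(lexicon, search, k, rng, max_tries=1000):
    """Submit random two-term AND queries until one returns results,
    then return a random document from its top k results."""
    terms = sorted(lexicon)
    for _ in range(max_tries):
        t1, t2 = rng.sample(terms, 2)
        top_k = search([t1, t2])[:k]
        if top_k:
            return rng.choice(top_k)
    raise RuntimeError("no query returned results within max_tries")
```

Nothing here corrects for document length or for queries with more than k matches, which is why the uniformity conditions on the slide are so restrictive.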

10 The Bharat-Broder Sampler: Drawbacks. Documents have varying lengths: bias towards long documents. Some queries have more than k matches: bias towards documents with high static rank.

11 Our Contributions A pool-based sampler  Guaranteed to produce near-uniform samples A random walk sampler  After sufficiently many steps, guaranteed to produce near-uniform samples  Does not need an explicit lexicon/pool at all! Focus of this talk

12 Search Engines as Hypergraphs. results(q) = { documents returned on query q }. queries(x) = { queries that return x as a result }. P = query pool = a set of queries. Query pool hypergraph: vertices: indexed documents; hyperedges: { results(q) | $q \in P$ }. [Figure: example hypergraph with hyperedges “news”, “bbc”, “google”, “maps” over the documents www.cnn.com, www.foxnews.com, news.google.com, news.bbc.co.uk, www.google.com, maps.google.com, www.bbc.co.uk, www.mapquest.com, maps.yahoot.com, en.wikipedia.org/wiki/BBC.]

13 Query Cardinalities and Document Degrees. Query cardinality: card(q) = |results(q)|, e.g., card(“news”) = 4, card(“bbc”) = 3. Document degree: deg(x) = |queries(x)|, e.g., deg(www.cnn.com) = 1, deg(news.bbc.co.uk) = 2. Cardinality and degree are easily computable. [Figure: the example hypergraph from slide 12.]
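To make the two definitions concrete, here is a small sketch that stores the example hypergraph as a dictionary and derives card(q) and deg(x) from it. The exact membership of each hyperedge is an assumption read off the domain names; it is chosen to agree with the numbers quoted on the slides (card(“news”) = 4, card(“bbc”) = 3, deg(news.bbc.co.uk) = 2, and a total degree of 13).

```python
from collections import defaultdict

# Assumed hyperedges: results(q) for each query q in the pool.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com",
             "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def card(q, results=RESULTS):
    """Query cardinality: number of documents returned on q."""
    return len(results[q])


def invert(results=RESULTS):
    """Build queries(x): the set of pool queries that return x."""
    queries = defaultdict(set)
    for q, docs in results.items():
        for x in docs:
            queries[x].add(q)
    return dict(queries)


def deg(x, results=RESULTS):
    """Document degree: number of pool queries that return x."""
    return len(invert(results)[x])
```

Summing card(q) over all queries and summing deg(x) over all documents both count the (query, document) incidences, so the two totals always agree.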

14 Sampling documents uniformly. Sampling documents from D uniformly: hard. Sampling documents from D non-uniformly: easier. Will show later: can sample documents proportionally to their degrees: $p(x) = \deg(x) / \sum_{x'} \deg(x')$.

15 Sampling documents by degree. In the example hypergraph: p(news.bbc.co.uk) = 2/13, p(www.cnn.com) = 1/13.

16 Monte Carlo Simulation. We need: samples from the uniform distribution. We have: samples from the degree distribution. Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution? Yes! Monte Carlo simulation methods: Rejection Sampling, Importance Sampling, Metropolis-Hastings, Maximum-Degree.

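17 Monte Carlo Simulation. $\pi$: target distribution; in our case, $\pi$ = uniform on D. $p$: trial distribution; in our case, $p$ = degree distribution. Bias weight of $p(x)$ relative to $\pi(x)$, defined up to a constant factor: $w(x) = \pi(x)/p(x)$; in our case, $w(x) \propto 1/\deg(x)$. [Diagram: a p-sampler produces samples from p with their weights, $(x_1, w(x_1)), (x_2, w(x_2)), \ldots$; the Monte Carlo simulator turns them into a sample x from $\pi$, so the whole pipeline acts as a $\pi$-sampler.]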

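18 Bias Weights. Unnormalized forms of $\pi$ and $p$: $\pi(x) = \hat{\pi}(x)/Z_\pi$ and $p(x) = \hat{p}(x)/Z_p$, where $Z_\pi, Z_p$ are (unknown) normalization constants. Examples: $\pi$ = uniform: $\hat{\pi}(x) = 1$; $p$ = degree distribution: $\hat{p}(x) = \deg(x)$. Bias weight in unnormalized form: $w(x) = \hat{\pi}(x)/\hat{p}(x) = 1/\deg(x)$.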

19 Rejection Sampling [von Neumann]. C: envelope constant, $C \ge w(x)$ for all x. The algorithm: accept := false; while (not accept): generate a sample x from p; toss a coin whose heads probability is $w(x)/C$; if the coin comes up heads, accept := true; return x. In our case: C = 1 and acceptance prob = 1/deg(x).
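The loop on this slide translates almost directly into Python. The sketch below is generic in the trial sampler and the weight function; the `DEG` table and the `uniform_sample` helper are hypothetical toy values added only to show the "our case" setting, where the trial distribution is proportional to deg(x), the weight is 1/deg(x), and C = 1.

```python
import random


def rejection_sample(draw_from_p, weight, envelope, rng):
    """Return one sample from the target distribution.

    draw_from_p: callable returning a sample from the trial distribution p
    weight:      callable giving the bias weight w(x)
    envelope:    constant C with C >= w(x) for all x
    """
    while True:
        x = draw_from_p()
        # Coin toss with heads probability w(x) / C.
        if rng.random() < weight(x) / envelope:
            return x


# Toy degrees (hypothetical), used to simulate a degree-biased trial sampler.
DEG = {"a": 1, "b": 2, "c": 4}


def uniform_sample(rng):
    """Turn degree-biased draws into a uniform draw: w(x) = 1/deg(x), C = 1."""
    docs = list(DEG)
    degree_weights = [DEG[d] for d in docs]

    def draw():
        return rng.choices(docs, weights=degree_weights)[0]

    return rejection_sample(draw, lambda x: 1 / DEG[x], 1.0, rng)
```

High-degree documents are proposed more often but accepted less often, and the two effects cancel. The price is wasted proposals: in this toy setting only 3 out of every 7 proposals are accepted on average.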

20 Pool-Based Sampler. [Diagram: the degree distribution sampler sends queries $q_1, q_2, \ldots$ to the search engine and receives results($q_1$), results($q_2$), ...; it outputs documents sampled from the degree distribution with corresponding weights, $(x_1, 1/\deg(x_1)), (x_2, 1/\deg(x_2)), \ldots$; rejection sampling turns these into a uniform sample x.] Degree distribution: $p(x) = \deg(x) / \sum_{x'} \deg(x')$.

21 Sampling documents by degree. Select a random query q. Select a random $x \in$ results(q). Documents with high degree are more likely to be sampled. If we sample q uniformly, we “oversample” documents that belong to narrow queries. We need to sample q proportionally to its cardinality.

22 Sampling documents by degree (2). Select a query q proportionally to its cardinality. Select a random $x \in$ results(q). Analysis: $p(x) = \sum_{q \in \mathrm{queries}(x)} \frac{\mathrm{card}(q)}{\sum_{q'} \mathrm{card}(q')} \cdot \frac{1}{\mathrm{card}(q)} = \frac{\deg(x)}{\sum_{q'} \mathrm{card}(q')}$, which is the degree distribution because $\sum_{q'} \mathrm{card}(q') = \sum_{x'} \deg(x')$.
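The two-step procedure can be checked empirically on the example hypergraph. In this sketch the hyperedge memberships are an assumption inferred from the domain names (consistent with the cardinalities, degrees and the total of 13 quoted on the slides), and all cardinalities are assumed to be known up front, which a real sampler would not have.

```python
import random

# Assumed hyperedges of the example hypergraph: results(q) per query.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com",
             "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def sample_by_degree(results, rng):
    """Pick q with probability proportional to card(q), then pick x
    uniformly from results(q). The result follows the degree distribution."""
    queries = list(results)
    cards = [len(results[q]) for q in queries]
    q = rng.choices(queries, weights=cards)[0]
    return rng.choice(results[q])
```

Over many draws, news.bbc.co.uk should appear in roughly 2/13 of the samples and www.cnn.com in roughly 1/13, matching slide 15.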

23 Degree Distribution Sampler. [Diagram: a cardinality distribution sampler emits a query q sampled from the cardinality distribution; q is sent to the search engine, which returns results(q); x is then sampled uniformly from results(q), giving a document sampled from the degree distribution.]

24 Sampling queries by cardinality. Sampling queries from the pool uniformly: easy. Sampling queries from the pool by cardinality: hard; requires knowing the cardinalities of all queries in the search engine. Use Monte Carlo methods to simulate biased sampling via uniform sampling: target distribution: the cardinality distribution; trial distribution: the uniform distribution on the query pool.

25 Sampling queries by cardinality. Bias weight of the cardinality distribution relative to the uniform distribution: $w(q) = \mathrm{card}(q)$; can be computed using a single search engine query. Use rejection sampling: envelope constant for rejection sampling: $C = k$ (valid as long as q does not overflow, i.e., $\mathrm{card}(q) \le k$); queries are sampled uniformly from the pool; each query q is accepted with probability $\mathrm{card}(q)/k$.
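A minimal sketch of this step, assuming the envelope constant is the result limit k and that overflowing queries are simply skipped. `card` is a hypothetical callable that would cost one search engine query per call; here it is backed by a toy table so the example runs.

```python
import random


def sample_query_by_cardinality(pool, card, k, rng):
    """Rejection sampling: uniform trial over the pool, target
    proportional to card(q), acceptance probability card(q) / k."""
    while True:
        q = rng.choice(pool)   # uniform draw from the query pool
        c = card(q)            # in practice: one search engine query
        if c > k:
            continue           # overflowing query: skip it
        if rng.random() < c / k:
            return q


# Toy cardinalities (taken from the example: "news" = 4, "bbc" = 3;
# the values for "google" and "maps" are assumptions).
TOY_CARD = {"news": 4, "bbc": 3, "google": 3, "maps": 3}
```

Queries with no results are never accepted, and the sampler needs only the cardinality of the queries it actually proposes, not of the whole pool.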

26 Complete Pool-Based Sampler. [Diagram: a uniform query sampler draws queries from the pool; rejection sampling with weights (q, card(q)) turns the uniform query sample into queries sampled from the cardinality distribution; each accepted query is sent to the search engine, and a document is drawn from (q, results(q)), giving documents sampled from the degree distribution; a second rejection sampling step with weights (x, 1/deg(x)) yields a uniform document sample.]

27 Dealing with Overflowing Queries. Problem: some queries may overflow (card(q) > k), causing bias towards highly ranked documents. Solutions: select a pool P in which overflowing queries are rare (e.g., phrase queries); skip overflowing queries; adapt rejection sampling to deal with approximate weights. Theorem: samples of the PB sampler are at most $\epsilon$-away from uniform ($\epsilon$ = overflow probability of P).

28 Creating the query pool. A large corpus C is processed into a query pool P = {$q_1, q_2, \ldots$}. Example: P = all 3-word phrases that occur in C. If “to be or not to be” occurs in C, P contains: “to be or”, “be or not”, “or not to”, “not to be”. Choose P that “covers” most documents in D.
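Extracting the phrase pool is a sliding window over each text. The sketch below assumes whitespace tokenization, which is a simplification; a real pool would need the same tokenization rules as the search engine.

```python
def phrase_pool(corpus, n=3):
    """Return the set of all n-word phrases occurring in the corpus texts."""
    pool = set()
    for text in corpus:
        words = text.split()
        for i in range(len(words) - n + 1):
            pool.add(" ".join(words[i:i + n]))
    return pool
```

For example, `phrase_pool(["to be or not to be"])` yields the four phrases listed on the slide. The same function applied to a single document's text gives the pool phrases that document contains, which is one way to approach deg(x).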

29 A random walk sampler. Define a graph G over the indexed documents: $(x, y) \in E$ iff $\mathrm{queries}(x) \cap \mathrm{queries}(y) \neq \emptyset$. Run a random walk on G; limit distribution = degree distribution. Use MCMC methods to make the limit distribution uniform: Metropolis-Hastings, Maximum-Degree. Does not need a preprocessing step. Less efficient than the pool-based sampler.
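The Metropolis-Hastings variant can be sketched on a small explicit graph. This is a simplified illustration, not the sampler from the talk: it assumes the neighbor sets of G are available in memory (a real sampler would have to discover them through queries), it drops self-loops, and "degree" here means the number of neighbors in G. The `RESULTS` table is an assumed reading of the example hypergraph.

```python
import random

# Assumed hyperedges of the example hypergraph: results(q) per query.
RESULTS = {
    "news": ["www.cnn.com", "www.foxnews.com",
             "news.google.com", "news.bbc.co.uk"],
    "bbc": ["news.bbc.co.uk", "www.bbc.co.uk", "en.wikipedia.org/wiki/BBC"],
    "google": ["news.google.com", "www.google.com", "maps.google.com"],
    "maps": ["maps.google.com", "www.mapquest.com", "maps.yahoot.com"],
}


def build_graph(results):
    """x and y are neighbors iff some query returns both (self-loops dropped)."""
    neighbors = {}
    for docs in results.values():
        for x in docs:
            neighbors.setdefault(x, set()).update(d for d in docs if d != x)
    return neighbors


def transition_prob(x, y, neighbors):
    """Exact probability that one MH step moves from x to y (x != y)."""
    if y not in neighbors[x]:
        return 0.0
    return min(1 / len(neighbors[x]), 1 / len(neighbors[y]))


def mh_step(x, neighbors, rng):
    """Propose a uniform neighbor y; accept with min(1, deg(x) / deg(y))."""
    y = rng.choice(sorted(neighbors[x]))
    accept = min(1.0, len(neighbors[x]) / len(neighbors[y]))
    return y if rng.random() < accept else x


def mh_walk(start, neighbors, steps, rng):
    """Run the walk for a fixed number of steps and return the final node."""
    x = start
    for _ in range(steps):
        x = mh_step(x, neighbors, rng)
    return x
```

Because the probability of moving from x to y equals the probability of moving from y to x, the uniform distribution is stationary for this walk. How many steps count as "sufficiently many" depends on the graph and is not addressed by this sketch.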

30 Bias towards Long Documents

31 Relative Sizes of Google, MSN and Yahoo! Google = 1 Yahoo! = 1.28 MSN Search = 0.73

32 Top-Level Domains in Google, MSN and Yahoo!

33 Conclusions. Two new search engine samplers: pool-based sampler; random walk sampler. Samplers are guaranteed to produce near-uniform samples, under plausible assumptions. Samplers show no or little bias in experiments.

34 Thank You
