1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 11, 2006

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 11 June 11, 2006

2 Random Sampling from a Search Engine’s Index

3 Search Engine Samplers
A sampler interacts with the search engine only through its public interface: it issues queries, receives the top k results, and must output a random document x ∈ D, where D is the set of indexed documents.

4 Motivation
Useful tool for search engine evaluation:
- Freshness: fraction of up-to-date pages in the index
- Topical bias: identification of overrepresented/underrepresented topics
- Spam: fraction of spam pages in the index
- Security: fraction of pages in the index infected by viruses/worms/trojans
- Relative size: number of documents indexed compared with other search engines

5 Size Wars
August 2005: We index 20 billion documents.
September 2005: We index 8 billion documents, but our index is 3 times larger than our competition's.
So, who's right?

6 The Bharat-Broder Sampler: Preprocessing Step
Build a lexicon L from a large corpus C: for each term t_i, record its frequency freq(t_i, C).

7 The Bharat-Broder Sampler
Pick two random terms t_1, t_2 from the lexicon L, send the query "t_1 AND t_2" to the search engine, and return a random document from the top k results.
Samples are uniform only if: (a) all queries return the same number of results ≤ k, and (b) all documents are of the same length.
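The sampler is easy to state in code. Below is a minimal Python sketch, assuming a hypothetical search(q) function that returns the engine's top-k results for query q and a lexicon list produced by the preprocessing step; both names are illustrative, not part of the original slides.

```python
import random

def bharat_broder_sample(lexicon, search):
    """One Bharat-Broder sample. A sketch: `lexicon` and `search` are
    assumed interfaces, not a real search-engine API."""
    while True:
        # Draw two random terms and issue the conjunctive query.
        t1, t2 = random.sample(lexicon, 2)
        results = search(f"{t1} AND {t2}")  # top-k results only
        if results:
            # Uniform choice among the (at most k) returned documents.
            return random.choice(results)
```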

8 The Bharat-Broder Sampler: Drawbacks
- Documents have varying lengths: bias towards long documents
- Some queries have more than k matches: bias towards documents with high static rank

9 Search Engines as Hypergraphs
results(q) = { documents returned on query q }
queries(x) = { queries that return x as a result }
P = query pool = a set of queries
Query pool hypergraph:
- Vertices: indexed documents
- Hyperedges: { results(q) | q ∈ P }
[Running example: queries "news", "bbc", "google", "maps"; documents news.google.com, news.bbc.co.uk, maps.google.com, maps.yahoo.com, en.wikipedia.org/wiki/BBC]

10 Query Cardinalities and Document Degrees
Query cardinality: card(q) = |results(q)|, e.g., card("news") = 4, card("bbc") = 3.
Document degree: deg(x) = |queries(x)|, e.g., deg(news.bbc.co.uk) = 2.
Cardinality and degree are easily computable.

11 Sampling documents uniformly
Sampling documents from D uniformly: hard.
Sampling documents from D non-uniformly: easier.
Will show later: can sample documents proportionally to their degrees, i.e., p(x) = deg(x) / Σ_{x' ∈ D} deg(x').

12 Sampling documents by degree
In the running example the degrees sum to 13, so p(x) = deg(x)/13: p(news.bbc.co.uk) = 2/13, and each degree-1 document has p(x) = 1/13.

13 Monte Carlo Simulation
We need: samples from the uniform distribution.
We have: samples from the degree distribution.
Can we somehow use the samples from the degree distribution to generate samples from the uniform distribution?
Yes! Monte Carlo simulation methods: rejection sampling, importance sampling, Metropolis-Hastings, maximum-degree.

14 Monte Carlo Simulation
π: target distribution; in our case, π = uniform on D.
p: trial distribution; in our case, p = degree distribution.
Bias weight of p(x) relative to π(x): w(x) = π(x)/p(x); in our case, w(x) ∝ 1/deg(x).
A Monte Carlo simulator consumes weighted samples (x_1, w(x_1)), (x_2, w(x_2)), … from a p-sampler and produces samples from π.

15 Bias Weights
Unnormalized forms of π and p: π(x) = π̂(x)/Z_π and p(x) = p̂(x)/Z_p, where Z_π, Z_p are (unknown) normalization constants.
Examples:
- π = uniform: π̂(x) = 1, Z_π = |D|
- p = degree distribution: p̂(x) = deg(x), Z_p = Σ_{x'} deg(x')
Bias weight: w(x) = π̂(x)/p̂(x) = 1/deg(x)

16 Rejection Sampling [von Neumann]
C: envelope constant, C ≥ w(x) for all x.
The algorithm:
- accept := false
- while (not accept): generate a sample x from p; toss a coin whose heads probability is w(x)/C; if the coin comes up heads, accept := true
- return x
In our case: C = 1, so the acceptance probability is 1/deg(x).
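As a concrete illustration, here is a minimal Python sketch of rejection sampling in this setting; sample_from_p and weight are assumed callables (a degree-distribution sampler and the bias weight w(x) = 1/deg(x)), not names from the slides.

```python
import random

def rejection_sample(sample_from_p, weight, C=1.0):
    """von Neumann rejection sampling (a sketch).

    sample_from_p: draws a sample from the trial distribution p.
    weight: the bias weight w(x); here w(x) = 1/deg(x).
    C: envelope constant with C >= w(x) for all x (C = 1 in our case).
    Accepted samples are distributed according to the target pi.
    """
    while True:
        x = sample_from_p()
        # Accept x with probability w(x)/C.
        if random.random() < weight(x) / C:
            return x
```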

17 Pool-Based Sampler
Pipeline: the degree-distribution sampler issues queries q_1, q_2, … to the search engine, receives results(q_1), results(q_2), …, and outputs documents sampled from the degree distribution with corresponding weights (x_1, 1/deg(x_1)), (x_2, 1/deg(x_2)), …; rejection sampling then turns these into a uniform sample x.
Degree distribution: p(x) = deg(x) / Σ_{x'} deg(x')

18 Sampling documents by degree
Select a random query q, then select a random x ∈ results(q).
Documents with high degree are more likely to be sampled, but if we sample q uniformly we "oversample" documents that belong to narrow queries.
We need to sample q proportionally to its cardinality.

19 Sampling documents by degree (2)
Select a query q proportionally to its cardinality, then select a random x ∈ results(q).
Analysis: p(x) = Σ_{q ∈ queries(x)} [card(q) / Σ_{q'} card(q')] · [1/card(q)] = deg(x) / Σ_{q'} card(q'), i.e., exactly the degree distribution.

20 Degree Distribution Sampler
A query q sampled from the cardinality distribution is sent to the search engine; sampling x uniformly from results(q) then yields a document sampled from the degree distribution.

21 Sampling queries by cardinality
Sampling queries from the pool uniformly: easy.
Sampling queries from the pool by cardinality: hard, since it requires knowing the cardinalities of all queries in the search engine.
Use Monte Carlo methods to simulate biased sampling via uniform sampling:
- Target distribution: the cardinality distribution
- Trial distribution: uniform distribution on the query pool

22 Sampling queries by cardinality
Bias weight of the cardinality distribution relative to the uniform distribution: w(q) = card(q), which can be computed using a single search engine query.
Use rejection sampling:
- Envelope constant: C = k (no non-overflowing query returns more than k results)
- Queries are sampled uniformly from the pool
- Each query q is accepted with probability w(q)/C = card(q)/k

23 Complete Pool-Based Sampler
Two nested rejection-sampling stages: a uniform query sample (q, card(q)), … is turned into a query sampled from the cardinality distribution; the degree distribution sampler uses it to produce documents (x, 1/deg(x)), … sampled from the degree distribution with corresponding weights; a second rejection-sampling stage turns these into a uniform document sample x.
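Putting the two stages together, a minimal Python sketch of the complete pool-based sampler might look as follows; pool, search, and the brute-force degree computation are assumptions for illustration (in practice deg(x) is computed from the document's own text, by checking which pool queries it matches).

```python
import random

def pool_based_sample(pool, search, k):
    """Complete pool-based sampler (a sketch with assumed interfaces).

    pool: list of queries; search(q): results(q), at most k documents.
    """
    while True:
        # Stage 1: uniform query -> cardinality distribution.
        q = random.choice(pool)
        results = search(q)
        if random.random() >= len(results) / k:
            continue  # reject q (accepted with probability card(q)/k)
        # Cardinality-weighted query + uniform result = degree distribution.
        x = random.choice(results)
        # Stage 2: degree distribution -> uniform.
        # Brute-force deg(x) for clarity; real implementations match
        # pool queries against the document's content instead.
        deg = sum(1 for q2 in pool if x in search(q2))
        if random.random() < 1.0 / deg:
            return x
```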

24 Dealing with Overflowing Queries
Problem: some queries may overflow (card(q) > k), causing a bias towards highly ranked documents.
Solutions:
- Select a pool P in which overflowing queries are rare (e.g., phrase queries)
- Skip overflowing queries
- Adapt rejection sampling to deal with approximate weights
Theorem: Samples of the pool-based sampler are at most ε-away from uniform (ε = overflow probability of P).

25 Creating the query pool
Build the query pool P from a large corpus C of documents.
Example: P = all 3-word phrases that occur in C. If "to be or not to be" occurs in C, P contains: "to be or", "be or not", "or not to", "not to be".
Choose P so that it "covers" most documents in D.
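A minimal sketch of pool construction, assuming corpus_docs is an iterable of document strings from the training corpus C (an illustrative interface):

```python
def build_phrase_pool(corpus_docs):
    """Build P = all 3-word phrases occurring in the corpus (a sketch)."""
    pool = set()
    for doc in corpus_docs:
        words = doc.split()
        for i in range(len(words) - 2):
            # "to be or not to be" contributes "to be or", "be or not", ...
            pool.add(" ".join(words[i:i + 3]))
    return list(pool)
```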

26 A random walk sampler
Define a graph G over the indexed documents: (x,y) ∈ E iff queries(x) ∩ queries(y) ≠ ∅.
Run a random walk on G; its limit distribution is the degree distribution.
Use MCMC methods (Metropolis-Hastings, maximum-degree) to make the limit distribution uniform.
Does not need a preprocessing step, but is less efficient than the pool-based sampler.
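For the Metropolis-Hastings variant, a minimal Python sketch is below; neighbors(x) is an assumed callable returning the documents that share a query with x, and the graph is assumed connected. With a uniform target, a proposed move from x to y is accepted with probability min(1, deg(x)/deg(y)).

```python
import random

def metropolis_hastings_walk(neighbors, x0, steps):
    """Random walk with Metropolis-Hastings correction (a sketch).

    The plain walk converges to the degree distribution; the acceptance
    rule min(1, deg(x)/deg(y)) makes the limit distribution uniform.
    """
    x = x0
    for _ in range(steps):
        nbrs = neighbors(x)
        y = random.choice(nbrs)                 # proposal: uniform neighbor
        accept = min(1.0, len(nbrs) / len(neighbors(y)))
        if random.random() < accept:
            x = y                               # otherwise stay at x
    return x
```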

27 Bias towards Long Documents

28 Relative Sizes of Google, MSN and Yahoo! Google = 1 Yahoo! = 1.28 MSN Search = 0.73

29 Random Sampling

30 Outline
- The random sampling model
- Mean estimation
- Median estimation
- O(n) time median algorithm (Floyd-Rivest)
- MST weight estimation (Chazelle-Rubinfeld-Trevisan)

31 The Random Sampling Model
f: Aⁿ → B, where A, B are arbitrary sets and n is a positive integer (think of n as large).
Goal: given x ∈ Aⁿ, compute f(x); sometimes an approximation of f(x) suffices.
Oracle access to input:
- The algorithm does not have direct access to x
- To probe x, the algorithm sends queries to an "oracle"
- Query: an index i ∈ {1,…,n}; answer: x_i
Objective: compute f with the minimum number of queries.

32 Motivation
The most basic model for dealing with large data sets: statistics, machine learning, signal processing, approximation algorithms, and more.
The algorithm's resources are a function of the number of queries rather than of the input length; sometimes a constant number of queries suffices.

33 Adaptive vs. Non-adaptive Sampling
Non-adaptive sampling:
- The algorithm decides which indices to query a priori
- Queries are performed in batch at a pre-processing step
- The number of queries performed is the same for all inputs
Adaptive sampling:
- Queries are performed sequentially: query i_1, get answer x_{i_1}; query i_2, get answer x_{i_2}; … The algorithm stops whenever it has enough information to compute f(x)
- To decide which index to query, the algorithm can use the answers to previous queries
- The number of queries performed may vary for different inputs
Example: OR of n bits (a sketch follows below).
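A minimal Python sketch of the OR example, with an assumed oracle(i) returning bit x_i; the query count adapts to the input:

```python
import random

def adaptive_or(oracle, n, max_queries):
    """Adaptive sampling for OR of n bits (a sketch).

    Stops at the first 1 it sees, so inputs with many 1s need very few
    queries; after max_queries misses it outputs 0, which is wrong only
    if the (rare) 1s were all missed.
    """
    for _ in range(max_queries):
        if oracle(random.randrange(n)) == 1:
            return 1  # a single 1 already determines OR(x) = 1
    return 0
```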

34 Randomization vs. Determinism
Deterministic algorithms:
- Non-adaptive: always queries the same set of indices
- Adaptive: the choice of i_t deterministically depends on the answers to the first t-1 queries
Randomized algorithms:
- Non-adaptive: indices are chosen randomly according to some distribution (e.g., uniform)
- Adaptive: i_t is chosen randomly according to a distribution that depends on the answers to previous queries
Our focus: randomized algorithms.

35 (ε,δ)-approximation
M: a randomized sampling algorithm; M(x), the output of M on input x, is a random variable.
ε > 0: approximation error parameter; 0 < δ < 1: confidence parameter.
Definition: M is said to ε-approximate f with confidence 1-δ if for all inputs x ∈ Aⁿ, Pr[|M(x) - f(x)| ≤ ε·f(x)] ≥ 1-δ.
Example: with δ = 0.1, the output is within relative error ε of f(x) with probability ≥ 0.9.

36 Query Complexity
Definition: qcost(M) = the maximum number of queries M performs, over the worst choice of input x and the worst choice of random bits.
Definition: eqcost(M) = the expected number of queries M performs on the worst choice of input x, in expectation over the random bits.
Definition: the query complexity of f is qc_{ε,δ}(f) = min { qcost(M) | M ε-approximates f with confidence 1-δ }; eqc_{ε,δ}(f) is defined similarly with eqcost.

37 Estimating the Mean
Want a relative approximation of the mean μ = (1/n) Σ_i x_i: output μ̂ with |μ̂ - μ| ≤ εμ.
Naive algorithm:
- Choose i_1,…,i_k uniformly and independently
- Query i_1,…,i_k
- Output the sample mean μ̂ = (1/k) Σ_j x_{i_j}
How large should k be?

38 Chernoff-Hoeffding Bound
X_1,…,X_n: i.i.d. random variables with bounded domain [0,1] and E[X_i] = μ for all i.
By linearity of expectation: E[(1/n) Σ_i X_i] = μ.
Theorem [Chernoff-Hoeffding bound]: for all 0 < ε < 1, Pr[|(1/n) Σ_i X_i - μ| > εμ] ≤ 2·exp(-ε²μn/3).

39 Analysis of Naive Algorithm
Lemma: k = O(log(1/δ) / (ε²μ)) queries suffice.
Proof:
- For i = 1,…,k, let X_i = the answer to the i-th query; the X_i are i.i.d. in [0,1] with E[X_i] = μ.
- The output of the algorithm is μ̂ = (1/k) Σ_i X_i.
- By the Chernoff-Hoeffding bound: Pr[|μ̂ - μ| > εμ] ≤ 2·exp(-ε²μk/3), which is at most δ for k ≥ (3/(ε²μ))·ln(2/δ).
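A minimal Python sketch of the naive estimator, sized per the lemma; oracle(i) returning x_i in [0,1] and the lower bound mu_lower on μ are assumptions (μ itself is unknown, so some lower bound is needed to fix k):

```python
import math
import random

def estimate_mean(oracle, n, eps, delta, mu_lower):
    """Naive mean estimator (a sketch). Returns a (1 +/- eps)-relative
    approximation of the mean with probability >= 1 - delta, using
    k = (3 / (eps^2 * mu_lower)) * ln(2/delta) uniform queries."""
    k = math.ceil(3 * math.log(2 / delta) / (eps ** 2 * mu_lower))
    total = sum(oracle(random.randrange(n)) for _ in range(k))
    return total / k  # the sample mean
```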

40 Estimating the Median
Want a rank approximation: output an element whose rank is in [(1/2 - ε)n, (1/2 + ε)n].
Sampling algorithm:
- Choose i_1,…,i_k uniformly and independently
- Query i_1,…,i_k
- Output the sample median
How large should k be?

41 Analysis of Median Algorithm
Lemma: k = O(log(1/δ) / ε²) queries suffice.
Proof sketch:
- For j = 1,…,k, let X_j = 1 if the j-th sample has rank ≤ (1/2 - ε)n, and 0 otherwise; then E[X_j] ≤ 1/2 - ε.
- The sample median falls below rank (1/2 - ε)n only if Σ_j X_j ≥ k/2, which by a Chernoff bound happens with probability ≤ exp(-Ω(ε²k)).
- A symmetric argument bounds the probability that the sample median falls above rank (1/2 + ε)n; a union bound completes the proof.
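A minimal Python sketch of the sampling algorithm from the previous slide, with the sample size set per the lemma (the constant 3 is illustrative; any k = O(log(1/δ)/ε²) works); oracle(i) returning x_i is an assumed interface:

```python
import math
import random

def approximate_median(oracle, n, eps, delta):
    """Sample-median estimator (a sketch). With probability >= 1 - delta,
    returns an element whose rank is within eps*n of n/2."""
    k = math.ceil(3 * math.log(2 / delta) / eps ** 2)
    samples = sorted(oracle(random.randrange(n)) for _ in range(k))
    return samples[k // 2]  # the sample median
```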

42 The Selection Problem
Input: n real numbers x_1,…,x_n and an integer k ∈ {1,…,n}.
Output: the x_i whose rank is k.
Examples: k = 1: minimum; k = n: maximum; k = n/2: median.
Can be easily solved by sorting in O(n log n) time. Can we do it in O(n) time?

43 The Floyd-Rivest Algorithm
Note: our approximate median algorithm generalizes to any quantile 0 < q < 1.
Floyd-Rivest algorithm:
1. Set ε = 1/n^{1/3}, δ = 1/n
2. Use the approximate quantile algorithm for q = k/n - ε; let x_L be the element returned
3. Use the approximate quantile algorithm for q = k/n + ε; let x_R be the element returned
4. Let k_L = rank(x_L)
5. Keep only the elements in the interval [x_L, x_R]
6. Sort these elements and output the element whose rank among them is k - k_L + 1
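A self-contained Python sketch of the algorithm above, with the approximate quantile step simulated by taking empirical quantiles of a random sample of size about n^{2/3} (the exact sample size and the tie handling are simplifications):

```python
import random

def floyd_rivest_select(x, k):
    """Floyd-Rivest selection (a sketch): return the element of rank k
    (1-based) in expected O(n) time."""
    n = len(x)
    eps = n ** (-1.0 / 3.0)
    q = k / n
    # Approximate quantiles q - eps and q + eps from a random sample.
    m = max(1, int(n ** (2.0 / 3.0)))
    s = sorted(random.choice(x) for _ in range(m))
    xL = s[max(0, min(m - 1, int((q - eps) * m)))]
    xR = s[max(0, min(m - 1, int((q + eps) * m)))]
    kL = sum(1 for v in x if v < xL)            # elements below x_L: O(n)
    S = sorted(v for v in x if xL <= v <= xR)   # expected O(n^{2/3}) items
    idx = k - kL - 1                            # 0-based rank inside S
    if 0 <= idx < len(S):
        return S[idx]
    # Rare failure (the target fell outside [x_L, x_R]): fall back to sorting.
    return sorted(x)[k - 1]
```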

44 Analysis of Floyd-Rivest
Theorem: with probability 1 - O(1/n), the Floyd-Rivest algorithm finds the element of rank k in O(n) time.
Proof:
- Let x* = the element of rank k.
- Lemma 1: with probability ≥ 1 - 2/n, x* ∈ [x_L, x_R].
- Proof: by the approximate quantile guarantee (with δ = 1/n), rank(x_L) ≤ (k/n - ε + ε)n = k and rank(x_R) ≥ (k/n + ε - ε)n = k, each failing with probability ≤ 1/n; a union bound gives the claim.

45 Analysis of Floyd-Rivest
Let S = the input elements that belong to [x_L, x_R].
Lemma 2: with probability ≥ 1 - 2/n, |S| ≤ O(n^{2/3}).
Proof: with probability ≥ 1 - 2/n, rank(x_L) ≥ (k/n - 2ε)n and rank(x_R) ≤ (k/n + 2ε)n; therefore at most 4εn = 4n^{2/3} = O(n^{2/3}) elements lie between x_L and x_R.
Running time analysis:
- O(n^{2/3} log n): approximate quantile computations
- O(n): calculation of rank(x_L)
- O(n): filtering elements outside [x_L, x_R]
- O(n^{2/3} log n): sorting S

46 End of Lecture 11