
1 The Complexity of Massive Data Set Computations
Ziv Bar-Yossef, Computer Science Division, U.C. Berkeley
Ph.D. Dissertation Talk, May 6, 2002

2 What Are Massive Data Sets?
Examples:
–The Web
–IP packets
–Supermarket transactions
–Telephone call graph
–Astronomical observations
Characterizing properties:
–Huge collections of raw data
–Data is generated and modified continuously
–Distributed over many sites
–Slow storage devices
–Data is not organized / indexed

3 Nontraditional Computational Challenges
Restricted access to the data:
–Random access: expensive
–“Streaming” access: more feasible
–Some data may be unavailable
–Fetching data is expensive
Traditionally: cope with the difficulty of the problem.
Massive data sets: cope with the size of the data and the restricted access to it:
–Sub-linear running time (ideally, independent of the data size)
–Sub-linear space (ideally, logarithmic in the data size)

4 Basic Framework
Massive data set computations are typically:
–Approximate
–Randomized
–Subject to a restricted access regime
[Diagram: Input Data → Access Regime → Algorithm ($$ = randomness) → Approximate Output]

5 Prominent Computational Models for Massive Data Sets
Sampling computations:
–Sub-linear running time & space
–Suitable for “insensitive” functions
Data stream computations:
–Linear running time, sub-linear space
–Can compute sensitive functions
Sketch computations:
–Suitable for distributed data

6 Sampling Computations
A sampling algorithm queries the input x_1,…,x_n at random locations (using randomness $$) and outputs an approximation of f(x_1,…,x_n).
–Can choose the query distribution and can query adaptively
–Complexity measure: query complexity
Applications:
–Statistical parameter estimation
–Computational and statistical learning [Valiant 84, Vapnik 98]
–Property testing [RS96, GGR96]
(A minimal code sketch of this model follows below.)
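To make the model concrete, here is a minimal Python sketch of a sampling computation for the simplest case analyzed later in the talk, estimating the mean; the Hoeffding-based sample size and all names are my illustration, not the talk's.

```python
import math
import random

def estimate_mean(x, eps, delta):
    """(eps, delta)-approximate the mean of x (values in [0,1]) by
    querying the input at uniformly random locations.  By Hoeffding's
    inequality, k = O((1/eps^2) * log(1/delta)) queries suffice,
    matching the lower bound of slide 14 up to constants."""
    k = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    total = sum(x[random.randrange(len(x))] for _ in range(k))
    return total / k
```

Note that the query complexity is independent of n, the hallmark of the sampling model.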

7 Data Stream Computations [HRR98, AMS96, FKSV99]
The input x_1,…,x_n arrives in a one-way stream, in arbitrary order; the algorithm keeps a small memory (plus randomness $$) and outputs an approximation of f(x_1,…,x_n).
–Complexity measures: space and time per data item
Applications:
–Databases (frequency moments [AMS96])
–Networking (L_p distance [AMS96, FKSV99, FS00, Indyk 00])
–Web information retrieval (web crawling, Google query logs [CCF02])
(A code sketch of a classic data stream algorithm follows below.)
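As a concrete, simplified instance of the model, here is a Python sketch of the classic one-pass [AMS96] estimator for the second frequency moment F_2 = Σ_j f_j². For readability the random ±1 signs are stored explicitly in dictionaries; the actual algorithm draws them from 4-wise independent hash families to keep the space logarithmic.

```python
import random
from collections import defaultdict

def ams_f2(stream, reps=100):
    """One-pass AMS estimator for F2.  Each repetition maintains a
    single counter Z = sum_j f_j * s(j) for random signs s(j) in
    {-1,+1}; E[Z^2] = F2, and averaging independent repetitions
    reduces the variance (a median-of-means step would give the
    high-probability guarantee)."""
    signs = [defaultdict(lambda: random.choice((-1, 1)))
             for _ in range(reps)]
    z = [0] * reps
    for item in stream:              # one pass, arbitrary order
        for r in range(reps):
            z[r] += signs[r][item]
    return sum(zr * zr for zr in z) / reps
```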

8 Sketch Computations [GM98, BCFM98, FKSV99]
The data x_11,…,x_1k, x_21,…,x_2k, …, x_t1,…,x_tk is distributed over t sites; each site sends a short “sketch” (a compression of its share) to the algorithm, which computes an approximation of f(x_11,…,x_tk) from the sketches alone.
–Complexity measure: sketch lengths
Applications:
–Web information retrieval (identifying document similarities [BCFM98])
–Networking (L_p distance [FKSV99])
–Lossy compression, approximate nearest neighbor
(A min-wise hashing code sketch follows below.)
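For the document-similarity application, here is a minimal min-wise hashing sketch in the spirit of [BCFM98]. The hash construction and all parameters are illustrative assumptions, not the paper's; strictly speaking, the agreement probability equals the Jaccard similarity only for min-wise independent permutations, which pairwise-independent linear maps merely approximate.

```python
import hashlib
import random

P = (1 << 61) - 1  # a Mersenne prime; illustrative choice

def _item_hash(item, a, b):
    """Deterministic base hash of an item, composed with a random
    linear map x -> (a*x + b) mod P."""
    x = int.from_bytes(hashlib.sha1(repr(item).encode()).digest()[:8], "big")
    return (a * x + b) % P

def minhash_sketch(items, k=128, seed=0):
    """Each site keeps only the minimum hash value under k shared
    random hash functions (every site must use the same seed).  Two
    sketches agree in a coordinate with probability roughly the
    Jaccard similarity of the underlying sets."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]
    return [min(_item_hash(it, a, b) for it in items) for a, b in params]

def jaccard_estimate(sketch1, sketch2):
    """Fraction of agreeing coordinates estimates |A ∩ B| / |A ∪ B|."""
    return sum(u == v for u, v in zip(sketch1, sketch2)) / len(sketch1)
```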

9 Main Objective
–Develop general lower bound techniques
–Obtain lower bounds for specific functions
–Explore the limitations of the above computational models

10 Thesis Blueprint
[Diagram: information theory and statistical decision theory feed into communication complexity and into the three computational models.]
–Statistical decision theory → sampling computations: lower bounds for general functions [BKS01, B02]
–Information theory → communication complexity: general CC lower bounds [BJKS02b]; one-way and simultaneous CC lower bounds [BJKS02a]
–One-way CC → data stream computations (by reduction)
–Simultaneous CC → sketch computations (by reduction)

11 Sampling Lower Bounds
(with R. Kumar and D. Sivakumar; STOC 2001 and manuscript, 2002)
Combinatorial lower bound [BKS01]:
–bounds the expected query complexity of every function
–tends to be weak
–based on a generalization of Boolean block sensitivity [Nisan 89]
Statistical lower bounds:
–bound the query complexity of symmetric functions
–via Hellinger distance: worst-case query complexity [BKS01]
–via KL distance: expected query complexity [B02]
–tend to be tight
–work by a reduction from statistical hypothesis testing
Information theory lower bound [B02]:
–bounds the worst-case query complexity of symmetric functions
–has better dependence on the domain size

12 Main Idea
An output value is an ε-approximation of f(x) if it lies in the “approximation set” of x; inputs x, y are ε-disjoint if their approximation sets are disjoint.
Main observation: since for every x the algorithm outputs a value in the approximation set of x w.p. 1 − δ, ε-disjoint inputs force distinguishable behavior:
x, y ε-disjoint ⟹ T(x), T(y) are “far” from each other.

13 Main Result
Theorem: for any symmetric f, any ε-disjoint inputs x, y, and any algorithm that (ε,δ)-approximates f:
–Worst-case # of queries = Ω((1/h²(U_x,U_y)) · log(1/δ))
–Expected # of queries = Ω((1/KL(U_x,U_y)) · log(1/δ))
U_x – the uniform query distribution on x (induced by: pick i u.a.r., output x_i).
Hellinger: h²(U_x,U_y) = 1 − Σ_a (U_x(a) · U_y(a))^½
KL: KL(U_x,U_y) = Σ_a U_x(a) · log(U_x(a) / U_y(a))
(The code sketch below computes these quantities for concrete inputs.)
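The distances in the theorem are straightforward to compute for concrete inputs; here is a small Python helper (my illustration; the function names are my own).

```python
import math
from collections import Counter

def induced(x):
    """U_x: pick an index i uniformly at random and output x[i]."""
    counts = Counter(x)
    return {a: c / len(x) for a, c in counts.items()}

def hellinger_sq(p, q):
    """h^2(p, q) = 1 - sum_a sqrt(p(a) * q(a))."""
    return 1.0 - sum(math.sqrt(p.get(a, 0.0) * q.get(a, 0.0))
                     for a in set(p) | set(q))

def kl(p, q):
    """KL(p, q) = sum_a p(a) log(p(a) / q(a)); infinite when p puts
    mass where q has none, as in the definition."""
    if any(a not in q for a in p):
        return math.inf
    return sum(pa * math.log(pa / q[a]) for a, pa in p.items())

def query_bounds(x, y, delta):
    """The two lower bounds of the theorem, up to constant factors."""
    ux, uy = induced(x), induced(y)
    return (math.log(1.0 / delta) / hellinger_sq(ux, uy),  # worst case
            math.log(1.0 / delta) / kl(ux, uy))            # expected
```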

14 Example: Mean
x: a ½+ε fraction of 1s and a ½−ε fraction of 0s; y: a ½−ε fraction of 1s and a ½+ε fraction of 0s. Then h²(U_x,U_y) = KL(U_x,U_y) = O(ε²).
Theorem (originally, [CEG95]): approximating the mean of n numbers in [0,1] to within ε additive error requires Ω((1/ε²) · log(1/δ)) queries.
Other applications: selection functions, frequency moments, extractors and dispersers.
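Spelled out (this step is implicit on the slide; natural logarithms):

```latex
\mathrm{KL}(U_x, U_y)
  = \Big(\tfrac12+\varepsilon\Big)\ln\frac{\tfrac12+\varepsilon}{\tfrac12-\varepsilon}
  + \Big(\tfrac12-\varepsilon\Big)\ln\frac{\tfrac12-\varepsilon}{\tfrac12+\varepsilon}
  = 2\varepsilon \ln\frac{1+2\varepsilon}{1-2\varepsilon}
  = 8\varepsilon^{2} + O(\varepsilon^{4}).
```

Plugging this into the main theorem gives the stated Ω((1/ε²) · log(1/δ)) bound on the expected number of queries.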

15 Proof Outline
1. For symmetric functions, WLOG all queries are uniform without replacement.
2. If the # of queries is at most n^½, we can further assume the queries are uniform with replacement.
3. For any ε-disjoint inputs x, y: an (ε,δ)-approximation of f with k queries yields a hypothesis test of U_x against U_y with error δ and k samples.
4. Hypothesis testing lower bounds: via Hellinger distance (worst-case) and via KL distance (expected) (cf. [Siegmund 85]).

16 Statistical Hypothesis Testing
–A black box contains either P or Q
–The test receives k i.i.d. samples from the black box and has to decide: “P” or “Q”
–Allowed error probability: δ
–Goal: minimize k
(A minimal code sketch of such a test follows below.)
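A minimal sketch of the optimal test for this setup, a likelihood-ratio rule per Neyman-Pearson; finite distributions are represented as Python dicts, and the names are mine.

```python
import math

def likelihood_ratio_test(samples, p, q):
    """Decide "P" or "Q" from k i.i.d. samples by comparing the
    log-likelihoods under the two candidate distributions.  Its error
    probability decays exponentially in k, at a rate governed by the
    distance between P and Q -- which is exactly why the number of
    samples k is the resource the lower bounds target."""
    log_lik_p = sum(math.log(p.get(a, 0.0) or 1e-300)  # guard log(0)
                    for a in samples)
    log_lik_q = sum(math.log(q.get(a, 0.0) or 1e-300)
                    for a in samples)
    return "P" if log_lik_p >= log_lik_q else "Q"
```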

17 Sampling Algorithm ⟹ Hypothesis Test
Given ε-disjoint inputs x, y and a black box containing either U_x or U_y: feed the k i.i.d. samples from the black box to the sampling algorithm as its queries; output “U_x” if the algorithm’s output is an ε-approximation of f(x), and “U_y” otherwise.
(A code sketch of this reduction follows below.)
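The reduction itself is a few lines; a hedged sketch (the names are placeholders, `sampler` is assumed to accept its random queries from the outside — valid here because, for symmetric f, the queries are WLOG uniform — and an additive ε-approximation is assumed):

```python
def sampling_to_hypothesis_test(sampler, f_x, eps, box_samples):
    """Turn an (eps, delta)-approximation algorithm for a symmetric f
    into a hypothesis test for U_x vs. U_y: run the sampler on the
    black box's i.i.d. samples and answer "U_x" iff its output is an
    eps-approximation of f(x).  Correctness on x gives error <= delta
    under U_x; eps-disjointness of x and y gives error <= delta
    under U_y."""
    output = sampler(box_samples)
    return "U_x" if abs(output - f_x) <= eps else "U_y"
```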

18 Lower Bound via Hellinger Distance
A hypothesis test for U_x against U_y with error δ and k samples forces the k-fold product distributions U_x^k and U_y^k to be statistically far apart; the lemma (cf. [Le Cam, Yang 90]) bounds how fast the Hellinger distance can grow with k.
Corollary: k = Ω((1/h²(U_x,U_y)) · log(1/δ)).
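In standard form (my reconstruction of the two displayed inequalities), the facts behind the corollary are the product rule for the Hellinger distance and its relation to the total variation distance V:

```latex
1 - h^{2}\!\big(P^{\otimes k},\, Q^{\otimes k}\big) = \big(1 - h^{2}(P,Q)\big)^{k},
\qquad
V(P,Q)^{2} \;\le\; h^{2}(P,Q)\,\big(2 - h^{2}(P,Q)\big).
```

A δ-error test with k samples implies V(U_x^k, U_y^k) ≥ 1 − 2δ, so the second inequality gives (1 − h²(U_x,U_y))^k ≤ 2δ^½; taking logarithms yields k = Ω((1/h²(U_x,U_y)) · log(1/δ)) whenever h²(U_x,U_y) is bounded away from 1.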

19 Communication Complexity [Yao 79]
Alice holds x ∈ X, Bob holds y ∈ Y; using private randomness $$, they exchange messages to compute f(x,y), where f: X × Y → Z.
R_δ(f) = the randomized CC of f with error δ.

20 Multi-Party Communication
f: X_1 × … × X_t → Z; players P_1,…,P_t hold inputs x_1,…,x_t, respectively, and communicate to compute f(x_1,…,x_t).

21 Example: Set-Disjointness
t-party set-disjointness: P_i gets S_i ⊆ [n], and the players must decide whether the sets are pairwise disjoint.
Theorem [KS87, R90]: R_δ(Disj_2) = Ω(n)
Theorem [AMS96]: R_δ(Disj_t) = Ω(n/t⁴)
Best upper bound: R_δ(Disj_t) = O(n/t)

22 Restricted Communication Models
One-way communication [PS84, Ablayev 93, KNR95]: P_1 → P_2 → … → P_t, and the last player announces f(x_1,…,x_t). One-way CC reduces to data stream computations: a data stream algorithm yields a one-way protocol, so one-way CC lower bounds imply data stream lower bounds.
Simultaneous communication [Yao 79]: each P_i sends a single message to a referee, who outputs f(x_1,…,x_t). Simultaneous CC reduces to sketch computations: a sketch algorithm yields a simultaneous protocol, so simultaneous CC lower bounds imply sketch lower bounds.

23 Example: Disjointness ⟹ Frequency Moments
Input stream: a_1,…,a_m ∈ [n]; for j ∈ [n], f_j = # of occurrences of j in a_1,…,a_m.
k-th frequency moment: F_k(a_1,…,a_m) = Σ_{j∈[n]} (f_j)^k.
Theorem [AMS96]: a space-efficient streaming approximation of F_k would yield a cheap protocol for multi-party set-disjointness.
Corollary: DS(F_k) = n^Ω(1) for k > 5.
Best upper bounds: DS(F_k) = n^O(1) for k > 2; DS(F_k) = O(log n) for k = 0, 1, 2.
(A short code illustration follows below.)
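For concreteness (my illustration, not the slide's): computing F_k exactly is easy in one pass with linear space, which is precisely what the lower bound says cannot be substantially improved for large k.

```python
from collections import Counter

def exact_fk(stream, k):
    """Exact k-th frequency moment F_k = sum_{j in [n]} (f_j)^k,
    computed in a single pass but with Theta(n) counters.  The lower
    bound DS(F_k) = n^Omega(1) says that for large k no streaming
    algorithm can shrink this to sub-polynomial space.  The reduction,
    roughly: if t parties stream their sets one after another,
    pairwise-disjoint sets keep every f_j <= 1, while a common element
    makes some f_j = t, so a good F_k approximation detects
    intersection."""
    freq = Counter(stream)
    return sum(c ** k for c in freq.values())
```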

24 Information Statistics Approach to Communication Complexity
(with T. S. Jayram, R. Kumar, and D. Sivakumar; manuscript, 2002)
A novel lower bound technique for randomized CC, based on statistics and information theory.
Applications:
General CC lower bounds:
–t-party set-disjointness: Ω(n/t²) (improving on [AMS96])
–L_p (solving an open problem of [Saks-Sun 02])
–Inner product
One-way CC lower bounds:
–t-party set-disjointness: Ω(n/t^{1+ε}) for any ε > 0
Space lower bounds in the data stream model:
–frequency moments: n^Ω(1) for k > 2 (proving a conjecture of [AMS96])
–L_p distance

25 Statistical View of Communication Complexity
Π – a δ-error randomized protocol for f: X × Y → Z; Π(x,y) – its distribution over transcripts.
Lemma: for any two input pairs (x,y), (x′,y′) with f(x,y) ≠ f(x′,y′): V(Π(x,y), Π(x′,y′)) ≥ 1 − 2δ.
Proof: by reduction from hypothesis testing.
Corollary: h²(Π(x,y), Π(x′,y′)) ≥ 1 − 2δ^½.

26 Information Cost [Ablayev 93, Chakrabarti et al. 01, Saks-Sun 02]
For a protocol Π that computes f, how much information does Π(x,y) have to reveal about (x,y)?
μ = (X,Y) – a distribution over the inputs of f.
Definition: μ-information cost
–icost_μ(Π) = I(X,Y ; Π(X,Y))
–icost_{μ,δ}(f) = min_Π icost_μ(Π), over δ-error protocols Π
Information cost lower bound: I(X,Y ; Π(X,Y)) ≤ H(Π(X,Y)) ≤ |Π(X,Y)|, so information cost lower-bounds communication.

27 Direct Sum for Information Cost
Decomposable functions: f(x,y) = g(h(x_1,y_1),…,h(x_n,y_n)), where h: X_i × Y_i → {0,1} and g: {0,1}^n → {0,1}.
Example: set-disjointness: Disj_2(x,y) = (x_1 ∧ y_1) ∨ … ∨ (x_n ∧ y_n).
Theorem (direct sum): for appropriately chosen μ, μ′: icost_{μ,δ}(f) ≥ n · icost_{μ′,δ}(h).
Thus a lower bound on icost(h) yields a lower bound on icost(f).

28 Information Cost of Single-Bit Functions
In Disj_2, μ′ = ½μ′_1 + ½μ′_2, where μ′_1 = ½(1,0) + ½(0,0) and μ′_2 = ½(0,1) + ½(0,0).
Lemma 1: for any protocol Π for AND, icost_{μ′}(Π) ≥ Ω(h²(Π(0,1), Π(1,0))).
Lemma 2: h²(Π(0,1), Π(1,0)) = h²(Π(1,1), Π(0,0)).
Corollary 1: icost_{μ′,δ}(AND) ≥ Ω(1 − 2δ^½).
Corollary 2: icost_{μ,δ}(Disj_2) ≥ Ω(n · (1 − 2δ^½)).

29 Proof of Lemma 2
“Rectangle” property of deterministic protocols: for any transcript τ, the set of all (x,y) with Π(x,y) = τ is a “combinatorial rectangle” S × T, where S ⊆ X and T ⊆ Y.
“Rectangle” property of randomized protocols: for all x ∈ X, y ∈ Y there exist functions p_x: {0,1}* → [0,1] and q_y: {0,1}* → [0,1] such that for any possible transcript τ: Pr(Π(x,y) = τ) = p_x(τ) · q_y(τ).
Hence:
h²(Π(0,1), Π(1,0)) = 1 − Σ_τ (Pr(Π(0,1) = τ) · Pr(Π(1,0) = τ))^½
 = 1 − Σ_τ (p_0(τ) · q_1(τ) · p_1(τ) · q_0(τ))^½
 = 1 − Σ_τ (Pr(Π(0,0) = τ) · Pr(Π(1,1) = τ))^½ = h²(Π(0,0), Π(1,1)).

30 Conclusions
Studied the limitations of computing on massive data sets:
–Sampling computations
–Data stream computations
–Sketch computations
Lower bound methodologies are based on:
–Information theory
–Statistical decision theory
–Communication complexity
The lower bound techniques:
–Reveal novel aspects of the models
–Present a “template” for obtaining specific lower bounds

31 Open Problems
Sampling:
–Lower bounds for non-symmetric functions
–Property testing lower bounds
Communication complexity:
–Study the communication complexity of approximations
–Tight lower bound for t-party set-disjointness
–Under what circumstances are one-way and simultaneous communication equivalent?

32 Thank You!

33 Yao’s Lemma [Yao 83]
Definition: the δ-distributional CC, D_{μ,δ}(f), is the complexity of the best deterministic protocol that computes f with error ≤ δ on inputs drawn according to μ.
Yao’s Lemma: R_δ(f) ≥ max_μ D_{μ,δ}(f).
A convenient technique for proving randomized CC lower bounds.

34 Communication Complexity Lower Bounds via Information Theory
(with T. S. Jayram, R. Kumar, and D. Sivakumar; Complexity 2002)
A novel information theory paradigm for proving CC lower bounds.
Applications:
Characterization results (w.r.t. product distributions):
–1-way ≈ simultaneous
–2-party 1-way ≈ t-party 1-way
–VC-dimension characterization of t-party 1-way CC
Optimal lower bounds for simultaneous CC:
–t-party set-disjointness: Ω(n/t)
–Generalized addressing function

35 Information Theory
A sender transmits a message m ∈ M through a noisy channel; the receiver gets r ∈ R.
M – distribution of transmitted messages; R – distribution of received messages.
Goal of the receiver: reconstruct m from r. ε_g – the error probability of a reconstruction function g.
For a Boolean M:
–Fano’s Inequality: for all g, H_2(ε_g) ≥ H(M | R)
–MLE Principle: ε_MLE ≤ H(M | R)
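For reference, the general form of Fano’s inequality (standard; the slide states only the Boolean case) is:

```latex
H(M \mid R) \;\le\; H_2(\varepsilon_g) + \varepsilon_g \log\big(|\mathcal{M}| - 1\big),
```

and for a Boolean M the second term vanishes, giving the form used above.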

36 Information Theory View of Distributional CC
Think of the protocol as a channel: (x,y) is distributed according to μ = (X,Y); “God” transmits f(x,y) to Alice & Bob; Alice & Bob receive the transcript Π(x,y).
Fano’s inequality: for any δ-error protocol Π for f, H_2(δ) ≥ H(f(X,Y) | Π(X,Y)).

37 Simultaneous CC vs. One-Way CC
Theorem: for every product distribution μ = X × Y and every Boolean f,
D_{μ,2H_2(δ)}^{sim}(f) ≤ D_{μ,δ}^{A→B}(f) + D_{μ,δ}^{B→A}(f).
Proof: let A(x) be Alice’s message on x in a δ-error A→B protocol for f, and B(y) Bob’s message on y in a δ-error B→A protocol for f. Construct a simultaneous protocol for f:
–A → referee: A(x); B → referee: B(y)
–The referee outputs MLE(f(X,Y) | A(x), B(y)).

38 Simultaneous CC vs. One-Way CC
Proof (cont.): by the MLE principle,
Pr_μ(MLE(f(X,Y) | A(X),B(Y)) ≠ f(X,Y)) ≤ H(f(X,Y) | A(X),B(Y)).
By Fano, H(f(X,Y) | A(X),Y) ≤ H_2(δ) and H(f(X,Y) | X,B(Y)) ≤ H_2(δ).
Lemma: for independent X, Y,
H(f(X,Y) | A(X),B(Y)) ≤ H(f(X,Y) | A(X),Y) + H(f(X,Y) | X,B(Y)).
⟹ the protocol errs with probability at most 2H_2(δ). □

