Tight Lower Bounds for the Distinct Elements Problem David Woodruff MIT Joint work with Piotr Indyk.

Presentation transcript:

Tight Lower Bounds for the Distinct Elements Problem David Woodruff MIT Joint work with Piotr Indyk

The Problem
Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$
Want $F_0$ = # of distinct elements
Elements in adversarial order
Algorithms given one pass over stream
Goal: Minimum-space algorithm

A Trivial Algorithm
Keep $m$-bit characteristic vector $v$ of stream: $j$ in stream $\Leftrightarrow v_j = 1$
Example: $F_0 = \mathrm{wt}(v) = 5$
Space = $m$
Can we do better?
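
The trivial algorithm can be sketched in a few lines of Python. This is a minimal illustration only; the function name and the example stream are hypothetical and not taken from the slides.

```python
def distinct_elements_exact(stream, m):
    """Exact F_0 using an m-bit characteristic vector (elements lie in 1..m)."""
    v = [0] * (m + 1)      # index 0 is unused; v[j] = 1 iff j appeared in the stream
    for a in stream:
        v[a] = 1
    return sum(v)          # wt(v) = number of distinct elements


# Hypothetical example stream over the universe {1, ..., 10}
stream = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(distinct_elements_exact(stream, m=10))
```

The state is exactly the vector `v`, so the space is m bits regardless of the stream length n.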

Negative Results
Any algorithm computing $F_0$ exactly must use $\Omega(m)$ space [AMS96]
Any deterministic alg. that outputs $x$ with $|F_0 - x| < \varepsilon F_0$ must use $\Omega(m)$ space [AMS96]
What about randomized approximation algorithms?

Rand. Approx. Algorithms for $F_0$
$O(\log\log m/\varepsilon^2 + \log m \log 1/\varepsilon)$ space alg. outputs $x$ with $\Pr[|F_0 - x| < \varepsilon F_0] > 3/4$ [BJKST02]
Lots of hashing tricks
Is this optimal?
Previous lower bounds: $\Omega(\log m)$ [AMS96], $\Omega(1/\varepsilon)$ [Bar-Yossef]
Open Problem of [BJKST02]: GAP: $1/\varepsilon \ll 1/\varepsilon^2$

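Idea Behind Lower Bounds
Alice has $x \in \{0,1\}^m$ and stream $s(x)$; Bob has $y \in \{0,1\}^m$ and stream $s(y)$
Alice runs $(1 \pm \varepsilon)$ $F_0$ algorithm $A$ on $s(x)$ and sends internal state $S$ of $A$ to Bob, who continues $A$ on $s(y)$
Compute $(1 \pm \varepsilon) F_0(s(x) \circ s(y))$ w.p. > 3/4
Idea: If can decide $f(x,y)$ w.p. > 3/4, space used by $A$ at least $f$'s rand. 1-way comm. complexity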
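
The simulation argument can be made concrete with a small Python sketch. It assumes a stand-in exact algorithm in place of a real $(1 \pm \varepsilon)$ approximation algorithm; the class `ExactF0` and the helpers `stream_of` and `one_way_protocol` are hypothetical names introduced only for this illustration.

```python
class ExactF0:
    """Stand-in one-pass algorithm whose internal state is the set of seen elements."""

    def __init__(self, state=frozenset()):
        self.seen = set(state)

    def process(self, stream):
        for a in stream:
            self.seen.add(a)

    def state(self):
        return frozenset(self.seen)

    def estimate(self):
        return len(self.seen)


def stream_of(bits):
    """Any stream whose characteristic vector is `bits` (elements are 1-indexed)."""
    return [j + 1 for j, b in enumerate(bits) if b]


def one_way_protocol(x, y):
    """Alice runs the algorithm on s(x), sends its state; Bob resumes on s(y)."""
    alice = ExactF0()
    alice.process(stream_of(x))
    message = alice.state()          # the only communication: Alice -> Bob
    bob = ExactF0(message)
    bob.process(stream_of(y))
    return bob.estimate()            # F_0 of the concatenated stream


print(one_way_protocol([1, 1, 0, 0], [0, 1, 1, 0]))
```

The message is exactly the algorithm's memory contents, so any lower bound on the one-way communication needed to decide $f$ carries over to the algorithm's space.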

Randomized 1-way comm. complexity
Boolean function $f: X \times Y \to \{0,1\}$
Alice has $x \in X$, Bob has $y \in Y$. Bob wants $f(x,y)$
Only 1 message sent: must be from Alice to Bob
Comm. cost of protocol = expected length of longest message sent over all inputs
$\delta$-error randomized 1-way comm. complexity of $f$, $R_\delta(f)$, is comm. cost of optimal protocol computing $f$ w.p. $\ge 1 - \delta$
How do we lower bound $R_\delta(f)$?

The VC Dimension [KNR]
$F = \{f : X \to \{0,1\}\}$ family of Boolean functions
$f \in F$ is length-$|X|$ bit string
For $S \subseteq X$, shatter coefficient $SC(f_S)$ of $S$ is $|\{f|_S\}_{f \in F}|$ = # distinct bit strings when $F$ restricted to $S$
$SC(F, p) = \max_{S \subseteq X, |S| = p} SC(f_S)$
If $SC(f_S) = 2^{|S|}$, $S$ shattered by $F$
VC Dimension of $F$, $VCD(F)$, = size of largest $S$ shattered by $F$
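
These definitions can be checked by brute force on tiny families. The Python sketch below is illustrative only: it represents each function as a tuple of bits indexed by X = {0, ..., n-1}, and the function names and example family are hypothetical.

```python
from itertools import combinations


def shatter_coefficient(family, S):
    """Number of distinct bit strings obtained by restricting each f in family to S."""
    return len({tuple(f[i] for i in S) for f in family})


def sc(family, n, p):
    """SC(F, p): largest shatter coefficient over all size-p subsets of {0..n-1}."""
    return max(shatter_coefficient(family, S) for S in combinations(range(n), p))


def vc_dimension(family, n):
    """Size of the largest subset of {0..n-1} shattered by the family."""
    best = 0
    for p in range(1, n + 1):
        if sc(family, n, p) == 2 ** p:
            best = p
    return best


# Hypothetical family: all length-4 bit strings of weight at most 1
family = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
print(sc(family, 4, 2), vc_dimension(family, 4))
```

In this family no pair of points is shattered, because the pattern (1, 1) never occurs, while every single point is shattered.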

Shatter Coefficient Theorem
Notation: For $f: X \times Y \to \{0,1\}$, define $f_X = \{ f_x : Y \to \{0,1\} \mid x \in X \}$, where $f_x(y) = f(x,y)$
Theorem [BJKS]: For every $f: X \times Y \to \{0,1\}$ and every $p \ge VCD(f_X)$, $R_{1/4}(f) = \Omega(\log(SC(f_X, p)))$

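The $\Omega(1/\varepsilon)$ Lower Bound [Bar-Yossef]
Alice has $x \in_R \{0,1\}^m$, $\mathrm{wt}(x) = m/2$
Bob has $y \in_R \{0,1\}^m$, $\mathrm{wt}(y) = \varepsilon m$, and:
Either $\mathrm{wt}(x \wedge y) = 0$ (then $f(x,y) = 0$) OR $\mathrm{wt}(x \wedge y) = \varepsilon m$ (then $f(x,y) = 1$)
$R_{1/4}(f) = \Omega(VCD(f_X)) = \Omega(1/\varepsilon)$ [Bar-Yossef]
$s(x)$, $s(y)$ any streams w/ char. vectors $x$, $y$
$f(x,y) = 1 \to F_0(s(x) \circ s(y)) = m/2$
$f(x,y) = 0 \to F_0(s(x) \circ s(y)) = m/2 + \varepsilon m$
$(1 + \varepsilon') m/2 < (1 - \varepsilon')(m/2 + \varepsilon m)$ for $\varepsilon' = \Theta(\varepsilon)$
Hence, can decide $f$ $\to$ $F_0$ alg. uses $\Omega(1/\varepsilon)$ space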
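
The separation condition on this slide can be worked out explicitly. In the derivation below, $\varepsilon'$ denotes the accuracy of the $F_0$ algorithm; that symbol is chosen here for illustration, since the extracted slide text does not preserve the original one.

```latex
\begin{align*}
(1+\varepsilon')\frac{m}{2} < (1-\varepsilon')\left(\frac{m}{2} + \varepsilon m\right)
  &\iff \frac{1+\varepsilon'}{2} < (1-\varepsilon')\left(\frac{1}{2} + \varepsilon\right) \\
  &\iff \varepsilon'(1+\varepsilon) < \varepsilon \\
  &\iff \varepsilon' < \frac{\varepsilon}{1+\varepsilon}.
\end{align*}
```

For $\varepsilon \le 1$ any $\varepsilon' < \varepsilon/2$ satisfies the last inequality, which is one way to see that an accuracy of $\Theta(\varepsilon)$ suffices to tell the two cases apart.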

Our Results
Remainder of talk: $\Omega(1/\varepsilon^2)$ lower bound for $\varepsilon = \Omega(m^{-1/(9+k)})$ for any $k > 0$
$\to$ $O(\log\log m/\varepsilon^2 + \log m \log 1/\varepsilon)$ upper bound almost optimal
IDEA: Reduce from protocol for computing dot product

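The Promise Problem
$X = \{x \in [0,1]^t : \|x\| = 1 \text{ and } \exists\, y \in Y \text{ s.t. } (x,y) \in \Pi\}$
We lower bound $R_{1/4}(f)$ via $SC(f_X, t)$
$t = \Theta(1/\varepsilon^2)$, $Y$ = basis of unit vectors of $\mathbb{R}^t$
Alice has $x \in [0,1]^t$ with $\|x\| = 1$; Bob has $y \in Y$
Promise Problem $\Pi$: either $\langle x,y \rangle = 0$ (then $f(x,y) = 0$) OR $\langle x,y \rangle = 2/t^{1/2}$ (then $f(x,y) = 1$)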

Bounding $SC(f_X, t)$
Theorem: $SC(f_X, t/4) = 2^{\Omega(t)}$
Proof:
1. $\forall\, T \subset Y$ s.t. $|T| = t/4$, put $x_T = (2/t^{1/2}) \cdot \sum_{e \in T} e$
2. Define $X_1 \subset X$ as $X_1 = \{x_T \mid T \subset Y, |T| = t/4\}$
3. Claim: $\forall\, s \in \{0,1\}^t$ w/ $\mathrm{wt}(s) = t/4$, $s \in$ truth tab. of $f_{X_1}$
4. Proof:
   1. Let $s \in \{0,1\}^t$ with 1s in positions $i_1, \ldots, i_{t/4}$
   2. Put $T = \{e_{i_1}, \ldots, e_{i_{t/4}}\}$. $\forall\, e \in T$, $\langle e, x_T \rangle = 2/t^{1/2}$; $\forall\, e \in Y - T$, $\langle e, x_T \rangle = 0$
5. There are $2^{\Omega(t)}$ such $s$
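
The construction in the proof can be checked numerically for a small t. The Python sketch below is illustrative only; the helper names and the choice t = 8 are assumptions made for the example, and basis vectors are identified with their indices 0..t-1.

```python
import math
from itertools import combinations


def x_T(T, t):
    """x_T = (2/sqrt(t)) * sum of the standard basis vectors e_i for i in T."""
    scale = 2 / math.sqrt(t)
    return [scale if i in T else 0.0 for i in range(t)]


def truth_table_row(x, t):
    """f_x on every basis vector e_i: 1 if <x, e_i> = 2/sqrt(t), 0 if <x, e_i> = 0."""
    target = 2 / math.sqrt(t)
    return tuple(1 if abs(x[i] - target) < 1e-12 else 0 for i in range(t))


t = 8  # small hypothetical value; t must be divisible by 4
rows = set()
for T in combinations(range(t), t // 4):
    x = x_T(set(T), t)
    assert abs(sum(c * c for c in x) - 1.0) < 1e-12   # ||x_T|| = 1, so x_T is a valid input
    rows.add(truth_table_row(x, t))

# Every weight-t/4 string appears as a row, so there are C(t, t/4) distinct rows.
print(len(rows), math.comb(t, t // 4))
```

Since $\binom{t}{t/4} \ge 4^{t/4} = 2^{t/2}$, the number of distinct rows is $2^{\Omega(t)}$.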

Bounding $R_{1/4}(f)$
Corollary: $R_{1/4}(f) = \Omega(t) = \Omega(1/\varepsilon^2)$
Reduction: we need protocol computing $f$ with communication = space used by any $(1 \pm \varepsilon)$ $F_0$ approx. alg.

Reduction
Recall: $\langle x,y \rangle = 0$ if $f(x,y) = 0$; $\langle x,y \rangle = 2/t^{1/2}$ if $f(x,y) = 1$
Goal: Reduce separation of $\langle x,y \rangle$ to separation of $F_0(s(x) \circ s(y))$ for streams $s(x)$, $s(y)$ Alice/Bob can derive from $x$, $y$
Use relation: $\|y-x\|^2 = \|y\|^2 + \|x\|^2 - 2\langle x, y \rangle$
$f(x,y) = 0 \to \|y-x\| = 2^{1/2}$
$f(x,y) = 1 \to \|y-x\| < 2^{1/2}(1 - 1/t^{1/2}) = 2^{1/2}(1 - \Theta(\varepsilon))$
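
The two distance values follow from the relation above together with $\|x\| = \|y\| = 1$. A worked version, using the elementary bound $\sqrt{1-u} < 1 - u/2$ for $0 < u \le 1$ (so $t \ge 4$ is assumed):

```latex
\begin{align*}
\|y-x\|^2 &= \|y\|^2 + \|x\|^2 - 2\langle x,y\rangle = 2 - 2\langle x,y\rangle \\
f(x,y)=0:\quad \|y-x\|^2 &= 2
  \;\Longrightarrow\; \|y-x\| = 2^{1/2} \\
f(x,y)=1:\quad \|y-x\|^2 &= 2 - \frac{4}{t^{1/2}} = 2\left(1 - \frac{2}{t^{1/2}}\right) \\
  \|y-x\| &= 2^{1/2}\left(1 - \frac{2}{t^{1/2}}\right)^{1/2}
  < 2^{1/2}\left(1 - \frac{1}{t^{1/2}}\right)
\end{align*}
```

With $t = \Theta(1/\varepsilon^2)$ the term $1/t^{1/2}$ is $\Theta(\varepsilon)$, which gives the stated $2^{1/2}(1 - \Theta(\varepsilon))$.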

Overview of Reduction
Alice has $x \in [0,1]^t$ with $\|x\| = 1$; Bob has $y \in E$
1. Low-distortion embedding $\phi: l_2^t \to l_1^{\mathrm{poly}(t)}$
2. Rational approximation of $\phi(x)$, $\phi(y)$
3. Scale rationals to integers via $s(\cdot)$
4. Convert integer coords to unary to get $\{0,1\}$ vectors $x$, $y$
Alice runs $F_0$ alg. on stream $s(x)$ and passes its state to Bob, who continues on $s(y)$
$F_0(s(x) \circ s(y))$ can decide $f(x,y)$ w.p. $\ge 3/4$

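Embedding $l_2^t$ into $l_1^{\mathrm{poly}(t)}$
A $(1+\gamma)$-distortion embedding $\phi: l_2^t \to l_1^d$ is a mapping s.t. $\forall\, p, q \in l_2^t$, the distances $\|\phi(p) - \phi(q)\|_1$ and $\|p - q\|_2$ agree up to a factor of $(1+\gamma)$
Theorem [FLM77]: $\forall\, \gamma$ $\exists$ a $(1+\gamma)$-distortion embedding $\phi: l_2^t \to l_1^d$ with $d = O(t \cdot (\log 1/\gamma)/\gamma^2)$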

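Embedding $l_2^t$ into $l_1^d$
Alice has $x \in [0,1]^t$ with $\|x\| = 1$; Bob has $y \in E$
Low-distortion embedding $\phi: l_2^t \to l_1^d$ gives $\phi(x)$, $\phi(y)$
Using Theorem [FLM77], Alice/Bob get $\phi(x), \phi(y) \in \mathbb{R}^d$ with $d = O(t \cdot (\log 1/\gamma)/\gamma^2)$
Distortion parameter $\gamma$: specified later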

Rational Approximation
$z = z(t): \mathbb{N} \to \mathbb{N}$; assume $z \ge d$
Approximate each coord. of output of embedding by integer multiple of $1/z$

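Scaling
Let $\tilde\phi(x)$, $\tilde\phi(y)$ be the rational approximations of the embedded points $\phi(x)$, $\phi(y)$
Alice (resp. Bob) multiplies each coord. of $\tilde\phi(x)$ (resp. $\tilde\phi(y)$) by $z$
Obtains $s(\tilde\phi(x))$ (resp. $s(\tilde\phi(y))$)
Claim: coords. are integers in range $[-2z, 2z]$
Proof:
1. $|\tilde\phi(\cdot)| \le |\phi(\cdot)| + d/z \le 2$
2. $|s(\tilde\phi(\cdot))| = z|\tilde\phi(\cdot)|$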
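
The rational approximation and scaling steps can be sketched in Python. This is a sketch under stated assumptions: the slides only say each coordinate is approximated by an integer multiple of 1/z, so rounding to the nearest multiple is a choice made here, and the function names and the example vector are hypothetical.

```python
def rational_approx(v, z):
    """Round each coordinate of v to the nearest integer multiple of 1/z (an assumption)."""
    return [round(c * z) / z for c in v]


def scale_to_integers(v, z):
    """Approximate and scale in one step: the integer numerators of the multiples of 1/z."""
    return [round(c * z) for c in v]


# Hypothetical embedded point with l_1 norm 1, dimension d = 4, and z = 10 >= d
v = [0.31, -0.12, 0.4, 0.17]
z = 10
approx = rational_approx(v, z)
s = scale_to_integers(v, z)

l1_error = sum(abs(a - b) for a, b in zip(v, approx))
print(s, l1_error <= len(v) / z, all(-2 * z <= c <= 2 * z for c in s))
```

Each coordinate moves by at most 1/z, so the total $l_1$ error is at most d/z, which is at most 1 when z ≥ d.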

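Converting to Unary
Here $s(\tilde\phi(x))$, $s(\tilde\phi(y))$ are the scaled integer vectors of Alice and Bob
For $i = 1$ to $d$:
  $j \leftarrow s(\tilde\phi(x))_i$
  Replace $s(\tilde\phi(x))_i$ with $1^{2z+j} 0^{2z-j}$
Bob does same for $s(\tilde\phi(y))$
$x$, $y$ denote new length-$4dz$ bitstrings
$\mathrm{wt}(x) = |s(\tilde\phi(x))|$, $\mathrm{wt}(y) = |s(\tilde\phi(y))|$
$\Delta(x,y) = |s(\tilde\phi(x)) - s(\tilde\phi(y))|$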
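
The unary step can be sketched as follows, checking that the Hamming distance between the bitstrings equals the $l_1$ distance between the integer vectors. The function names and the example vectors are hypothetical.

```python
def to_unary(s, z):
    """Encode each integer j in [-2z, 2z] as 1^(2z+j) 0^(2z-j) and concatenate the blocks."""
    bits = []
    for j in s:
        assert -2 * z <= j <= 2 * z
        bits.extend([1] * (2 * z + j) + [0] * (2 * z - j))
    return bits


def hamming(x, y):
    """Number of positions where the two bitstrings differ."""
    return sum(a != b for a, b in zip(x, y))


# Hypothetical scaled integer vectors with d = 3 coordinates and z = 3
z = 3
sx = [2, -1, 0]
sy = [-3, 4, 0]
x, y = to_unary(sx, z), to_unary(sy, z)

l1 = sum(abs(a - b) for a, b in zip(sx, sy))
print(len(x), hamming(x, y), l1)
```

Within one block the two encodings differ exactly on the positions between the two run lengths, which is why each coordinate contributes $|j - j'|$ to the Hamming distance.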

Reducing $\Delta(x,y)$ to $F_0$
Alice (Bob) chooses stream $a_x$ ($a_y$) with char. vector $x$ ($y$)
Lemma: If $\ell_1 < \mathrm{wt}(x), \mathrm{wt}(y) < \ell_2$, then $\ell_1 + \Delta(x,y)/2 < F_0(a_x \circ a_y) < \ell_2 + \Delta(x,y)/2$
Follows from fact: $F_0(a_x \circ a_y) = \mathrm{wt}(x \vee y)$
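
The fact behind the lemma can be checked directly: for bit vectors, $\mathrm{wt}(x \vee y) = (\mathrm{wt}(x) + \mathrm{wt}(y) + \Delta(x,y))/2$, where $\Delta$ is the Hamming distance. The Python sketch below uses hypothetical helper names and example vectors.

```python
def stream_of(bits):
    """Any stream whose characteristic vector is `bits` (elements are 1-indexed)."""
    return [j + 1 for j, b in enumerate(bits) if b]


def f0(stream):
    """Number of distinct elements in the stream."""
    return len(set(stream))


# Hypothetical characteristic vectors
x = [1, 1, 0, 1, 0, 0]
y = [0, 1, 1, 0, 0, 1]

wt_x, wt_y = sum(x), sum(y)
delta = sum(a != b for a, b in zip(x, y))          # Hamming distance
wt_or = sum(a | b for a, b in zip(x, y))           # wt(x OR y)
distinct = f0(stream_of(x) + stream_of(y))         # F_0 of the concatenated stream

print(distinct, wt_or, (wt_x + wt_y + delta) // 2)
```

Bounding both weights between the same two values and substituting into this identity gives the two-sided bound on $F_0$ in terms of $\Delta(x,y)/2$.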

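Reducing $\Delta(x,y)$ to $F_0$
Use lemma to show a gap in $F_0(a_x \circ a_y)$ between the cases $f(x,y) = 0$ and $f(x,y) = 1$
Set embedding distortion $\gamma = \Theta(\varepsilon)$, $z = \Theta(1/\varepsilon^5 \cdot \log 1/\varepsilon)$ so that two cases distinguished by $(1 \pm \Theta(\varepsilon))$ $F_0$ alg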

Conclusions
$a_x$, $a_y$ must be in universe of size $\ge 4zd = \Omega(\log(1/\varepsilon)/\varepsilon^9)$
Reduction only valid if $4zd \le m$
$\Omega(1/\varepsilon^2)$ bound for $\varepsilon = \Omega(m^{-1/(9+k)})$ $\forall\, k > 0$
Recently lower bound improved to: $\Omega(1/\varepsilon^2)$ for $\varepsilon \ge m^{-1/2}$, which is optimal
Find set of vectors directly in Hamming space via involved prob. method argument