Limits of Data Structures


Limits of Data Structures Mihai Pătraşcu …until Aug’08

MIT: The beginning (freshman year, 2002)

“What problem could I work on?” “P vs. NP”
… didn’t quite solve it :)

The partial sums problem

Here’s a small problem. Maintain an array A[n] under:
    update(i, Δ): A[i] += Δ
    sum(i): return A[0] + … + A[i]

Textbook solution: “augmented” binary search trees; running time O(lg n) / operation.

[figure: a balanced tree over A[0], …, A[7], each internal node storing the sum of its subtree; update(2, Δ) and sum(6) each follow one root-to-leaf path]
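
For concreteness, here is a runnable sketch of a textbook O(lg n) solution. It uses a Fenwick (binary indexed) tree rather than the augmented BST pictured on the slide, but it supports the same two operations in the same O(lg n) bounds; the class name and interface are illustrative, not from the talk:

    class PartialSums:
        """One textbook O(lg n)-per-operation solution, as a runnable
        sketch; the slide's augmented BST achieves the same bounds."""
        def __init__(self, n):
            self.n = n
            self.tree = [0] * (n + 1)        # 1-indexed internally

        def update(self, i, delta):          # A[i] += delta
            i += 1
            while i <= self.n:
                self.tree[i] += delta
                i += i & -i                  # next node covering index i

        def sum(self, i):                    # return A[0] + ... + A[i]
            i += 1
            s = 0
            while i > 0:
                s += self.tree[i]
                i -= i & -i                  # strip the lowest set bit
            return s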

Now show Ω(lg n) is needed… a big open problem!

Maintain an array A[n] under:
    update(i, Δ): A[i] += Δ
    sum(i): return A[0] + … + A[i]

Fact: Ω(lg n) was not known for any problem. (So, you want to show SAT takes 2^Ω(n) time??)

See also: [Fredman JACM ’81] [Fredman JACM ’82] [Yao SICOMP ’85] [Fredman, Saks STOC ’89] [Ben-Amram, Galil FOCS ’91] [Hampapuram, Fredman FOCS ’93] [Chazelle STOC ’95] [Husfeldt, Rauhe, Skyum SWAT ’96] [Husfeldt, Rauhe ICALP ’98] [Alstrup, Husfeldt, Rauhe FOCS ’98]

Results

[P., Demaine SODA’04]  first Ω(lg n) lower bound (for partial sums)
[P., Demaine STOC’04]  Ω(lg n) for many interesting problems
[P., Tarniţă ICALP’05]  Ω(lg n) via epoch arguments (Best Student Paper)

E.g. Ω(lg n) holds for a sequence supporting both:
* list operations: concatenate, split, …
* array operations: index

Think Python:
    >>> a = [0, 1, 2, 3, 4]
    >>> a[2:2] = [9, 9, 9]
    >>> a
    [0, 1, 9, 9, 9, 2, 3, 4]
    >>> a[5]
    2

What kind of “lower bound”? Lower bounds you can trust.™

Model of computation ≈ real computers:
* memory words of w > lg n bits (pointers = words)
* random access to memory
* any operation on CPU registers (arithmetic, bitwise, …)

Memory is the bottleneck, so it suffices to prove a lower bound on the # of memory accesses.

Begin Proof

A textbook algorithm deserves a textbook lower bound.

The hard instance

Maintain an array A[n] under:
    update(i, Δ): A[i] += Δ
    sum(i): return A[0] + … + A[i]

The hard instance: π = random permutation
    for t = 1 to n:
        Δt = rand()
        update(π(t), Δt)
        query: sum(π(t))

[figure: the updates Δ1, Δ2, …, Δ16 placed along the time axis]
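
The hard instance as code (a sketch; it works with the PartialSums class above, or any object with the same two methods):

    import random

    def hard_instance(ds, n):
        """Drive a partial-sums structure through the hard sequence:
        updates at positions given by a random permutation, each
        immediately followed by a prefix-sum query at that position."""
        pi = random.sample(range(n), n)      # pi = random permutation
        for t in range(n):
            delta = random.getrandbits(32)   # Delta_t = rand()
            ds.update(pi[t], delta)
            ds.sum(pi[t])                    # query: sum(pi(t))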

Communication ≈ # memory locations

[figure: the time axis with updates Δ1, Δ2, …; a Mac executes operations t = 5, …, 8, then a PC executes t = 9, …, 12]

How can the Mac help the PC run t = 9, …, 12?

Communication ≈ # memory locations
* written during t = 5, …, 8
* read during t = 9, …, 12

“Negligible additional communication”

Mac begins by sending a Bloom filter of the memory locations it has written.

PC: “give me Mem[0x73A2]”
Mac: “Dude, it wasn’t written after t ≥ 5” (so the PC already knows its contents)

Communication ≈ # memory locations written during t = 5, …, 8 and read during t = 9, …, 12, plus negligible additional communication.

(“I’m one of the only people on the planet who think Bloom filters are cool due to their theoretical applications, not due to their practical applications” :) )
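
A minimal Bloom filter sketch of the trick described here (my own illustration; the parameters are arbitrary). The Mac inserts every address it wrote and ships the m-bit array; the PC tests each address before asking for its contents. There are no false negatives, and false positives only cost a little extra communication:

    class BloomFilter:
        def __init__(self, m=1 << 16, k=4):
            self.m, self.k = m, k            # m bits, k hash functions
            self.bits = bytearray(m)

        def _positions(self, addr):
            # k cheap hashes; fine for a sketch (int hashes are deterministic)
            return (hash((addr, i)) % self.m for i in range(self.k))

        def add(self, addr):                 # Mac: record a written address
            for p in self._positions(addr):
                self.bits[p] = 1

        def might_contain(self, addr):       # PC: test before requesting
            return all(self.bits[p] for p in self._positions(addr))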

How much information needs to be transferred?

[figure: timeline; the queries of the second period return prefix sums such as Δ1, then Δ1+Δ5+Δ3, then Δ1+Δ5+Δ3+Δ7+Δ2, then Δ1+Δ5+Δ3+Δ7+Δ2+Δ8+Δ4]

From these answers, the PC can recover at least Δ5, Δ5+Δ7, and Δ5+Δ7+Δ8 (by subtracting the Δ’s it knows itself). These are 3 independent random values, hence incompressible: at least 3 words must be transferred.

The general principle

Lower bound = # down arrows (values written during one period and read during the next).

How many down arrows, in expectation? Each of the 2k-1 relevant update/query pairs contributes one with probability ½ ∙ ½:

(2k-1) ∙ Pr[update falls in the first period] ∙ Pr[query falls in the second period] = (2k-1) ∙ ½ ∙ ½ = Ω(k)

[figure: two adjacent periods of k operations each]

Recap

Communication = # memory locations
* written during the yellow period
* read during the pink period

=> # memory locations written during the yellow period and read during the pink period = Ω(k), i.e. the communication between periods of k items is Ω(k).

Putting it all together

[figure: a binary tree over the time axis; the communication through each node is Ω(n/2) at the root, Ω(n/4) at each of its two children, Ω(n/8) at the next four nodes, …]

Every load instruction is counted exactly once, at lowest_common_ancestor(write time, read time). Each level of the tree contributes Ω(n/2) in total, and there are lg n levels:

total = Ω(n lg n)
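
A quick numeric sanity check of that sum (my own, for n = 2^16): a node whose two child periods contain k operations each contributes Ω(k), and there are n/(2k) such nodes per level:

    n = 1 << 16
    per_level = {k: (n // (2 * k)) * k for k in (1 << i for i in range(16))}
    assert all(c == n // 2 for c in per_level.values())  # n/2 per level
    total = sum(per_level.values())                      # = (n/2) * lg n
    print(total, (n // 2) * 16)                          # both 524288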

Q.E.D.

Augmented binary search trees are optimal: the first “Ω(lg n)” for any dynamic data structure.

How about static data structures?

“Predecessor search”: preprocess T = { n numbers }; given q, find max { y є T | y < q }. Application: packet forwarding.

“2D range counting”: preprocess T = { n points in 2D }; given a rectangle R, count |T ∩ R|. Application:
    SELECT count(*) FROM employees
    WHERE salary <= 70000 AND startdate <= 1998
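
For concreteness, a minimal static sketch of predecessor search (my own illustration): a sorted array plus binary search gives O(lg n), the baseline that the lower bounds below are measured against; van Emde Boas-style structures achieve O(lg w) instead:

    import bisect

    def predecessor(sorted_T, q):
        """max { y in T | y < q }, or None if no such element exists."""
        i = bisect.bisect_left(sorted_T, q)  # first index where q could go
        return sorted_T[i - 1] if i > 0 else None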

Lower bounds, pre-2006

Approach: communication complexity. View a query as a conversation between the querier and a database of size S: per round, the querier sends lg S bits (a cell address) and receives 1 word back.

But then what’s the difference between S = O(n) and S = O(n²)? Only lg n vs lg(n²) = 2 lg n: a constant factor.

Between space S = O(n) and S = poly(n):
* the lower bound changes by O(1)
* the upper bound changes dramatically: with space S = O(n²), precompute all answers and get query time 1

First separation between space S = O(n) and S = poly(n) [P., Thorup STOC’06]

First separation between space S = O(n) and S = poly(n)

Processor → memory bandwidth:
* one processor: lg S bits per probe
* k processors sharing the memory: lg (S choose k) ≈ k ∙ lg(S/k) bits, i.e. amortized lg(S/k) per processor

Amortizing across many processors saves bandwidth:

                 S = O(n)   S = O(n²)
    k = 1        lg n       2 lg n
    k = n/lg n   lglg n     ~ lg n
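
A numeric check of the approximation lg(S choose k) ≈ k lg(S/k), which holds up to a lower-order +O(k) term (the values of S and k below are arbitrary):

    from math import comb, log2

    S, k = 10**6, 10**3
    exact = log2(comb(S, k))     # lg (S choose k): total bits for k processors
    approx = k * log2(S / k)     # ~ k lg(S/k), i.e. lg(S/k) per processor
    print(exact, approx)         # ~11400 vs ~9966: equal up to +O(k)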

Since then…

* predecessor search [P., Thorup STOC’06] [P., Thorup SODA’07]
* searching with wildcards [P., Thorup FOCS’06]
* 2D range counting [P. STOC’07]
* range reporting [Karpinski, Nekrich, P. 2008]
* nearest neighbor (LSH) [2008?]

Packet Forwarding / Predecessor Search

Preprocess n prefixes of ≤ w bits:
* make a hash table H with all prefixes of the prefixes
* |H| = O(n∙w); can be reduced to O(n)

Given a w-bit IP, find the longest matching prefix:
* binary search for the longest ℓ such that IP[0:ℓ] є H
* running time: O(lg w)

[van Emde Boas FOCS’75] [Waldvogel, Varghese, Turner, Plattner SIGCOMM’97] [Degermark, Brodnik, Carlsson, Pink SIGCOMM’97] [Afek, Bremler-Barr, Har-Peled SIGCOMM’99]
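
A runnable sketch of this scheme (my own illustration; the function names are hypothetical). Binary search on the prefix length works because H is closed under taking prefixes, so the matching lengths form a contiguous range; the bookkeeping that maps the best length back to an actual routing entry is omitted, as on the slide:

    def build_table(prefixes):
        # Hash table H with every prefix of every routing prefix
        # (the O(n*w)-space variant; the O(n)-space refinement is omitted).
        H = set()
        for p in prefixes:                   # p is a bit-string like '1011'
            for l in range(1, len(p) + 1):
                H.add(p[:l])
        return H

    def longest_match_length(H, ip_bits, w):
        # Binary search on the length: O(lg w) hash lookups.
        lo, hi, best = 1, w, 0
        while lo <= hi:
            mid = (lo + hi) // 2
            if ip_bits[:mid] in H:
                best, lo = mid, mid + 1      # a table prefix this long exists
            else:
                hi = mid - 1                 # try shorter
        return best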

Predecessor Search: Timeline

* after [van Emde Boas FOCS’75]: “… O(lg w) has to be tight!”
* [Beame, Fich STOC’99]: slightly better bound with O(n²) space; “… must improve the algorithm for O(n) space!”
* [P., Thorup STOC’06]: tight Ω(lg w) for space O(n polylg n)!

Lower Bound Creed

* stay relevant to broad computer science (talk about binary search trees, packet forwarding, range queries, nearest neighbor, …)
* never bow before the big problems (first Ω(lg n) bound; first separation between space O(n) and poly(n); …)
* strive for the elegant solution

Change of topic: Quad-trees

* excellent for “nice” inputs (small aspect ratio)
* in the worst case, can have prohibitive size: even infinite (??)
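
A tiny illustration of why the size can blow up (hypothetical helper, not from the talk): two points at distance d are not separated until depth ≈ lg(1/d), which is unbounded for real coordinates and at most w for w-bit coordinates:

    def depth_to_separate(p, q):
        """Smallest quad-tree depth at which p and q (distinct points in
        the unit square) land in different cells of the 2^-depth grid."""
        depth = 0
        while True:
            cell = lambda pt: tuple(int(c * (1 << depth)) for c in pt)
            if cell(p) != cell(q):
                return depth
            depth += 1

    # two points at distance ~1e-7 force depth ~ lg(1/d) ~ 24
    print(depth_to_separate((0.1000001, 0.5), (0.1000002, 0.5)))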

Quad-trees (est. 1992)

Big theoretical problem: use bounded precision in geometry, as in 1D (hashing, radix sort, van Emde Boas, …).

[P. FOCS’06] [Chan FOCS’06]: a “quad-tree” of guaranteed linear size.

Theory → Practice

* point location in O(√lg u) [P. FOCS’06] [Chan FOCS’06]
* n∙2^O(√lglg n) time for 3D convex hull, 2D Voronoi, 2D Euclidean MST, triangulation with holes, line-segment intersection [Chan, P. STOC’07]
* dynamic convex hull [Demaine, P. SoCG’07]

Other Directions…

* High-dimensional geometry: [Andoni, Indyk, P. FOCS’06] [Andoni, Croitoru, P. 2008]
* Streaming algorithms: [Chakrabarti, Jayram, P. SODA’08]
* Dynamic optimality: [Demaine, Harmon, Iacono, P. FOCS’04] + manuscript 2008
* Distributed source coding: [Adler, Demaine, Harvey, P. SODA’06]
* Dynamic graph algorithms: [P., Thorup FOCS’07] [Chan, P., Roditty 2008]
* Hashing: [Mortensen, Pagh, P. STOC’05] [Baran, Demaine, P. WADS’05] [Demaine, M.a.d.H., Pagh, P. LATIN’06]

Questions?

Distributed source coding (I)

[figure: two sensors, one observing x, the other observing y]

x, y correlated, i.e. H(x, y) << H(x) + H(y).

Huffman coding: sensor 1 sends H(x) bits, sensor 2 sends H(y) bits.
Goal: sensor 1 + sensor 2 send only H(x, y) bits in total.
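
A toy example of such correlation (my own illustration, not from the talk): x is a fair coin and y equals x except with probability 1%; then H(x) + H(y) = 2 bits while H(x, y) ≈ 1.08 bits:

    from math import log2

    def H(dist):
        # Shannon entropy in bits of a probability distribution
        return -sum(p * log2(p) for p in dist if p > 0)

    p_x = [0.5, 0.5]                         # x is a fair bit
    p_xy = [0.495, 0.005, 0.005, 0.495]      # joint over (x, y): 1% flips
    print(H(p_x) + H(p_x))                   # H(x) + H(y) = 2.0  (H(y) = H(x))
    print(H(p_xy))                           # H(x, y) ~ 1.08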

Distributed source coding (II)

Goal: sensor 1 + sensor 2 send H(x, y) bits.

Slepian-Wolf 1973:
* achievable, with unidirectional communication
* channel model (an infinite stream of i.i.d. x, y)

Adler-Maggs FOCS’98:
* achievable for just one sample
* bidirectional communication; needs i rounds with probability 2^(-i)

Adler-Demaine-Harvey-P. SODA’06: any protocol will need i rounds with probability 2^(-O(i∙lg i))

Distributed source coding (III)

x, y correlated, i.e. H(x, y) << H(x) + H(y): e.g. small Hamming distance, small edit distance, etc.?

Connections: network coding; high-dimensional geometry.