Sublinear Algorithms for Big Data


Sublinear Algorithms for Big Data: Exam Problems
Grigory Yaroslavtsev
http://grigory.us

Problem 1: Modified Chernoff Bound

Problem: Derive Chernoff Bound #2 from Chernoff Bound #1. Explain your answer in detail.

(Chernoff Bound #1) Let $X_1, \dots, X_t$ be independent and identically distributed random variables with range $[0, 1]$ and expectation $\mu$. Then if $X = \frac{1}{t}\sum_i X_i$ and $1 > \delta > 0$,
$$\Pr\big[\,|X - \mu| \ge \delta\mu\,\big] \le 2\exp\left(-\frac{t\mu\delta^2}{3}\right).$$

(Chernoff Bound #2) Let $X_1, \dots, X_t$ be independent and identically distributed random variables with range $[0, c]$ and expectation $\mu$. Then if $X = \frac{1}{t}\sum_i X_i$ and $1 > \delta > 0$,
$$\Pr\big[\,|X - \mu| \ge \delta\mu\,\big] \le 2\exp\left(-\frac{t\mu\delta^2}{3c}\right).$$
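
As an illustration only (not part of the problem), here is a minimal Python sketch that checks Chernoff Bound #1 empirically for i.i.d. Bernoulli($\mu$) variables; the Bernoulli choice and all parameter values are assumptions made purely for the demo.

```python
# Illustration only: Monte Carlo sanity check of Chernoff Bound #1 for
# i.i.d. Bernoulli(mu) variables (an assumed distribution; any [0, 1]-valued
# distribution with expectation mu would do).
import numpy as np

rng = np.random.default_rng(0)
t, mu, delta, trials = 200, 0.3, 0.25, 100_000

samples = rng.binomial(1, mu, size=(trials, t))    # X_1, ..., X_t per trial
X = samples.mean(axis=1)                           # empirical mean X

empirical = np.mean(np.abs(X - mu) >= delta * mu)  # LHS: Pr[|X - mu| >= delta*mu]
bound = 2 * np.exp(-t * mu * delta**2 / 3)         # RHS of Chernoff Bound #1

print(f"empirical probability: {empirical:.4f}")
print(f"Chernoff bound:        {bound:.4f}")       # should upper-bound the line above
```

Varying `t`, `mu`, or `delta` shows how the bound decays with $t\mu\delta^2$, which is also the quantity to track when deriving Bound #2 by rescaling.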

Problem 2: Modified Chebyshev

Problem: Derive Chebyshev Inequality #2 from Chebyshev Inequality #1. Explain your answer in detail.

(Chebyshev Inequality #1) For every $c > 0$:
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge c\sqrt{\mathrm{Var}[X]}\,\big] \le \frac{1}{c^2}.$$

(Chebyshev Inequality #2) For every $c' > 0$:
$$\Pr\big[\,|X - \mathbb{E}[X]| \ge c'\,\mathbb{E}[X]\,\big] \le \frac{\mathrm{Var}[X]}{\big(c'\,\mathbb{E}[X]\big)^2}.$$
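
Again purely as an illustration, the sketch below checks Chebyshev Inequality #2 numerically for an exponential random variable; the distribution and the values of $c'$ are arbitrary choices for the demo.

```python
# Illustration only: numerical check of Chebyshev Inequality #2 on an
# exponential random variable (an assumed distribution for the demo).
import numpy as np

rng = np.random.default_rng(1)
X = rng.exponential(scale=2.0, size=1_000_000)        # E[X] = 2, Var[X] = 4
mean, var = X.mean(), X.var()

for c_prime in (0.5, 1.0, 2.0):
    lhs = np.mean(np.abs(X - mean) >= c_prime * mean)  # Pr[|X - E X| >= c' E X]
    rhs = var / (c_prime * mean) ** 2                  # Var[X] / (c' E[X])^2
    print(f"c'={c_prime}: empirical {lhs:.4f} <= bound {rhs:.4f}")
```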

Problem 3: Sparse Recovery Error

Definition (frequency vector): $f_i$ = frequency of $i$ in the stream = number of occurrences of value $i$; $f = \langle f_1, \dots, f_n \rangle$.

Sparse Recovery: find $g$ that minimizes $\|f - g\|_1$ among all $g$ with at most $k$ non-zero entries.

Definition: $\mathrm{Err}^k(f) = \min_{g \colon \|g\|_0 \le k} \|f - g\|_1$.

Problem: Show that $\mathrm{Err}^k(f) = \sum_{i \notin S} f_i$, where $S$ is the set of indices of the $k$ largest $f_i$. Explain your answer formally and in detail.
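
For intuition (not required for the solution), the following sketch compares the brute-force definition of $\mathrm{Err}^k(f)$ with the top-$k$ formula on a tiny hypothetical frequency vector; `err_bruteforce` and the example vector are made up for the demo.

```python
# Illustration only: on a tiny made-up frequency vector, the best k-sparse
# approximation error in l_1 equals the sum of frequencies outside the top k.
from itertools import combinations

f = [9, 1, 4, 0, 7, 2, 3]          # hypothetical non-negative frequency vector
k = 2

def err_bruteforce(f, k):
    """Err_k(f) by brute force over all supports of size <= k (tiny n only)."""
    n = len(f)
    best = float("inf")
    for r in range(k + 1):
        for support in combinations(range(n), r):
            # the optimal g on a fixed support matches f there and is 0 elsewhere
            best = min(best, sum(f[i] for i in range(n) if i not in support))
    return best

top_k = sorted(range(len(f)), key=lambda i: f[i], reverse=True)[:k]
closed_form = sum(f[i] for i in range(len(f)) if i not in set(top_k))

print(err_bruteforce(f, k), closed_form)   # both print 10 for this vector
```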

Problem 4: Approximate Median Value

Stream: $m$ elements $x_1, \dots, x_m$ from the universe $[n] = \{1, 2, \dots, n\}$. Let $S = \{x_1, \dots, x_m\}$ (all distinct) and let $\mathrm{rank}(y) = |\{x \in S : x \le y\}|$.

Median: $M = x_i$, where $\mathrm{rank}(x_i) = \lfloor m/2 \rfloor + 1$ ($m$ odd).

Algorithm: return $y$ = the median of a sample of size $t$ taken from $S$ (with replacement).

Problem: Does this algorithm give a 10% approximate value of the median with probability $\ge \frac{2}{3}$, i.e. a $y$ such that $M - \frac{n}{10} < y < M + \frac{n}{10}$, if $t = o(n)$? Explain your answer.
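
The following sketch, with all parameters chosen arbitrarily, simulates the sampling algorithm and estimates the success probability for one fixed setting of $n$, $m$ and $t$; it is only an experiment and is no substitute for the asymptotic argument the problem asks for.

```python
# Illustration only: empirically estimate how often the sample median lands
# within n/10 of the true median. All parameters below are assumed demo values.
import random

n, m, t, trials = 1_000, 501, 31, 2_000
random.seed(0)

hits = 0
for _ in range(trials):
    S = random.sample(range(1, n + 1), m)          # m distinct elements from [n]
    M = sorted(S)[m // 2]                          # true median (m odd)
    sample = [random.choice(S) for _ in range(t)]  # t draws with replacement
    y = sorted(sample)[t // 2]                     # median of the sample (t odd)
    hits += (M - n / 10 < y < M + n / 10)

print(f"fraction within +/- n/10 of the median: {hits / trials:.3f}")
```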

Problem 5: Lower bound on $F_k$

Definition (frequency vector): $f_i$ = frequency of $i$ in the stream = number of occurrences of value $i$; $f = \langle f_1, \dots, f_n \rangle$.

Definition (frequency moment): $F_k = \sum_i f_i^k$.

Problem: Show that for every integer $k \ge 1$:
$$F_k \ge n \left(\frac{m}{n}\right)^k.$$

Hint: the worst case is $f_1 = \dots = f_n = \frac{m}{n}$. Use convexity of $g(x) = x^k$.
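
As a quick sanity check (not part of the problem), the sketch below computes $F_k$ for a random stream and compares it with $n(m/n)^k$; the stream and all parameters are arbitrary demo choices.

```python
# Illustration only: check F_k >= n * (m/n)^k on a random frequency vector.
import random

random.seed(0)
n, m, k = 50, 1_000, 3
stream = [random.randrange(n) for _ in range(m)]

f = [0] * n
for x in stream:
    f[x] += 1                      # f_i = number of occurrences of value i

F_k = sum(fi ** k for fi in f)     # frequency moment F_k = sum_i f_i^k
lower = n * (m / n) ** k           # claimed lower bound n * (m/n)^k

print(F_k, lower, F_k >= lower)    # equality would need all f_i equal to m/n
```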

Problem 6: Approximate MST Weight

Let $G$ be a weighted graph in which all edge weights are integers between $1$ and $W$.

Definition: let $n^{(i)}$ be the number of connected components after removing all edges of weight greater than $(1+\epsilon)^i$.

Problem: For some constant $C > 0$, show the following bounds on the weight $w(MST)$ of the minimum spanning tree:
$$w(MST) \le \sum_{i=0}^{\lceil \log_{1+\epsilon} W \rceil} \epsilon (1+\epsilon)^i \, n^{(i)} \le (1 + C\epsilon)\, w(MST).$$
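
To experiment with the quantities appearing in the bound, the following sketch computes the exact MST weight (via Kruskal) and the component counts $n^{(i)}$ for a small hypothetical graph; the graph, $W$, and $\epsilon$ are made-up values, and the code makes no claim about the inequalities themselves.

```python
# Illustration only: for a small made-up graph, compute the exact MST weight
# and the weighted sum of component counts n^(i) (components after removing
# edges of weight > (1+eps)^i), so the two quantities can be compared by hand.
import math

# (u, v, weight); weights are integers in [1, W]
edges = [(0, 1, 1), (1, 2, 3), (2, 3, 1), (3, 4, 7), (0, 4, 2), (1, 3, 5)]
num_vertices, W, eps = 5, 7, 0.1

def count_components(threshold):
    """Connected components of the subgraph keeping edges of weight <= threshold."""
    parent = list(range(num_vertices))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    comps = num_vertices
    for u, v, w in edges:
        if w <= threshold and find(u) != find(v):
            parent[find(u)] = find(v)
            comps -= 1
    return comps

# Exact MST weight via Kruskal (the toy graph above is connected).
w_mst, parent = 0, list(range(num_vertices))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for u, v, w in sorted(edges, key=lambda e: e[2]):
    if find(u) != find(v):
        parent[find(u)] = find(v)
        w_mst += w

r = math.ceil(math.log(W, 1 + eps))
weighted_sum = sum(eps * (1 + eps) ** i * count_components((1 + eps) ** i)
                   for i in range(r + 1))
print("w(MST) =", w_mst, " sum_i eps*(1+eps)^i * n^(i) =", round(weighted_sum, 2))
```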

Evaluation

You can work in groups of up to 3 people; each group member lists all the others on the submission.

Deadline: September 1st, 2014, 23:59 GMT.
Submissions in English to grigory@grigory.us.
Submission title (once): Exam + Space + “Your Name”.
Question title: Question + Space + “Your Name”.
Submission format: PDF, preferably produced from LaTeX.

Point system:
Course credit = 2 points
Problems 1-3 = 0.5 points each
Problem 4 = 1 point
Problems 5-6 = 2 points each