Lecture 24: MapReduce Algorithms Wrap-up


Admin
PS2-4 solutions
Project presentations next week
– 20 min presentation/team
– 10 teams => 3 days
– 3rd time: Fri at 3-4:30pm
– sign-up sheet online
Today:
– MapReduce algorithms
– Wrap-up

Computational Model

Model Constraints

Problem 1: sorting

Parallel algorithms from 80-90’s

Graph problems: connectivity VS

Geometric Graphs

Results: MST & EMD algorithms [A-Nikolov-Onak-Yaroslavtsev’14]

Framework: Solve-And-Sketch
Partition the space hierarchically in a “nice way”
In each part:
– Compute a pseudo-solution for the local view
– Sketch the pseudo-solution using small space
– Send the sketch to be used in the next level/round

MST algorithm: attempt 1
Partition the space hierarchically in a “nice way”: quad trees!
In each part:
– Compute a pseudo-solution for the local view: local MST
– Sketch the pseudo-solution using small space
– Send the sketch to be used in the next level/round: send any point as a representative
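A minimal Python sketch of this first attempt, assuming distinct 2-D points that lie inside a square cell. The names `local_mst` and `solve_and_sketch`, the `leaf_size` cutoff, and the `min_size` recursion guard are illustrative choices, not taken from the lecture. Each quad-tree cell solves its local problem, then passes up a single arbitrary representative; the parent connects the representatives of its children.

```python
import itertools
import math


def local_mst(points, ids):
    """Kruskal restricted to the points with the given ids; returns (i, j) edges."""
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(ids, 2)
    )
    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two different components
            parent[ri] = rj
            tree.append((i, j))
    return tree


def solve_and_sketch(points, ids, x0, y0, size, leaf_size=4, min_size=1e-9):
    """Return (edges, representative) for the non-empty id list inside the cell.

    The cell is the square with lower-left corner (x0, y0) and side `size`.
    """
    if len(ids) <= leaf_size or size <= min_size:
        # Small cell: solve locally, send any point as the representative.
        return local_mst(points, ids), ids[0]

    half = size / 2
    children = {}
    for i in ids:
        x, y = points[i]
        key = (x >= x0 + half, y >= y0 + half)  # which quadrant the point is in
        children.setdefault(key, []).append(i)

    edges, reps = [], []
    for (right, top), child_ids in children.items():
        child_edges, rep = solve_and_sketch(
            points, child_ids, x0 + half * right, y0 + half * top,
            half, leaf_size, min_size,
        )
        edges += child_edges
        reps.append(rep)

    # Next level: connect the children using only their representatives.
    edges += local_mst(points, reps)
    return edges, reps[0]
```

The output is always a spanning tree of the input points, but it is not guaranteed to be a minimum one: once a cell has committed to its local edges and representative, later levels cannot undo that choice.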

Difficulties
Quad tree can cut MST edges
– forcing irrevocable decisions
Choose a wrong representative

New Partition: Grid Distance

MST Algorithm: Final
Kruskal’s MST algorithm: connect the points with the shortest edge that does not introduce a cycle
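For reference, Kruskal’s rule as stated above can be sketched sequentially in Python. This is the textbook single-machine version over all pairs of points, not the MapReduce algorithm from the lecture; the function name `kruskal_mst` and the union-find details are illustrative.

```python
import itertools
import math


def kruskal_mst(points):
    """Return (total_weight, edges) of a Euclidean MST of the given points."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent pointers to the component root, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )

    total, chosen = 0.0, []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding (i, j) does not introduce a cycle
            parent[ri] = rj
            total += w
            chosen.append((i, j))
    return total, chosen


total, tree = kruskal_mst([(0, 0), (1, 0), (1, 1), (5, 1)])
```

Sorting all pairs takes quadratic space in the number of points, which is exactly what a space-limited machine cannot afford.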

MST Analysis
Kruskal’s MST algorithm: connect the points with the shortest edge that does not introduce a cycle

MST Wrap-up

Wrap-up

1) Streaming algorithms
Frequency moments, heavy hitters
Graph algorithms
Algorithms for lists: median selection, longest increasing subsequence
Algorithms for geometric objects: clustering, MST, various approximation algorithms
[Table: IP, Frequency]
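As one concrete instance of the heavy-hitters problem, here is a short Python sketch of the Misra-Gries summary. This is a standard small-space algorithm chosen for illustration; the slide does not say which heavy-hitters algorithm the course used, and the name `misra_gries` and the parameter `k` (assumed to be at least 2) are assumptions.

```python
def misra_gries(stream, k):
    """One pass over the stream using at most k - 1 counters.

    Every item occurring more than len(stream) / k times ends up in the
    returned dict; each stored count underestimates the true count by at
    most len(stream) / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream of IP addresses, in the spirit of an IP/frequency table.
summary = misra_gries(["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3",
                       "10.0.0.1", "10.0.0.1"], k=2)
```

The summary can contain false positives, so a second pass (or a separate exact count of the surviving candidates) is needed when exact frequencies matter.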

Sketching & dimension reduction

Nearest Neighbor Search
Can use sketching for NNS
Even better via Locality Sensitive Hashing
Data-dependent LSH
Embeddings: reduce harder distances to easier ones
NNS for general metrics
Complexity dependent on “intrinsic dimension”
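To make the Locality Sensitive Hashing idea concrete, below is a small Python sketch of bit-sampling LSH for Hamming distance over binary vectors. The class name `HammingLSH` and its parameters (`bits_per_table`, `num_tables`, `seed`) are illustrative assumptions, and the parameter values are not tuned to any guarantee. Each table hashes a vector by a random subset of its coordinates, so vectors that differ in few positions are more likely to share a bucket.

```python
import random
from collections import defaultdict


class HammingLSH:
    """Bit-sampling LSH index for equal-length binary vectors (tuples of 0/1)."""

    def __init__(self, dim, bits_per_table, num_tables, seed=0):
        rng = random.Random(seed)
        # Each table looks at its own random subset of coordinates.
        self.projections = [
            rng.sample(range(dim), bits_per_table) for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.points = []

    def _keys(self, x):
        return [tuple(x[i] for i in coords) for coords in self.projections]

    def add(self, x):
        idx = len(self.points)
        self.points.append(x)
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(idx)

    def query(self, q):
        """Return the index of the closest candidate, or None if no bucket matches."""
        candidates = set()
        for table, key in zip(self.tables, self._keys(q)):
            candidates.update(table.get(key, []))
        if not candidates:
            return None
        return min(
            candidates,
            key=lambda i: sum(a != b for a, b in zip(self.points[i], q)),
        )


index = HammingLSH(dim=8, bits_per_table=3, num_tables=4)
for vec in [(0, 0, 0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1, 1, 1)]:
    index.add(vec)
```

More bits per table make buckets more selective, while more tables raise the chance that a true near neighbor collides with the query in at least one of them.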

Sampling, property testing
Distribution testing:
– Get samples from a distribution, deduce its properties
– Uniformity, identity
– Many others in the literature!
– Instance optimal: better for easier distributions
Property testing:
– Is this graph connected or far from connected?
– For dense graphs: regularity lemma
Sublinear time approximation:
– Estimate the MST cost, matching size, etc.
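As an illustration of uniformity testing from samples, the following Python sketch uses the classical collision statistic. The function names and the midpoint threshold are illustrative assumptions rather than the lecture’s exact tester. The idea: for a distribution p over n elements, two independent samples collide with probability equal to the sum of the squared probabilities, which is 1/n when p is uniform and at least (1 + eps^2)/n when p is eps-far from uniform in L1 distance.

```python
from collections import Counter


def collision_rate(samples):
    """Fraction of unordered sample pairs that hold the same value."""
    m = len(samples)
    if m < 2:
        raise ValueError("need at least two samples")
    colliding = sum(c * (c - 1) // 2 for c in Counter(samples).values())
    return colliding / (m * (m - 1) // 2)


def looks_uniform(samples, n, eps):
    """Accept if the collision rate is at most the midpoint (1 + eps^2 / 2) / n.

    n is the domain size and eps the distance parameter; the midpoint
    threshold is an assumption made for this sketch.
    """
    return collision_rate(samples) <= (1 + eps * eps / 2) / n
```

The statistic only becomes informative once collisions start to appear, which takes on the order of sqrt(n) samples; that is still far fewer than the n samples needed to see the whole domain.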

Parallel algorithms: MapReduce
Model: limited space/machine
Filtering: throw away part of the input locally, send only important stuff
Dense graph algorithms
Solve-And-Sketch:
– find a partial solution locally
– sketch the solution
– work with sketches up
Good for problems on points

Algorithms for massive data
Computer resources << data
Access data in a limited way
– Limited space (main memory << hard drive)
– Limited time (time << time to read entire data)
Introduction to Sublinear Algorithms
power of randomization and approximation