Announcements: Send your check-ins. Next Tuesday: 4th day of presentations (Learning Theory). Teams: Sathya, Ronak, Alisha, Zach, Devin, Josh. Scribe: 2 students.
Dimitris Papailiopoulos: Serially Equivalent + Scalable Parallel Machine Learning
Today: Single Machine, Multi-core
Today: (Maximally) Asynchronous. [Figure: Processor 1, Processor 2, …, Processor P working on shared state.]
Today: Serial Equivalence
For all datasets S, for all data orders π (data points can be arbitrarily repeated).
Main advantage:
- we only need to "prove" speedups
- convergence proofs are inherited directly from the serial algorithm
Main issue:
- serial equivalence is too strict
- it cannot guarantee any speedups in the general case
The Stochastic Updates Meta-algorithm
Stochastic Updates: What does this solve?
Stochastic Updates: A Family of ML Algorithms
Many algorithms with sparse access patterns: SGD, SVRG/SAGA, Matrix Factorization, word2vec, K-means, Stochastic PCA, Graph Clustering, …
[Figure: bipartite graph between data points and variables 1, 2, …, n.]
Can we parallelize under Serial Equivalence? (A serial sketch of the meta-algorithm follows below.)
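All of these instantiate one meta-algorithm: sample an update, read only the coordinates in its support, write back only those coordinates. A minimal serial sketch in Python; the names `x`, `supports`, and `update_fn` are illustrative, not from the slides:

import random

def stochastic_updates(x, supports, update_fn, num_iters, seed=0):
    """Serial Stochastic Updates meta-algorithm (sketch).

    x          : the model, an indexable container of coordinates
    supports   : supports[i] = the coordinates that update i reads/writes
    update_fn  : update_fn(i, local) -> {coordinate: new value}
    """
    rng = random.Random(seed)
    for _ in range(num_iters):
        i = rng.randrange(len(supports))            # sample an update (e.g., a data point)
        local = {j: x[j] for j in supports[i]}      # read only the sparse support
        for j, v in update_fn(i, local).items():    # e.g., one SGD step on data point i
            x[j] = v                                # write back only those coordinates
    return x

SGD, SVRG/SAGA, matrix factorization, and the rest differ only in `update_fn` and in the support structure of the data.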
A graph view of Conflicts in Parallel Updates
The Update Conflict Graph: an edge between two updates if they overlap, i.e., if their supports share a variable.
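For intuition only, a naive quadratic construction of the conflict edges from the update supports (as discussed later, CYCLADES never materializes this graph; `supports` is as in the sketch above):

from itertools import combinations

def conflict_edges(supports):
    """Naive conflict graph: an edge {u, v} iff updates u and v touch a
    common variable. Quadratic in the number of updates; intuition only."""
    sets = [set(s) for s in supports]
    return [(u, v) for u, v in combinations(range(len(sets)), 2)
            if sets[u] & sets[v]]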
The Theorem [Krivelevich'14]
Lemma: Sample fewer than $(1-\epsilon)\frac{n}{\Delta}$ vertices (with or without replacement), where $\Delta$ is the max degree of the graph. Then the induced subgraph shatters: the largest connected component has size $O\!\left(\frac{\log n}{\epsilon^2}\right)$, with high probability.
Even if the graph was a single huge conflict component!
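As an empirical sanity check (not in the original deck), one can vertex-sample a graph that is a single huge connected component and watch the induced subgraph shatter; a minimal sketch using networkx:

import random
import networkx as nx

def largest_sampled_component(G, eps=0.1, seed=0):
    """Sample (1 - eps) * n / Delta vertices of G uniformly and return the
    size of the largest connected component of the induced subgraph."""
    rng = random.Random(seed)
    n = G.number_of_nodes()
    Delta = max(d for _, d in G.degree())
    batch = rng.sample(list(G.nodes()), int((1 - eps) * n / Delta))
    H = G.subgraph(batch)
    return max(len(c) for c in nx.connected_components(H))

# A single huge conflict component: a random 20-regular graph on 100k vertices.
G = nx.random_regular_graph(20, 100_000, seed=0)
print(largest_sampled_component(G))   # typically logarithmic in n, tiny vs. n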
Building a Parallelization Framework out of a Single Theorem
Phase I: Sample Batch + Connected Components
Sample $(1-\epsilon)\frac{n}{\Delta}$ vertices (updates) and compute the connected components of the induced conflict subgraph.
NOTE: No conflicts across groups! Max connected component $= O(\log n)$, so we get roughly $\frac{n}{\Delta \log n}$ tiny components. Yay! Good for parallelization: no conflicts across groups means we can run Stochastic Updates on each of them in parallel!
Phase II: Allocation
Hand the components to Core 1, Core 2, …, Core p, with cores $< \text{batch size}/\log n = \frac{n}{\Delta \log n}$.
A single rule: run the updates serially inside each connected component. This is automatically satisfied, since we give each conflict group to a single core.
Policy I: random allocation; good when cores << batch size.
Policy II: greedy min-weight allocation, about 80% as good as the optimal allocation (which is NP-hard to compute); a sketch follows below.
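Policy II is classic greedy load balancing: assign each component, largest first, to the currently least-loaded core. A minimal sketch with illustrative names:

import heapq

def greedy_allocate(components, p):
    """Greedy min-weight allocation: each connected component (a list of
    update ids) goes, largest first, to the least-loaded of p cores."""
    heap = [(0, core) for core in range(p)]               # (current load, core id)
    heapq.heapify(heap)
    allocation = [[] for _ in range(p)]
    for comp in sorted(components, key=len, reverse=True):
        load, core = heapq.heappop(heap)
        allocation[core].append(comp)                     # whole component to one core,
        heapq.heappush(heap, (load + len(comp), core))    # so its updates stay serial
    return allocation

Keeping each component on a single core is exactly what enforces the single rule above.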
Phase III: Asynchronous and Lock-free Stochastic Updates
Each core (Core 1, Core 2, …, Core p) runs Stochastic Updates (SU) on its allocated components, asynchronously and lock-free (no communication!). No memory contention: not during reads, nor during writes!
Finally, wait for everyone to finish (a barrier), then move to the next batch. A single-batch sketch of all three phases follows below.
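A single-batch sketch of the whole pipeline, assuming the `x`/`supports`/`update_fn` conventions from the Stochastic Updates sketch above. Threads stand in for cores; note that CPython's GIL means this shows the structure, not real speedups (the actual implementation is C++):

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def components(batch, supports):
    """Phase I: group sampled updates that share a coordinate (union-find)."""
    parent = {i: i for i in batch}
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    owner = {}                                    # coordinate -> first update touching it
    for i in batch:
        for j in supports[i]:
            if j in owner:
                parent[find(i)] = find(owner[j])  # i conflicts with owner[j]
            else:
                owner[j] = i
    groups = defaultdict(list)
    for i in batch:
        groups[find(i)].append(i)
    return list(groups.values())

def cyclades_batch(x, supports, update_fn, batch, p):
    comps = components(batch, supports)           # Phase I
    allocation = [[] for _ in range(p)]           # Phase II: round-robin for brevity;
    for k, comp in enumerate(sorted(comps, key=len, reverse=True)):
        allocation[k % p].append(comp)            # see the greedy sketch above for Policy II
    def run_core(my_comps):                       # Phase III: lock-free, since no two
        for comp in my_comps:                     # cores ever share a coordinate
            for i in comp:
                local = {j: x[j] for j in supports[i]}
                for j, v in update_fn(i, local).items():
                    x[j] = v
    with ThreadPoolExecutor(max_workers=p) as pool:
        list(pool.map(run_core, allocation))      # barrier: wait for everyone to finish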
CYCLADES guarantees Serial Equivalence. But does it guarantee speedups?
Phase III is as fast as it gets. So if Phases I and II are fast, we are good: the overall speedup hinges on the serial costs of sampling + connected components (Phase I) and allocation (Phase II) not dominating the serial cost of the updates themselves (Phase III).
Main Theorem (informal): under the assumptions below, CYCLADES is serially equivalent and its speedup is nearly linear in the number of cores, up to polylogarithmic factors. Note: serial cost = $\kappa$ = cost per coordinate update. Assumptions: (1) the max degree is not too large (approximate "regularity"); (2) not too many cores; (3) sampling according to the "Graph Theorem".
Identical performance as serial for: SGD (convex/nonconvex), SVRG/SAGA (convex/nonconvex), weight decay, matrix factorization, word2vec, matrix completion, dropout + random coordinates, greedy clustering.
Fast Connected Components
Fast Connected Components
If you already have the conflict graph, CC is easy: O(sampled edges). But building the conflict graph can take $O(n^2)$ time… no, thanks.
Instead, sample on the bipartite update-to-variable graph, not on the conflict graph: sample $\frac{n}{\Delta}$ left vertices (updates).
Fast Connected Components: a simple message-passing idea.
- Gradients (updates) send their IDs to their coordinates
- Coordinates compute the min and send it back
- Gradients compute the min and send it back
- Iterate till you're done
[Animation: labels 1, 2, 3 propagate across the bipartite graph until each connected component agrees on its minimum label.]
Cost: each round costs O(sampled edges), and the number of rounds is bounded by the largest component size, which is logarithmic under the same assumptions as the main theorem. A sketch follows below.
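A minimal sketch of the message-passing CC computation on the bipartite graph (sequential for clarity; labels only decrease, so the iteration converges to the minimum update ID in each component):

def message_passing_cc(batch, supports):
    """Min-label propagation on the bipartite graph: sampled updates
    ('gradients') on the left, coordinates on the right."""
    label = {("u", i): i for i in batch}               # gradients send their IDs
    for i in batch:
        for j in supports[i]:
            label.setdefault(("c", j), i)
    changed = True
    while changed:                                     # iterate till you're done
        changed = False
        for i in batch:
            for j in supports[i]:                      # coordinates compute the min
                m = min(label[("u", i)], label[("c", j)])
                if label[("u", i)] != m or label[("c", j)] != m:
                    label[("u", i)] = label[("c", j)] = m   # and send it back
                    changed = True
    groups = {}
    for i in batch:                                    # one group per final min label
        groups.setdefault(label[("u", i)], []).append(i)
    return list(groups.values())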
Proof of “The Theorem”
The Theorem
Lemma: Activate each vertex with probability $p = (1-\epsilon)/\Delta$. Then the induced subgraph shatters, and the largest connected component has size $O\!\left(\frac{\log n}{\epsilon^2}\right)$, with high probability.
DFS 101
Probabilistic DFS
DFS with random coins.
Algorithm:
- Flip a coin for each vertex DFS wants to visit
- If 1, visit it; if 0, don't visit it and delete it along with its edges
A sketch of this procedure follows below.
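A minimal sketch of the coin-flipping DFS (assumptions: `adj` is an adjacency-list dict, and each vertex's single coin comes up 1 with probability p):

import random

def percolated_dfs(adj, p, seed=0):
    """DFS that flips a coin the first time it wants to visit a vertex:
    heads (prob. p) it visits; tails it deletes the vertex with its edges.
    Returns the sizes of the surviving connected components."""
    rng = random.Random(seed)
    coin = {}                                 # vertex -> outcome of its one coin
    sizes = []
    for s in adj:
        if s in coin:
            continue                          # coin already flipped elsewhere
        coin[s] = rng.random() < p
        if not coin[s]:
            continue                          # tails: s is deleted
        size, stack = 0, [s]
        while stack:                          # explore s's surviving component
            v = stack.pop()
            size += 1
            for u in adj[v]:
                if u not in coin:             # flip u's coin on first touch
                    coin[u] = rng.random() < p
                    if coin[u]:
                        stack.append(u)
        sizes.append(size)
    return sizes

With p = (1 - ε)/Δ on a max-degree-Δ graph, the largest returned size is O(log n / ε²) with high probability, matching the lemma.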
A little extra trickery turns this statement into a with/without-replacement theorem (the bound is sketched below):
- Say I have a connected component of size k.
- The number of random coins flipped that are "associated" with that component is at most $k\Delta$.
- So a size-k component implies the event "at least k coins are ON among $k\Delta$ coins".
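Filling in the step the slide alludes to (a standard Chernoff-bound sketch, not spelled out in the deck): each of the $k\Delta$ coins is ON with probability $p = (1-\epsilon)/\Delta$, so the expected number of ON coins is $k\Delta p = k(1-\epsilon)$, and seeing $k$ of them is an $\epsilon$-fraction deviation above the mean. Hence

$$\Pr[\text{a fixed vertex lies in a component of size} \ge k] \le \Pr\big[\mathrm{Bin}(k\Delta,\ p) \ge k\big] \le e^{-\epsilon^2 k / 3},$$

and a union bound over the $n$ vertices makes $k = O\!\left(\frac{\log n}{\epsilon^2}\right)$ the largest component size, with high probability.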
Is the max degree really an issue?
High-Degree Vertices (outliers)
Say we have a graph with a low-degree component plus some high-degree vertices.
Lemma: If you sample uniformly fewer than $(1-\epsilon)\frac{n}{\Delta_L}$ vertices, where $\Delta_L$ is the max degree of the low-degree subgraph, then the induced subgraph of the low-degree (!) part shatters, and its largest connected component has size $O\!\left(\frac{\log n}{\epsilon^2}\right)$ (w.h.p.).
Experiments
Experiments
- Implementation in C++
- Experiments on an Intel Xeon CPU E7-8870 v3, 1TB RAM
- Algorithms: SAGA, SVRG, L2-SGD, SGD
- Fully asynchronous (Hogwild!) vs. CYCLADES
Speedups: SVRG/SAGA ~500% gain, SGD ~30% gain (over the fully asynchronous baseline).
[Figure: Convergence on least squares, SAGA, 16 threads.]
Open Problems
- Assumptions: sparsity is key. O.P.: Can CYCLADES handle dense data? Maybe: run Hogwild!-style updates on the dense part and CYCLADES on the sparse part?
- O.P.: Data sparsification for $f(\langle a, x\rangle)$ problems?
- What other algorithms fit in Stochastic Updates?
Open Problems
- Asynchronous algorithms are great for shared-memory systems, but there are issues when scaling across nodes; similar issues arise in the distributed setting. [Figure: speedup vs. #threads.]
- O.P.: How to provably scale on NUMA?
- O.P.: What is the right ML paradigm for the distributed setting?
CYCLADES: a framework for parallel sparse ML algorithms.
- Lock-free + (maximally) asynchronous
- No conflicts
- Serializable
- Black-box analysis