SGD ON HADOOP FOR BIG DATA & HUGE MODELS Alex Beutel Based on work done with Abhimanu Kumar, Vagelis Papalexakis, Partha Talukdar, Qirong Ho, Christos Faloutsos, and Eric Xing

Outline
1. When to use SGD for distributed learning
2. Optimization
   - Review of DSGD
   - SGD for tensors
   - SGD for ML models: topic modeling, dictionary learning, MMSB
3. Hadoop
   1. General algorithm
   2. Setting up the MapReduce body
   3. Reducer communication
   4. Distributed normalization
   5. "Always-On SGD": how to deal with the straggler problem
4. Experiments

When distributed SGD is useful
- Collaborative filtering: predict movie preferences
- Topic modeling: what are the topics of webpages, tweets, or status updates?
- Dictionary learning: remove noise or missing pixels from images
- Tensor decomposition: find communities in temporal graphs
The scale that motivates this: 300 million photos uploaded to Facebook per day, 1 billion users on Facebook, 400 million tweets per day.

Gradient Descent
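(The update rule on this slide is not preserved in the transcript. As a hedged reconstruction, standard gradient descent minimizes a loss L(θ) = Σᵢ ℓᵢ(θ) by stepping against the full gradient:

    \theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla L\big(\theta^{(t)}\big)
                   = \theta^{(t)} - \eta_t \sum_{i=1}^{N} \nabla \ell_i\big(\theta^{(t)}\big)

where η_t is the step size. The sum runs over all N data points, which makes each step expensive at scale.)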

Stochastic Gradient Descent (SGD)
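(Again the slide's formula is missing from the transcript; as a hedged reconstruction, the standard SGD update samples a single data point i_t uniformly at random and uses only its gradient:

    \theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla \ell_{i_t}\big(\theta^{(t)}\big)

Each step is cheap but noisy; a decreasing step size η_t recovers convergence in expectation.)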

SGD Background

DSGD for Matrices (Gemulla, 2011). Figure: the ratings matrix X (Users × Movies) is factored as X ≈ U V, with the latent inner dimension playing the role of genres.

DSGD for Matrices (Gemulla, 2011). Figure: blocks of X that share no rows and no columns touch disjoint rows of U and disjoint columns of V, so they are independent!

DSGD for Matrices (Gemulla, 2011) Independent Blocks

DSGD for Matrices (Gemulla, 2011): Partition your data & model into d × d blocks. This yields d strata (d = 3 in the figure). Process the strata sequentially; within each stratum, process the d blocks in parallel.
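(A minimal Python sketch of this schedule, illustrative only and not the presenter's code; X_blocks[b][j] is assumed to hold the observed entries of block (b, j) with block-local indices, and U, V are numpy arrays whose rows are factor vectors:

    def sgd_on_block(block, U, V, eta, lam=0.1):
        # One SGD pass over the observed entries (i, j, x) of this block;
        # i and j are indices local to the factor blocks U and V.
        for i, j, x in block:
            err = x - U[i] @ V[j]
            U[i], V[j] = U[i] + eta * (err * V[j] - lam * U[i]), \
                         V[j] + eta * (err * U[i] - lam * V[j])

    def dsgd_epoch(X_blocks, U_blocks, V_blocks, d, eta):
        # Stratum s pairs block row b with block column (b + s) % d, so the
        # d blocks of a stratum share no rows of U and no columns of V.
        for s in range(d):                          # strata: sequential
            for b in range(d):                      # blocks: parallelizable
                sgd_on_block(X_blocks[b][(b + s) % d],
                             U_blocks[b], V_blocks[(b + s) % d], eta)

The inner loop over b is where a distributed implementation fans out, one block per machine.)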

TENSORS

What is a tensor? Tensors are used for structured data with more than 2 dimensions; think of a 3-mode tensor as a 3D matrix, with modes such as Subject × Verb × Object. For example, the triple (Derek Jeter, plays, baseball) indexes one entry.

Tensor Decomposition. Figures: the tensor X is approximated by three factor matrices, X ≈ ⟦U, V, W⟧. Each observed entry of X touches one row each of U, V, and W, so blocks of X that share no rows of U, V, or W are independent, while blocks that overlap in any mode are not independent.
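(The slide equations are not in the transcript. As a hedged reconstruction, for a rank-K decomposition the standard squared-error model and the SGD update triggered by one observed entry x_ijk are:

    x_{ijk} \approx \sum_{r=1}^{K} U_{ir} V_{jr} W_{kr}, \qquad
    e_{ijk} = x_{ijk} - \sum_{r=1}^{K} U_{ir} V_{jr} W_{kr}

    U_{i\cdot} \leftarrow U_{i\cdot} + \eta \, e_{ijk} \, (V_{j\cdot} * W_{k\cdot})

with symmetric updates for V_{j·} and W_{k·}, where * denotes the elementwise product.)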

Tensor Decomposition

For d = 3 blocks per stratum, we require d² = 9 strata.
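(One standard construction of these d² strata, sketched in Python; this is illustrative, and the talk may index its strata differently. Stratum (s1, s2) takes the blocks (b, (b + s1) mod d, (b + s2) mod d), which share no coordinate in any mode:

    def tensor_strata(d):
        # Yields d*d strata; each stratum is a list of d blocks that share no
        # U, V, or W rows, so they can run in parallel. Together the strata
        # cover all d**3 blocks of the tensor exactly once.
        for s1 in range(d):
            for s2 in range(d):
                yield [(b, (b + s1) % d, (b + s2) % d) for b in range(d)]

    # For d = 3: 9 strata of 3 independent blocks each, covering all 27 blocks.
    assert sum(len(stratum) for stratum in tensor_strata(3)) == 27
)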

Tensor Decomposition

Coupled Matrix + Tensor Decomposition. Figure: a tensor X with modes Subject × Verb × Object, coupled with a matrix Y that shares one of those modes and adds a Document mode.

Coupled Matrix + Tensor Decomposition. Figure: X ≈ ⟦U, V, W⟧ and Y ≈ U Aᵀ, so the factor U is shared between the tensor and the matrix.
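(A hedged reconstruction of the coupled objective, whose formula is not preserved in the transcript: minimize both reconstruction errors jointly over the shared factor U,

    \min_{U,V,W,A} \; \Big\| X - \sum_{r} u_r \circ v_r \circ w_r \Big\|_F^2
    \;+\; \big\| Y - U A^{\top} \big\|_F^2

where ∘ is the outer product. SGD then draws points from X and Y alike, and rows of U receive gradient contributions from both terms.)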

CONSTRAINTS & PROJECTIONS

Example: Topic Modeling. Figure: the Documents × Words matrix is factored through a Topics dimension, into a Documents × Topics factor and a Topics × Words factor.

Constraints. Sometimes we want to restrict the learned factors:
- non-negative
- sparse
- on the simplex (so vectors become probabilities)
- inside the unit ball

How to enforce constraints? Projections: after each SGD step, project the parameters back onto the constraint set. Example: non-negativity.
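(The slide's formula is missing; the standard non-negative projection simply clips each coordinate at zero:

    \Pi_{\geq 0}(x)_i = \max(x_i, 0)
)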

More projections: sparsity (soft thresholding), the simplex, and the unit ball.
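(The formulas themselves did not survive the transcript; standard forms of these three projections are:

    \text{soft thresholding:} \quad S_\lambda(x)_i = \operatorname{sign}(x_i)\,\max(|x_i| - \lambda,\, 0)

    \text{unit ball:} \quad \Pi_B(x) = x \,/\, \max(1, \|x\|_2)

    \text{simplex:} \quad \Pi_\Delta(x) = \max(x - \tau \mathbf{1},\, 0),
    \text{ with } \tau \text{ chosen so the entries sum to } 1

The simplex threshold τ can be found by sorting x (Duchi et al., 2008).)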

Dictionary Learning: learn a dictionary of concepts and a sparse reconstruction. Useful for fixing noise and missing pixels in images. Constraints: a sparse encoding, with dictionary atoms kept within the unit ball.
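(A hedged sketch of the usual dictionary-learning objective, not shown in the transcript, with data X, dictionary D, and sparse codes A:

    \min_{D, A} \; \|X - D A\|_F^2 + \lambda \|A\|_1
    \quad \text{s.t.} \quad \|d_j\|_2 \le 1 \;\; \forall j

The ℓ1 penalty is enforced with soft thresholding on A, and the norm constraint with the unit-ball projection above on each dictionary atom d_j.)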

Mixed Membership Network Decomposition: used for modeling communities in graphs (e.g., a social network). Constraints: membership vectors lie on the simplex, and the factors are non-negative.

IMPLEMENTING ON HADOOP

High level algorithm:

    for epoch e = 1 … T do
        for subepoch s = 1 … d² do
            let B_s be the set of blocks in stratum s
            for block b = 1 … d in parallel do
                run SGD on all points in block b of B_s
            end
        end
    end

(Figure: Stratum 1, Stratum 2, Stratum 3, …)

Bad Hadoop Algorithm. Figures: launch one full MapReduce job per subepoch. In subepoch 1, mappers ship the data and each reducer runs SGD on one block and updates its factors, e.g. (U2, V1, W3), (U3, V2, W1), (U1, V3, W2); in subepoch 2 the block assignment rotates, e.g. (U2, V1, W2), (U3, V2, W3), (U1, V3, W1). Every factor matrix must be rewritten and a fresh job launched for every subepoch.

Hadoop Challenges: MapReduce is typically very bad for iterative algorithms. The naive approach requires T × d² jobs, there is sizable overhead per Hadoop job, and the framework allows little flexibility.

High Level Algorithm. Figures: the factor blocks U1…U3, V1…V3, W1…W3 rotate across reducers one stratum at a time. Reducer r keeps (U_r, V_r) throughout and receives a rotating W block: stratum 1 processes (U1, V1, W1), (U2, V2, W2), (U3, V3, W3); stratum 2 processes (U1, V1, W3), (U2, V2, W1), (U3, V3, W2); stratum 3 processes (U1, V1, W2), (U2, V2, W3), (U3, V3, W1).

Hadoop Algorithm. Figures, stepped animation of a single MapReduce job:
1. Mappers process the points, mapping each point to its block together with the info necessary to order it.
2. The Partition & Sort phase routes each point to the reducer that owns its block. Use: a custom Partitioner, KeyComparator, and GroupingComparator.
3. Each reducer runs SGD on its current block, e.g. (U1, V1, W1), (U2, V2, W2), (U3, V3, W3), and updates those factors.
4. Between subepochs, each reducer keeps U_r and V_r but hands its W block to the next reducer through HDFS, then runs SGD on the newly received block.
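(A minimal Hadoop Streaming-style sketch of the mapper side in Python; illustrative only. The talk's system uses the Java API with the custom comparator classes above, and the stratum numbering here is one valid schedule, not necessarily theirs:

    import sys

    d = 3  # blocks per mode

    # Key each tensor entry by (reducer, subepoch) so Hadoop's partition & sort
    # delivers whole blocks, in stratum order, to the right reducer.
    for line in sys.stdin:
        i, j, k, value = line.split()
        bi, bj, bk = int(i) % d, int(j) % d, int(k) % d   # block coordinates
        s = ((bj - bi) % d) * d + ((bk - bi) % d)         # subepoch of this block
        print(f"{bi}\t{s}\t{i}\t{j}\t{k}\t{value}")       # reducer bi, subepoch s

The reducer then sees its points grouped by subepoch, runs SGD on each block in turn, and reads and writes the rotating factor blocks via HDFS.)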

Hadoop Summary
1. Use mappers to send data points to the correct reducers, in the correct order.
2. Use reducers as the machines of a normal cluster.
3. Use HDFS as the communication channel between reducers.

Distributed Normalization. Figure: a topic model (Documents × Words over Topics) split across machines; machine b holds its blocks of the document-topic parameters π_b and the topic-word parameters β_b.

Distributed Normalization: σ^(b) is a k-dimensional vector summing the terms of β_b. Each machine computes its σ^(b) locally and transfers it to all other machines; each machine then calculates the global σ from the σ^(b) it received and normalizes its β_b by σ.
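(A hedged reconstruction of the missing formulas, for topic t on machine b, with w ranging over the words in block b:

    \sigma^{(b)}_t = \sum_{w} \beta^{(b)}_{wt}, \qquad
    \sigma_t = \sum_{b=1}^{d} \sigma^{(b)}_t, \qquad
    \beta_{wt} \leftarrow \beta_{wt} / \sigma_t

so each topic's word distribution sums to one across the whole vocabulary, even though no single machine holds all of β.)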

Barriers & Stragglers. Figure: the same MapReduce pipeline, but a reducer that finishes its block early must sit at the synchronization barrier. It is wasting time waiting!

Solution: “Always-On SGD”. For each reducer: run SGD on all points in the current block Z, then check whether the other reducers are ready to sync. If they are not ready, shuffle the points in Z, decrease the step size, run SGD on the points in Z again, and check again. Once everyone is ready, sync parameters and get a new block Z.
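(A minimal sketch of that loop; illustrative, with sgd_pass, ready_to_sync, and sync_and_get_block as hypothetical helpers standing in for the system's HDFS-based coordination:

    import random

    # Hypothetical helpers: sgd_pass, ready_to_sync, sync_and_get_block.
    def always_on_sgd(Z, params, eta, decay=0.9):
        sgd_pass(Z, params, eta)             # first pass over the new block
        while not ready_to_sync():           # instead of idling at the barrier,
            random.shuffle(Z)                # reshuffle the same points,
            eta *= decay                     # decrease the step size,
            sgd_pass(Z, params, eta)         # and take extra SGD passes
        return sync_and_get_block(params)    # exchange parameters; next block Z

The decreasing step size helps keep the extra passes over the same points from overfitting the block.)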

“Always-On SGD”. Figure: the same pipeline as before, but while a fast reducer waits for the others, it runs SGD on its old points again!

“Always-On SGD”. Figure: a timeline across Reducers 1–4 showing, for each: read parameters from HDFS, the first SGD pass of block Z, extra SGD updates while waiting, and write parameters to HDFS.

EXPERIMENTS

FlexiFaCT (Tensor Decomposition) Convergence

FlexiFaCT (Tensor Decomposition) Scalability in Data Size

FlexiFaCT (Tensor Decomposition) Scalability in Tensor Dimension Handles up to 2 billion parameters!

FlexiFaCT (Tensor Decomposition) Scalability in Rank of Decomposition Handles up to 4 billion parameters!

FlexiFaCT (Tensor Decomposition) Scalability in Number of Machines

Fugue (Using “Always-On SGD”) Dictionary Learning: Convergence

Fugue (Using “Always-On SGD”) Community Detection: Convergence

Fugue (Using “Always-On SGD”) Topic Modeling: Convergence

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Data Size

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability in Rank

Fugue (Using “Always-On SGD”) Topic Modeling: Scalability over Machines

Fugue (Using “Always-On SGD”) Topic Modeling: Number of Machines

Fugue (Using “Always-On SGD”)

Key Points
- A flexible method for tensors & ML models
- Can use stock Hadoop by using HDFS for communication between reducers
- When waiting for slower machines, run updates on old data again

Questions? Alex Beutel