Scaling Distributed Machine Learning with the Parameter Server
Based on the paper and presentation: "Scaling Distributed Machine Learning with the Parameter Server" – Google, Baidu, Carnegie Mellon University. Included in the proceedings of OSDI '14.
Presentation by: Sanchit Gupta

Data is Large and Increasing!

Characteristics & Challenges of ML Jobs
- Training data is large: 1 TB to 1 PB.
- Models are complex, with billions to trillions of parameters.
- Parameters are shared globally among worker nodes, so accessing them incurs large network costs.
- Sequential ML jobs require barriers, which hurt performance by blocking.
- At scale, fault tolerance is required: these jobs run in cloud environments where machines are unreliable and jobs can be preempted.

Key Goals and Features of the Design
- Efficient communication: an asynchronous communication model that does not block computation.
- Flexible consistency models: the algorithm designer can balance algorithmic convergence and system efficiency.
- Elastic scalability: new nodes can be added without restarting the framework.
- Fault tolerance and durability: recovery from and repair of machine failures within about 1 second.
- Ease of use: it is easy for users to write programs.

Architecture & Design Details

Architecture: Data and Model
[Architecture diagram: the training data is partitioned across worker machines, while the model is partitioned across server machines. Workers push updates to and pull parameters from the servers. A resource manager, a server manager, and a task scheduler coordinate the worker and server groups.]

Example: Distributed Gradient Descent
1. Workers get their assigned partition of the training data.
2. Workers pull the working set of the model from the servers.
3. Iterate until stop:
  - Workers compute gradients on their local data.
  - Workers push the gradients to the servers.
  - Servers aggregate the gradients into the current model.
  - Workers pull the updated model.
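To make the loop concrete, here is a minimal, single-process sketch of the worker/server interaction in Python. The ToyParameterServer class, compute_gradient, and worker_loop are illustrative stand-ins under assumed names, not the paper's actual C++ interface.

```python
import numpy as np

class ToyParameterServer:
    """In-process stand-in for the server machines (illustration only)."""
    def __init__(self, num_keys):
        self.w = np.zeros(num_keys)

    def pull(self, keys):
        # Return the requested subset of the model.
        return self.w[keys]

    def push(self, keys, grads, lr=0.1):
        # Servers aggregate the pushed gradients into the current model.
        self.w[keys] -= lr * grads

def compute_gradient(w, X, y):
    # Least-squares gradient on this worker's data partition.
    return X.T @ (X @ w - y) / len(y)

def worker_loop(ps, X, y, keys, num_iters=100):
    for _ in range(num_iters):
        w = ps.pull(keys)               # pull the working set of the model
        g = compute_gradient(w, X, y)   # compute gradients locally
        ps.push(keys, g)                # push gradients; the server updates w

# Toy usage with one worker and synthetic data.
rng = np.random.default_rng(0)
X, true_w = rng.normal(size=(64, 8)), rng.normal(size=8)
y = X @ true_w
ps = ToyParameterServer(num_keys=8)
worker_loop(ps, X, y, keys=np.arange(8))
```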

Architecture: Parameter Key-Value Model
- Model parameters are represented as (key, value) pairs.
- The parameter server models these (key, value) pairs as sparse linear algebra objects.
- Several (key, value) pairs needed to form a vector or matrix can be batched instead of being sent one by one.
- Easy to program: parameters can be treated as key-values while being endowed with matrix semantics.
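As one way to picture "key-values with vector semantics", a sparse parameter segment can be stored as parallel key and value arrays while still supporting linear algebra operations. SparseSegment below is a hypothetical helper written for this slide, not part of the paper's implementation.

```python
import numpy as np

class SparseSegment:
    """Hypothetical sketch: a batch of (key, value) pairs with vector semantics."""
    def __init__(self, keys, values):
        self.keys = np.asarray(keys, dtype=np.int64)        # sorted feature ids
        self.values = np.asarray(values, dtype=np.float64)

    def dot(self, other):
        # Vector semantics: inner product over the intersection of keys.
        common, ia, ib = np.intersect1d(self.keys, other.keys, return_indices=True)
        return float(self.values[ia] @ other.values[ib])

w = SparseSegment([3, 17, 42], [0.5, -1.2, 2.0])
x = SparseSegment([3, 42, 99], [1.0, 0.25, 7.0])
print(w.dot(x))   # 0.5*1.0 + 2.0*0.25 = 1.0
```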

Architecture: Range Push and Pull
- Data is sent between workers and servers using PUSH and PULL operations.
- The parameter server optimizes update communication by using range-based PUSH and PULL.
- Example: let w denote the parameters of some model. Then
  w.push(Range, dest)
  w.pull(Range, dest)
  send or receive all existing entries of w whose keys fall in Range.
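The following sketch shows what a range-based pull and push might look like over sorted keys. The RangePS class and its method names are assumptions made for illustration, not the system's real interface.

```python
import numpy as np

class RangePS:
    """Hypothetical server-side store keyed by sorted integer keys."""
    def __init__(self, keys, values):
        self.keys = np.asarray(keys, dtype=np.int64)       # kept sorted
        self.values = np.asarray(values, dtype=np.float64)

    def pull(self, lo, hi):
        # Return all existing entries with keys in [lo, hi).
        i, j = np.searchsorted(self.keys, [lo, hi])
        return self.keys[i:j], self.values[i:j].copy()

    def push(self, lo, hi, deltas):
        # Apply updates to all existing entries with keys in [lo, hi).
        i, j = np.searchsorted(self.keys, [lo, hi])
        self.values[i:j] += deltas

store = RangePS(keys=[2, 5, 9, 14], values=[0.0, 0.0, 0.0, 0.0])
keys, vals = store.pull(4, 12)                   # -> keys [5, 9]
store.push(4, 12, deltas=np.ones_like(vals))     # update only that range
```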

Architecture: Asynchronous Tasks and Dependency
Challenges for data synchronization:
- There is massive communication traffic due to frequent access to the shared model.
- Global barriers between iterations lead to idle workers waiting for other computation to finish, and a high total finish time.
The parameter server therefore issues tasks asynchronously: a caller does not block on a push or pull, and dependencies between tasks can be specified explicitly when an algorithm needs them. A sketch of this idea follows.
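A minimal sketch of asynchronous tasks with one explicit dependency, using Python futures rather than the paper's RPC layer; push_gradients and pull_weights are placeholder stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: task B (a pull) is issued only after task A (a push) has finished,
# while the caller is free to keep computing instead of blocking.

executor = ThreadPoolExecutor(max_workers=2)

def push_gradients():
    return "push done"          # stand-in for sending a range of updates

def pull_weights():
    return "fresh weights"      # stand-in for receiving a range of parameters

push_future = executor.submit(push_gradients)   # issued asynchronously
# ... the caller can continue local computation here ...

push_future.result()                            # dependency: wait only when needed
pull_future = executor.submit(pull_weights)     # runs after the push finished
print(pull_future.result())
```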

Architecture: Flexible Consistency
- The consistency model for the system can be changed to match the requirements of the job.
- It is up to the algorithm designer to choose the consistency model.
- This trades off algorithmic efficiency against system performance.
- The supported models include Sequential, Eventual, and Bounded Delay consistency.
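As a sketch of the bounded delay idea, assuming the usual stale-synchronous definition (a worker at iteration t may proceed only if no outstanding work is more than tau iterations behind); the check below is a toy illustration, not the paper's code.

```python
def can_start_iteration(t, slowest_completed_iteration, tau):
    """Bounded delay: iteration t may start only if no outstanding work is
    more than `tau` iterations behind. tau = 0 gives sequential (fully
    synchronous) execution; tau = infinity gives eventual (fully
    asynchronous) execution."""
    return t - slowest_completed_iteration <= tau

# Example: with tau = 2, a worker at iteration 7 may proceed as long as
# every other worker has finished at least iteration 5.
print(can_start_iteration(7, slowest_completed_iteration=5, tau=2))   # True
print(can_start_iteration(7, slowest_completed_iteration=4, tau=2))   # False
```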

Architecture: User-Defined Filters
- Selectively synchronize (key, value) pairs.
- Filters can be placed at the server machines, the worker machines, or both.
- This allows fine-grained consistency control within a task.
- Example: a significantly-modified filter only pushes entries whose values have changed by more than a threshold.
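A toy worker-side version of such a filter; the function name, signature, and threshold are illustrative assumptions.

```python
import numpy as np

def significantly_modified(new_values, last_synced_values, threshold=1e-3):
    """Toy significantly-modified filter: keep only the entries whose value
    changed by more than `threshold` since the last synchronization, and
    skip sending the rest."""
    changed = np.abs(new_values - last_synced_values) > threshold
    return np.nonzero(changed)[0], new_values[changed]

new = np.array([0.10, 0.20, 0.30001, 0.40])
old = np.array([0.10, 0.25, 0.30000, 0.10])
idx, vals = significantly_modified(new, old, threshold=1e-2)
print(idx)    # [1 3] -> only these entries would be pushed
```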

Implementation: Vector Clocks & Messaging
- A vector clock is attached to each (key, value) pair for several purposes: tracking aggregation status, rejecting data that is sent twice, and recovering from failures.
- Because many (key, value) pairs are updated together during one iteration, they can share the same clock stamp, which reduces the space required.
- Messages are sent as key ranges for efficient lookup and transfer.
- Messages are compressed using Google's Snappy compression library.
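A minimal sketch of the ranged-clock idea: instead of one timestamp per key, store one timestamp per contiguous key range that was updated together. The RangeVectorClock class below is a simplified assumption, not the paper's data structure.

```python
class RangeVectorClock:
    """Sketch: one clock value per (node, key range) instead of per key."""
    def __init__(self):
        # Maps (node_id, (range_start, range_end)) -> clock timestamp.
        self.clock = {}

    def update(self, node_id, key_range, timestamp):
        # All keys updated together in this range share one stamp.
        self.clock[(node_id, key_range)] = timestamp

    def get(self, node_id, key_range):
        return self.clock.get((node_id, key_range), 0)

vc = RangeVectorClock()
vc.update(node_id=3, key_range=(0, 1_000_000), timestamp=7)
print(vc.get(3, (0, 1_000_000)))   # 7 -> one entry covers a million keys
```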

Implementation: Consistent Hashing & Replication
- The parameter server partitions the keys onto the servers using range partitioning.
- The servers themselves are hashed onto a virtual ring, similar to Chord.
- Each server node stores a replica of the (key, value) pairs owned by the k server nodes counter-clockwise from it.
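A toy sketch of placing servers on a hash ring and choosing replica holders as the k counter-clockwise neighbors; the hash function, ring size, and helper names are illustrative assumptions.

```python
import hashlib

RING_SIZE = 2 ** 16

def ring_position(server_id):
    # Hash each server id onto a fixed-size virtual ring.
    digest = hashlib.md5(server_id.encode()).hexdigest()
    return int(digest, 16) % RING_SIZE

def replica_holders(owner, servers, k=2):
    """Return the k servers counter-clockwise from `owner` on the ring;
    they keep backup copies of the key range that `owner` serves."""
    ordered = sorted(servers, key=ring_position)
    i = ordered.index(owner)
    return [ordered[(i - j) % len(ordered)] for j in range(1, k + 1)]

servers = [f"server-{n}" for n in range(6)]
print(replica_holders("server-0", servers, k=2))
```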

Evaluation

Evaluation: Sparse Logistic Regression
- One of the most popular large-scale risk minimization algorithms.
- Example: for ads prediction, we want to predict the revenue an ad will generate. This can be done by running a logistic regression on the available data for ads that are 'close to' the ad we want to post.
- The experiment was run with:
  - 170 billion examples
  - 65 billion unique features
  - 636 TB of data in total
  - 1000 machines: 800 workers and 200 servers
  - Machines with 16 cores, 192 GB of DRAM, and 10 Gb Ethernet links
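For reference, sparse logistic regression is commonly formulated as l1-regularized logistic loss minimization; assuming that standard form (the sparsity-inducing regularizer is what keeps most model entries at zero), the objective is:

```latex
\min_{w} \; \sum_{i=1}^{n} \log\!\bigl(1 + \exp\bigl(-y_i \, \langle x_i, w \rangle\bigr)\bigr) \;+\; \lambda \lVert w \rVert_1
```

Here the x_i are sparse feature vectors, the y_i in {-1, +1} are labels, and the l1 penalty drives most entries of w to zero, which keeps the shared model sparse.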

Summary and Discussion

Summary: Pros
- Efficient communication:
  - Batching (key, value) pairs into linear algebra objects
  - Filters to reduce unnecessary communication, plus message compression
  - Caching keys at worker and server nodes for local access
- Flexible consistency models:
  - Can choose between Sequential, Eventual, and Bounded Delay consistency models
  - Allows trade-offs between system performance and algorithmic convergence
- Fault tolerance and durability:
  - Replication of data on the servers
  - Failed workers can restart at the point of failure by using vector clocks
- Ease of use:
  - Linear algebra objects allow for intuitive implementation of tasks

Summary: Cons & Further Discussion
- What are System A and System B? The paper gives no insight into their design differences.
- Server manager failures and task scheduler failures are not discussed.
- There are no experiments running the other two systems with the Bounded Delay model; System B's waiting time might be reduced if it were implemented with Bounded Delay.