TensorFlow: A System for Large-Scale Machine Learning


TensorFlow: A System for Large-Scale Machine Learning
Presenter: Weize Sun

Introduction to TensorFlow
TensorFlow originated from Google's DistBelief. It uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. The paper focuses on neural network training as a particularly challenging systems problem.

Previous system: DistBelief
DistBelief adopts a parameter server architecture. A job comprises two disjoint sets of processes:
- Worker processes: stateless; perform the bulk of the computation when training a model.
- Parameter server processes: stateful; maintain the current version of the model parameters.
DistBelief's programming model is similar to Caffe's: the user defines a neural network as a DAG of layers that terminates with a loss function.

However, users want flexibility
DistBelief's layers are implemented as C++ classes, which can be a barrier for researchers who want to experiment with new layer architectures. Researchers also often want to try new optimization methods, but doing so in DistBelief requires modifying the parameter server implementation. Finally, DistBelief workers follow a fixed execution pattern that works for training simple feed-forward neural networks but fails for more advanced models, such as recurrent neural networks, which contain loops.

Design principles
Both TensorFlow and DistBelief use a dataflow graph to represent models, but a DistBelief model comprises relatively few complex “layers”, whereas the corresponding TensorFlow model represents individual mathematical operators as nodes.
TensorFlow adopts deferred execution, which has two phases: the first phase defines the program as a symbolic dataflow graph; the second phase executes an optimized version of that program (see the sketch below).
There is no dedicated parameter server: on a cluster, TensorFlow is deployed as a set of tasks.
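A minimal sketch of the two-phase, deferred-execution style, assuming the TensorFlow 1.x graph-mode Python API (all names here are illustrative):

    import tensorflow as tf

    # Phase 1: define the program as a symbolic dataflow graph.
    x = tf.placeholder(tf.float32, shape=[None, 3], name="x")
    w = tf.Variable(tf.ones([3, 1]), name="w")
    y = tf.matmul(x, w)  # nothing is computed yet

    # Phase 2: execute an optimized version of the graph.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))  # [[6.]]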

TensorFlow execution model
TensorFlow differs from batch dataflow systems in two respects:
- The model supports multiple concurrent executions on overlapping subgraphs of the overall graph.
- Individual vertices may have mutable state that can be shared between different executions of the graph.
Mutable state is crucial when training very large models, because it becomes possible to make in-place updates to very large parameters and propagate those updates to parallel training steps as quickly as possible.
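A small illustration of mutable state that persists across graph executions, again assuming the TensorFlow 1.x API:

    import tensorflow as tf

    counter = tf.Variable(0, name="counter")   # stateful vertex
    increment = tf.assign_add(counter, 1)      # in-place update

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(increment)
        sess.run(increment)
        print(sess.run(counter))  # 2: the state is shared across executions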

TensorFlow dataflow graph elements
In a TensorFlow graph, each vertex represents a unit of local computation, and each edge represents the output from, or input to, a vertex.
Tensor: TensorFlow models all data as tensors (n-dimensional arrays) whose elements have one of a small number of primitive types. All TensorFlow tensors are dense at the lowest level, which keeps memory allocation simple; sparse data is encoded in terms of dense tensors, e.g. the matrix [[1, 0, 2], [0, 0, 3], [0, 0, 0]] as an index tensor [[1, 1], [1, 3], [2, 3]] and a value tensor [1, 2, 3].
Operation: An operation takes m >= 0 tensors as input and produces n >= 0 tensors as output, and has a named type.
Variables (stateful operations): take no inputs and produce a reference handle through which the parameter state can be read and updated.
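A sketch of these graph elements in the TensorFlow 1.x Python API; tf.SparseTensor is one way to express the dense index/value encoding (the names and shapes are illustrative):

    import tensorflow as tf

    # Tensors and an operation (2 inputs, 1 output).
    a = tf.constant([[1.0, 0.0, 2.0], [0.0, 0.0, 3.0], [0.0, 0.0, 0.0]])
    b = tf.constant([[1.0], [1.0], [1.0]])
    product = tf.matmul(a, b)

    # A variable: a stateful operation with no inputs that yields a reference handle.
    w = tf.Variable(tf.zeros([3, 1]), name="w")
    update_w = tf.assign(w, product)

    # The same matrix as `a`, encoded with dense index and value tensors.
    sparse_a = tf.SparseTensor(indices=[[0, 0], [0, 2], [1, 2]],
                               values=[1.0, 2.0, 3.0],
                               dense_shape=[3, 3])

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(update_w))  # [[3.], [3.], [0.]]
        print(sess.run(sparse_a))  # SparseTensorValue with the indices/values above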

Partial and concurrent execution
The API for executing a graph allows the client to specify declaratively the subgraph that should be executed. By default, concurrent executions of TensorFlow subgraphs run asynchronously with respect to one another. This asynchrony makes it straightforward to implement machine learning algorithms with weak consistency requirements.
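For example, fetching only one output in a run call executes just the subgraph needed to produce it (TensorFlow 1.x API; the names are illustrative):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, shape=[None, 2])
    w = tf.Variable(tf.ones([2, 1]))
    prediction = tf.matmul(x, w)
    loss = tf.reduce_mean(tf.square(prediction))  # one branch of the graph
    regularizer = tf.reduce_sum(tf.abs(w))        # an independent branch

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # Only the vertices needed for `loss` are executed; `regularizer` is not.
        print(sess.run(loss, feed_dict={x: [[1.0, 2.0]]}))  # 9.0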

Distributed execution
Each operation resides on a particular device, such as a CPU or GPU in a particular task. A device is responsible for executing a kernel for each operation assigned to it. The per-device subgraph for device d contains all of the operations that were assigned to d, with additional Send and Recv operations that replace edges across device boundaries.
(Figure in the original slides: two per-device subgraphs connected by Send and Recv operations.)
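A sketch of explicit device placement, assuming the TensorFlow 1.x device-naming scheme (the job and task names are illustrative). TensorFlow inserts the Send/Recv pairs automatically when it partitions the graph:

    import tensorflow as tf

    with tf.device("/job:ps/task:0/cpu:0"):
        w = tf.Variable(tf.zeros([784, 10]), name="w")  # parameters live on the PS task

    with tf.device("/job:worker/task:0/gpu:0"):
        x = tf.placeholder(tf.float32, shape=[None, 784])
        logits = tf.matmul(x, w)  # the edge from w crosses a device boundary, so
                                  # TensorFlow adds a Send on the PS side and a
                                  # matching Recv on the worker side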

Dynamic control flow
TensorFlow uses the Switch and Merge operations to implement conditionals. The while loop is more complicated: TensorFlow uses the Enter, Exit, and NextIteration operators to execute loops.
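In the user-facing TensorFlow 1.x API these primitives are reached through tf.cond and tf.while_loop, which are lowered to Switch/Merge and Enter/Exit/NextIteration respectively; a minimal sketch:

    import tensorflow as tf

    x = tf.constant(3.0)
    # Conditional: lowered to Switch and Merge operations.
    y = tf.cond(x > 0, lambda: tf.square(x), lambda: -x)

    # While loop: lowered to Enter, Exit, and NextIteration operations.
    i = tf.constant(0)
    total = tf.constant(0)
    cond = lambda i, total: tf.less(i, 5)
    body = lambda i, total: (i + 1, total + i)
    _, total = tf.while_loop(cond, body, [i, total])

    with tf.Session() as sess:
        print(sess.run([y, total]))  # [9.0, 10]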

Fault tolerance
In a natural language processing job, a model may have over 10^9 parameters with a vocabulary of 800,000 words. TensorFlow implements sparse embedding layers for such models as a composition of primitive operations in the TensorFlow graph (such as Gather), rather than as a single monolithic layer.
It is unlikely that tasks will fail so often that individual operations need fault tolerance, so recomputation-based fault tolerance in the style of Spark's RDDs adds little value here. In addition, many learning algorithms do not require strong consistency.
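A sketch of a sparse embedding lookup built from primitive operations, assuming the TensorFlow 1.x API (the vocabulary size and embedding width are illustrative):

    import tensorflow as tf

    vocabulary_size = 800000
    embedding_dim = 128

    # The embedding matrix is an ordinary dense variable in the graph.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_dim], -1.0, 1.0))

    word_ids = tf.placeholder(tf.int32, shape=[None])
    # embedding_lookup is itself composed of primitive ops (a Gather in the
    # single-shard case; dynamic partition/stitch when sharded across tasks).
    embedded = tf.nn.embedding_lookup(embeddings, word_ids)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(embedded, feed_dict={word_ids: [1, 42, 7]}).shape)  # (3, 128)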

TensorFlow fault tolerance
TensorFlow implements user-level checkpointing for fault tolerance: Save writes one or more tensors to a checkpoint file, and Restore reads one or more tensors from a checkpoint file. The checkpointing library does not attempt to produce consistent checkpoints, which is compatible with the relaxed guarantees of asynchronous SGD.
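In the TensorFlow 1.x Python API the Save and Restore operations are driven by tf.train.Saver; a minimal sketch (the checkpoint path is illustrative):

    import tensorflow as tf

    w = tf.Variable(tf.zeros([10, 10]), name="w")
    saver = tf.train.Saver()  # adds Save and Restore ops for the model's variables

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver.save(sess, "/tmp/model.ckpt")     # run periodically, e.g. every N steps

    with tf.Session() as sess:
        saver.restore(sess, "/tmp/model.ckpt")  # on restart, reload the checkpoint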

Synchronous replica coordination
There are three alternatives in the synchronization model: asynchronous replication, fully synchronous replication, and synchronous replication with backup workers. TensorFlow implements backup workers, which are similar to MapReduce backup tasks: a few extra replicas run each step, and the update is taken from the first replicas to finish, so stragglers do not delay the step.
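In the TensorFlow 1.x API this scheme is exposed through tf.train.SyncReplicasOptimizer; setting replicas_to_aggregate below total_num_replicas makes the extra replicas act as backup workers (a sketch with an illustrative toy model and replica counts):

    import tensorflow as tf

    # Toy model so the snippet is self-contained.
    w = tf.Variable(0.0)
    loss = tf.square(w - 1.0)
    global_step = tf.train.get_or_create_global_step()

    base_opt = tf.train.GradientDescentOptimizer(0.1)
    # Aggregate gradients from 4 replicas per step while running 5 worker
    # replicas: the extra replica acts as a backup worker, so one straggler
    # cannot stall the step.
    sync_opt = tf.train.SyncReplicasOptimizer(base_opt,
                                              replicas_to_aggregate=4,
                                              total_num_replicas=5)
    train_op = sync_opt.minimize(loss, global_step=global_step)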

Implementation
A C API separates user-level code from the core runtime. The distributed master translates user requests into execution across a set of tasks. The dataflow executor in each task handles requests from the master and schedules the execution of the kernels that comprise a local subgraph.
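A sketch of how a set of tasks is described and started with the TensorFlow 1.x distributed runtime (the host names and ports are illustrative):

    import tensorflow as tf

    # Describe the cluster as a set of named jobs, each a list of tasks.
    cluster = tf.train.ClusterSpec({
        "ps":     ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Each process starts one server, which hosts the dataflow executor for its
    # task and, on the master, services client requests.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # A parameter server task typically just serves requests:
    # server.join()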

Evaluation 1: single-machine benchmarks
“It is imperative that scalability does not mask poor performance at small scales…”

Evaluation 2: synchronous replica coordination
(Results are presented as charts in the original slides.)

Further work
Some users have begun to chafe at the limitations of a static dataflow graph, especially for algorithms like deep reinforcement learning. TensorFlow has also not yet worked out suitable default policies that work well for all users.