
1 Minerva: A Scalable and Highly Efficient Training Platform for Deep Learning
M. Wang, T. Xiao, J. Li, J. Zhang, C. Hong, & Z. Zhang (2014) Presentation by Cameron Hamilton

2 Overview
Problem: there is a disparity between deep learning tools oriented toward productivity and generality (e.g., MATLAB) and task-specific tools designed for speed and scale (e.g., CUDA-Convnet).
Solution: a matrix-based API, Minerva, with a MATLAB-like procedural coding style. The user's program is translated into an internal dataflow graph at runtime, which is generic enough to be executed on different types of hardware.

3 Minerva System Overview
Every training iteration has two phases: (1) generate the dataflow graph from the user code; (2) evaluate the dataflow graph. A minimal sketch of this lazy-evaluation pattern follows below.
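To make the two-phase idea concrete, here is a small, self-contained C++ sketch of the general pattern (illustrative only; Node, constant, add, and evaluate are invented for this example and are not Minerva's internals): user code merely records operations into a graph, and results are only computed when the graph is evaluated.

// Hypothetical sketch of lazy graph construction + evaluation (not Minerva's real internals).
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct Node {
    std::vector<std::shared_ptr<Node>> inputs;              // dataflow edges
    std::function<double(const std::vector<double>&)> op;   // vertex computation
    double value = 0.0;
    bool evaluated = false;
};

// Phase 1: user code only *records* operations, building the graph.
std::shared_ptr<Node> constant(double v) {
    auto n = std::make_shared<Node>();
    n->op = [v](const std::vector<double>&) { return v; };
    return n;
}

std::shared_ptr<Node> add(std::shared_ptr<Node> a, std::shared_ptr<Node> b) {
    auto n = std::make_shared<Node>();
    n->inputs = {a, b};
    n->op = [](const std::vector<double>& in) { return in[0] + in[1]; };
    return n;
}

// Phase 2: evaluate the recorded graph (post-order traversal).
double evaluate(const std::shared_ptr<Node>& n) {
    if (n->evaluated) return n->value;
    std::vector<double> in;
    for (auto& p : n->inputs) in.push_back(evaluate(p));
    n->value = n->op(in);
    n->evaluated = true;
    return n->value;
}

int main() {
    auto x = constant(2.0), w = constant(3.0);
    auto y = add(x, w);               // nothing computed yet; the graph only grows
    std::cout << evaluate(y) << "\n"; // evaluation happens here, prints 5
}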

4 Example of User Code

5 System Overview: Performance via Parallelism
Performance of deep learning algorithms depends on how many of their operations can be performed in parallel. Minerva exploits two forms of parallelism:
Model parallelism: a single model is partitioned so that multiple workers train different parts of the same model.
Data parallelism: model replicas are assigned different portions of the data set and exchange updates via a "logically centralized parameter server" (p. 4).
Minerva always evaluates on the GPU when one is available.

6 Programming Model
The Minerva API organizes a deep learning program into 3 stages.
Stage 1: Define the model architecture.
Model model;
Layer layer1 = model.AddLayer(dim);
Layer layer2 = model.AddLayer(dim2);   // layer2 must be declared before it is connected; dim, dim2 are placeholder sizes
model.AddConnection(layer1, layer2, FULL);
Stage 2: Declare the primary matrices (i.e., weights & biases) and load the data.
Matrix W = Matrix(layer2, layer1, RANDOM);
Matrix b(layer2, 1, RANDOM);
Vector<Matrix> inputs = LoadBatches(layer1, …);

7 Programming Model
Stage 3: Specify the training procedure.
Convolutional neural networks (CNNs) use a different syntax: the convolutional architecture is specified with a single call, AddConvConnect(layer1, layer2, …), and Minerva then handles the arrangement of these layers (p. 4).

8 Programming Model: Expressing Parallelism
Model parallelism:
SetPartition(layer1, 2);
SetPartition(layer2, 2);
Data parallelism:
ParameterSet pset;
pset.Add("W", W); pset.Add("V", V);
pset.Add("b", b); pset.Add("c", c);
RegisterToParameterServer(pset);
… // learning procedure here
if (epoch % 3 == 0) PushToParameterServer(pset);   // push local updates every 3 epochs
if (epoch % 6 == 0) PullFromParameterServer(pset); // pull merged parameters every 6 epochs
EvalAll();

9 Putting it All Together

10 Putting it All Together
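For reference, here is a condensed sketch that strings together the API calls shown on slides 6-8. It is not a runnable standalone program: the layer sizes (784, 256), the epoch loop, and num_epochs are hypothetical placeholders, only calls that actually appear on the earlier slides are used, and the learning-procedure body is elided, as it is on slide 8.

Model model;
Layer layer1 = model.AddLayer(784);              // placeholder sizes
Layer layer2 = model.AddLayer(256);
model.AddConnection(layer1, layer2, FULL);       // fully connected layers

Matrix W = Matrix(layer2, layer1, RANDOM);       // weights
Matrix b(layer2, 1, RANDOM);                     // biases
Vector<Matrix> inputs = LoadBatches(layer1, …);

SetPartition(layer1, 2);                         // model parallelism: 2-way partitions
SetPartition(layer2, 2);

ParameterSet pset;                               // data parallelism via the parameter server
pset.Add("W", W);
pset.Add("b", b);
RegisterToParameterServer(pset);

for (int epoch = 0; epoch < num_epochs; ++epoch) {   // num_epochs: hypothetical
    // … learning procedure here (elided, as on slide 8)
    if (epoch % 3 == 0) PushToParameterServer(pset);
    if (epoch % 6 == 0) PullFromParameterServer(pset);
}
EvalAll();                                       // trigger evaluation of the dataflow graph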

11 System Design: More on Parallelism
Within a neural network, the operations performed at each computing vertex (i.e., forward propagation, backward propagation, weight update) are predefined, so network training can be partitioned across theoretically any number of threads.
Updates are shared between local parameter servers.
Load balance: the task is divided up amongst the partitions.
Coordination and overhead: ownership of a computing vertex is determined by the location of its input and output vertices, and partitions stick to their vertices.
Locality: a vertex in layer n receives its input from layer n-1 and sends its output to layer n+1. A small ownership sketch follows below.
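As a generic illustration of the ownership idea (this is not Minerva's scheduler; OwnerOfVertex and the sizes are invented for the example), a vertex's owning partition can be computed from its index alone, so each partition knows which vertices of layer n it handles and only needs the corresponding outputs of layer n-1.

// Illustrative sketch of partition ownership; not Minerva's real scheduler.
#include <cstdio>

// Split the vertices of a layer round-robin across `parts` partitions.
int OwnerOfVertex(int vertex_index, int parts) {
    return vertex_index % parts;
}

int main() {
    const int parts = 2;          // 2-way partition, as in SetPartition(layer, 2)
    const int layer_n_size = 8;   // hypothetical layer width

    // Each partition only touches the vertices it owns, so the forward pass
    // for layer n needs only the outputs of layer n-1 (locality), and the
    // owner of a vertex is known from its index alone (low coordination).
    for (int v = 0; v < layer_n_size; ++v) {
        std::printf("vertex %d in layer n is owned by partition %d\n",
                    v, OwnerOfVertex(v, parts));
    }
    return 0;
}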

12 Model Parallelism

13 Convolutional Networks
Partitions each handle a patch of the input data; the patches are then merged and convolved with a kernel (see the toy sketch below).
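A toy 1-D C++ sketch of that flow (illustrative only, not Minerva's implementation; Convolve and the sample data are invented for the example): the input is split into patches, the patches are merged back together, and the merged signal is convolved with a small kernel.

// Toy 1-D illustration of split-into-patches -> merge -> convolve.
#include <cstddef>
#include <iostream>
#include <vector>

// "Valid" 1-D convolution of input with kernel.
std::vector<double> Convolve(const std::vector<double>& in,
                             const std::vector<double>& kernel) {
    std::vector<double> out(in.size() - kernel.size() + 1, 0.0);
    for (std::size_t i = 0; i < out.size(); ++i)
        for (std::size_t k = 0; k < kernel.size(); ++k)
            out[i] += in[i + k] * kernel[k];
    return out;
}

int main() {
    std::vector<double> input = {1, 2, 3, 4, 5, 6, 7, 8};
    std::vector<double> kernel = {0.5, 0.5};

    // Split the input into two patches, one per partition.
    std::vector<double> patch0(input.begin(), input.begin() + 4);
    std::vector<double> patch1(input.begin() + 4, input.end());

    // (Each partition could preprocess its own patch here.)

    // Merge the patches back into one buffer, then convolve.
    std::vector<double> merged = patch0;
    merged.insert(merged.end(), patch1.begin(), patch1.end());
    for (double v : Convolve(merged, kernel)) std::cout << v << " ";
    std::cout << "\n";
    return 0;
}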

14 More on Data Parallelism
Each machine/partition has its own local parameter server, which applies updates and exchanges them with its neighboring servers.
Coordination is done through a belief-propagation-like algorithm (p. 7): a server merges updates with its neighbors, then "gossips to each of them the missing portion" (a simplified sketch follows below).
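A simplified C++ sketch of the merge-and-gossip intuition (illustrative only; LocalServer and ExchangeWithNeighbor are invented for this example, and the paper's belief-propagation-like protocol is more involved): each local server accumulates its own pending update, and neighboring servers hand each other the portion the other has not yet seen.

// Simplified illustration of neighboring parameter servers exchanging updates.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

struct LocalServer {
    std::vector<double> params;    // current parameter values
    std::vector<double> pending;   // local updates not yet shared with neighbors

    explicit LocalServer(std::size_t n) : params(n, 0.0), pending(n, 0.0) {}

    void ApplyLocalUpdate(const std::vector<double>& delta) {
        for (std::size_t i = 0; i < params.size(); ++i) {
            params[i] += delta[i];
            pending[i] += delta[i];   // remember it so neighbors can receive it later
        }
    }
};

// Each server gossips to the other the portion it is missing (its pending delta).
void ExchangeWithNeighbor(LocalServer& a, LocalServer& b) {
    for (std::size_t i = 0; i < a.params.size(); ++i) {
        a.params[i] += b.pending[i];
        b.params[i] += a.pending[i];
    }
    std::fill(a.pending.begin(), a.pending.end(), 0.0);
    std::fill(b.pending.begin(), b.pending.end(), 0.0);
}

int main() {
    LocalServer s0(2), s1(2);
    s0.ApplyLocalUpdate({0.1, 0.0});
    s1.ApplyLocalUpdate({0.0, 0.2});
    ExchangeWithNeighbor(s0, s1);     // both now hold {0.1, 0.2}
    std::cout << s0.params[0] << " " << s0.params[1] << "\n";
}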

15 Experience and Evaluation
Minerva implementation highlights:
ImageNet 1K classification task with a CNN (Krizhevsky et al., 2012): 42.7% top-1 error rate; 15x faster than the MATLAB implementation; 4.6x faster with a 16-way partition on a 16-core machine than with no partitioning.
Speech-net: 1100 input neurons, 8 hidden layers of 2000 sigmoid neurons each, and a 9000-unit softmax output layer; 1.5-2x faster than the MATLAB implementation.
RNN: 10000 input units, 1000 hidden units, flat outputs.

16 Experience and Evaluation
Scaling up (Figure 8): with a mini-batch size of 128, Minerva (GPU) trained the CNN faster than Caffe did with mini-batch sizes of 256 and 512.

17 Experience and Evaluation

18 Experience and Evaluation

19 Experience and Evaluation

20 Conclusion
A powerful and versatile framework for big data and deep learning.
Pipelining may be preferable to partitioned fully connected layers, which cause heavy cross-partition traffic.
My Comments
Lacks a restricted Boltzmann machine (RBM), so a deep belief network (DBN) is not currently possible.
The API appears concise and readable.
Lacks an algorithm for genetic design of the network (e.g., NEAT); however, population generation would be ideal for partitioning.
Not clear how Minerva handles situations where the partitions do not evenly divide the number of nodes within a given layer.

21 References
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
Wang, M., Xiao, T., Li, J., Zhang, J., Hong, C., & Zhang, Z. (2014). Minerva: A scalable and highly efficient training platform for deep learning.
All figures appearing within this presentation are borrowed from Wang et al. (2014).

