High Performance Parallel Stochastic Gradient Descent

1 High Performance Parallel Stochastic Gradient Descent
in Shared Memory. Scott Sallinen (1,2), Nadathur Satish (2), Mikhail Smelyanskiy (2), Samantika Sury (2), Christopher Ré (3). 1: University of British Columbia, 2: Intel Corporation, 3: Stanford University.

2 Overview of Regression and Stochastic Gradient Descent
Agenda: Overview of Regression & SGD; Parallelizing SGD; Experimental Results; Comparison to State of the Art. (This section: Overview of Regression and Stochastic Gradient Descent.)

3 Regression
The goal of regression is to model and analyze data. Many types: linear, polynomial, least squares, logistic, and more. These are generally sparse, convex problems, and the training strategy is representative of many machine learning training techniques.

4 Single Model Regression
We want to create a model M based on an input dataset X and its corresponding training labels Y. [Figure: dataset X (N samples × D features), labels Y (N × 1), model M (D × 1 weight vector).]

5 Stochastic Gradient Descent
Stochastic Gradient Method [Robbins & Monro, 1951]: select an index i, then update M_{t+1} = M_t - a_t * g_i(M_t). Much faster per iteration than full gradient descent. Note: each model is a direct descendant of the previous one. [Figure: sample i of dataset X, labels Y, model M (weights).]
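As a concrete illustration, here is a minimal serial SGD sketch in C for logistic regression (dense data for brevity; the a/sqrt(t) learning-rate schedule follows the slides, but the names and layout are illustrative rather than the paper's implementation):

#include <math.h>

/* One pass of serial SGD for logistic regression (dense, illustrative). */
void sgd_epoch(const float *X, const float *Y, float *M,
               int num_samples, int num_features, float a) {
    for (int t = 0; t < num_samples; t++) {
        int i = t;                                   /* select index i (sequential here) */
        float dot = 0.0f;
        for (int j = 0; j < num_features; j++)       /* gather: x_i . M */
            dot += X[i * num_features + j] * M[j];
        float g = -Y[i] / (1.0f + expf(Y[i] * dot)); /* logistic-loss gradient scale */
        float rate = a / sqrtf((float)(t + 1));      /* a_t = a / sqrt(t) */
        for (int j = 0; j < num_features; j++)       /* scatter: M_{t+1} = M_t - a_t * g_i(M_t) */
            M[j] -= rate * g * X[i * num_features + j];
    }
}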

6 Stochastic Gradient Descent (Convex)
Training looks something like this: [animation of gradient descent converging on a convex surface. Source: Adam Harley.]

7 Many faces of SGD
SGD is not only for logistic regression: it also trains other regressions, support vector machines, least mean squares, neural networks, and more. There are many options for each step of the pipeline: Select Sample(s) → Compute Gradient → Update Model.

8 Parallelizing Stochastic Gradient Descent
Agenda: Overview of Regression & SGD; Parallelizing SGD; Experimental Results; Comparison to State of the Art. (This section: Parallelizing Stochastic Gradient Descent.)

9 Opportunities for Parallelism
Typically, problems (data sets) are sparse. The amount of parallelism available in the per-sample vector computation is dynamic (the number of non-zeros in the given row) and typically small, so using threads there is non-ideal; we should parallelize across samples instead. Pipeline choices: Select Sample(s) can use all samples (GD), 1 sample (SGD), S samples (mini-batch), or T samples (Hogwild); Compute Gradient uses the logistic function; Update Model uses the learning rate α/sqrt(iter).

10 Mini-Batching
Select an index b, then update over the batch: M_{t+S} = M_t - a_t * Σ_{i=b}^{b+S} g_i(M_t). Convergence lies between full gradient descent and stochastic gradient descent, depending on the size of the batch. Parallelize across samples within the batch (similar to SpMDV). [Figure: a batch of S samples starting at row b, split across threads T1 and T2; dataset X, labels Y, model M (weights).]

11 Mini-Batching: Key Aspects
One model update per batch; private per-thread gradient vectors; a reduction step; a thread barrier between updates (synchronization); no conflicts, but "stale": each update is a descendant of the model from the previous batch rather than of the previous sample. [Figure: a batch of S samples split across threads T1 and T2; dataset X, labels Y, model M (weights).]

12 Hogwild
Each thread: select an index i, then update M_{t+1} = M_t - a_t * g_i(M_t). Each model update is not guaranteed to be a direct descendant of the previous one. Write races are acceptable since the algorithm "refines" anyway; moreover, when sparse updates touch different indices in the weight vector, there is no dependency at all. [Figure: dataset X, labels Y, shared model M (weights).]

13 Hogwild: Key Aspects
Thread asynchronicity; no reductions (asynchronous SpVDV); one update for every sample; potential for conflicts, but not stale, since the model is shared; cross-core false sharing of the model vector. False sharing: indices reside on the same cache line but are not actually conflicting. [Figure: dataset X, labels Y, shared model M (weights), updated by a parallel for-all.]
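To make the pattern concrete, here is a minimal Hogwild-style sketch in C with OpenMP (illustrative only, not the paper's implementation; the CSR-style Sample struct and all names are assumptions): every thread reads the shared model and scatters its sparse update without any locks.

#include <omp.h>
#include <math.h>

/* Assumed sparse-row layout (illustrative): idx/val hold the non-zeros of one sample. */
typedef struct { const int *idx; const float *val; int nnz; float y; } Sample;

void hogwild_epoch(const Sample *samples, int num_samples, float *model, float a) {
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < num_samples; i++) {
        const Sample *s = &samples[i];
        float dot = 0.0f;
        for (int k = 0; k < s->nnz; k++)             /* gather: sparse dot product */
            dot += s->val[k] * model[s->idx[k]];
        float g = -s->y / (1.0f + expf(s->y * dot)); /* logistic-loss gradient scale */
        float rate = a / sqrtf((float)(i + 1));
        for (int k = 0; k < s->nnz; k++)             /* scatter: unsynchronized write to the shared model */
            model[s->idx[k]] -= rate * g * s->val[k];
    }
}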

14 Summary: Hogwild vs Mini-Batching
Hogwild: thread asynchronicity; one update for every sample; no reductions, asynchronous updates; cross-core sharing of the model vector; potential for conflicts, but not stale. Mini-Batching: a thread barrier between updates; one update per batch; a reduction; private gradient vectors; no conflicts, but stale. The goal is to unify the two.

15 Introducing: Hogwild Batching (HogBatching)
Each thread: select an index b, then update M_{t+S} = M_t - a_t * Σ_{i=b}^{b+S} g_i(M_t). Parallelize across batches instead of across samples within a batch. Statistical efficiency (convergence): at best as good as Hogwild, at worst as poor as Mini-Batching. [Figure: threads T1, T2, T3 each own a batch of S samples; dataset X, labels Y, shared model M (weights).]

16 HogBatching: Hierarchical Parallelism
To extend to many-core, we need to expose more parallelism. Solution: divide the work into inner and outer parallelism. Many HogBatches run asynchronously, and the internals of each HogBatch are just a small SGD problem. [Figure: several batches of S samples in flight over dataset X, labels Y, model M (weights).]

17 HogBatching: Hierarchical Parallelism
Groups of threads work on batches as the outer level of parallelism (parallel across batches); the entire group applies only one update to the model vector. Workers within a group provide the inner level of parallelism, in the form of a small mini-batch (parallel across samples). SIMD within each worker processes the sample's vector (parallel across elements). This is a perfect fit for multiple threads on the same core and cache (e.g., hyperthreading).
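A rough sketch of this outer/inner split using nested OpenMP parallelism (a sketch under assumed names and dense data, not the paper's code; the OpenMP 4.5 array-section reduction is one possible way to realize the group-private gradient):

#include <stdlib.h>
#include <math.h>
#include <omp.h>

/* Hierarchical HogBatching sketch (illustrative):
 *  - outer parallelism: one group of threads per batch
 *  - inner parallelism: the group's workers share the samples of the batch
 *  - SIMD: vectorization across the features of each sample */
void hogbatch_hierarchical(const float *X, const float *Y, float *model,
                           int num_samples, int num_features,
                           int batch_size, int groups, int workers, float a) {
    omp_set_max_active_levels(2);                            /* allow nested parallel regions */
    #pragma omp parallel for num_threads(groups) schedule(dynamic)
    for (int b = 0; b < num_samples; b += batch_size) {      /* outer: batches */
        float *grad = calloc((size_t)num_features, sizeof(float));
        int end = (b + batch_size < num_samples) ? b + batch_size : num_samples;
        #pragma omp parallel for num_threads(workers) reduction(+: grad[0:num_features])
        for (int i = b; i < end; i++) {                      /* inner: samples of this batch */
            float dot = 0.0f;
            #pragma omp simd reduction(+:dot)                /* SIMD across features */
            for (int j = 0; j < num_features; j++)
                dot += X[i * num_features + j] * model[j];
            float g = -Y[i] / (1.0f + expf(Y[i] * dot));
            for (int j = 0; j < num_features; j++)
                grad[j] += g * X[i * num_features + j];
        }
        float rate = a / sqrtf((float)(b / batch_size + 1));
        for (int f = 0; f < num_features; f++)               /* one model update per batch */
            model[f] -= rate * grad[f];
        free(grad);
    }
}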

18 HogBatch: Note on Algorithm Identities
This creates a nice "bridge", or identity, between the three methods: HogBatch with a batch size of 1 is just Hogwild, and HogBatch with an outer parallelism of 1 is just Mini-Batching. Further, all three methods are functionally equivalent to serial SGD when executed with one thread. HogBatching is thus a general solution that sits between the previous two methods.

19 Experimental Results
Agenda: Overview of Regression & SGD; Parallelizing SGD; Experimental Results; Comparison to State of the Art. (This section: Experimental Results.)

20 Parallel scaling of Strategies: Sparse Problem (0.155% nz)
RCV1-test dataset: a good case for Hogwild, a poor case for batching. The large feature size implies a large reduction and a large dense update. Features (columns) tend to follow a power law; sparse does not imply uniformly sparse.
Dataset rcv1-test: 677,399 examples; 47,236 features; 49,556,258 non-zeros (0.155% sparsity); 4 to 1,224 non-zeros per row (average 73.157).

21 Parallel scaling of Strategies: Sparse Problem (0.155% nz)
Raw "time to solution" is time-per-datapass × number-of-datapasses, and measuring it this way matters. Mini-Batching has excellent hardware efficiency (low time per datapass); Hogwild has excellent statistical efficiency (few datapasses); each suffers in the other aspect. HogBatching takes the best of both worlds for a 2.4x improvement. Speedup is measured against serial on the "effective time to solution" to within 99.5% of optimal, which is a product of both statistical and hardware efficiency.
Dataset rcv1-test: 677,399 examples; 47,236 features; 49,556,258 non-zeros (0.155% sparsity); 4 to 1,224 non-zeros per row (average 73.157).

22 Parallel scaling of Strategies: Denser Problem (22% nz)
Raw "time to solution" is time-per-datapass × number-of-datapasses. Hogwild suffers in hardware efficiency due to false sharing of the small model: there is constant write pressure on the shared weight vector, whose size is the number of features. Mini-Batching suffers in hardware efficiency because threads synchronize constantly during small, rapid batches: batches must be small to get good convergence and complete quickly due to the small feature size (a larger batch caused oscillations or poorer performance). HogBatching suffers no hardware inefficiencies and achieves a 20x performance improvement. Speedup is measured against serial on the "effective time to solution" to within 99.5% of optimal, a product of both statistical and hardware efficiency.
Dataset covtype: 581,012 examples; 54 features; 6,940,438 non-zeros (22.121% sparsity); 9 to 12 non-zeros per row (average 11.945).

23 Comparison to state-of-the-art
Agenda: Overview of Regression & SGD; Parallelizing SGD; Experimental Results; Comparison to State of the Art. (This section: Comparison to State of the Art.)

24 BIDMach: a state-of-the-art framework that has been making noise because its GPU implementation is much faster than CPU solutions.

25 Single Model Comparison
Comparison to BIDMach: an apples-to-apples comparison.* We set all parameters (learning rate, batch size, regularization) to the same values as BIDMach, run BIDMach on our machine, and use the ADAGRAD update since that is what BIDMach uses. We discussed with the authors to ensure we were evaluating correctly, and were told that the single-model case is not an optimized case for BIDMach.
Implementation / hardware / time per pass (ms): BIDMach on TITAN X: 723; BIDMach on Sandy Bridge: 14,190; Intel (Batching): 289; Intel (Hogwild): 253; Intel (HogBatching): 147; Intel (HogBatching) on Haswell: 111. Dataset: RCV1-V2, 1 model, ADAGRAD update, single precision, 1-socket CPU.
Dataset RCV1-V2: 781,265 examples; 276,544 features; 60,534,218 non-zeros (0.028% sparsity); 4 to 1,585 non-zeros per row (average 77.482).

26 Multi Model Comparison
Comparison to BIDMach: an apples-to-apples comparison. Time per pass per model begins to level off as the SIMD units become saturated. With 2 CPUs running ~64 models each, scaling is very good. This also shows the importance of using strong baselines in CPU-vs-GPU comparisons.
Hardware / time per pass for 103 models (ms): BIDMach on TITAN X: 2,170; BIDMach on Sandy Bridge: 120,720; Intel: 2,010; Intel on Haswell: 1,283; Intel on 2x Haswell: 724. Dataset: RCV1-V2, single precision, 1-socket CPU except the last result.
Dataset RCV1-V2: 781,265 examples; 276,544 features; 60,534,218 non-zeros (0.028% sparsity); 4 to 1,585 non-zeros per row (average 77.482).

27 Questions? Lessons Learned
Hardware efficiency is extremely important during training and needs to be considered as well; do not focus only on statistical efficiency, since total time is a product of both: "time-per-datapass" × "number-of-datapasses". When bounded by inter-core communication, using privatization and asynchronicity together is key to improving performance. Questions? Contact:

28 Backup slides

29 Multi Model Regression
Agenda: Overview of Regression & SGD; Parallelizing SGD; Experimental Results; Comparison to State of the Art. (Backup section: Multi Model Regression.)

30 SGD, Multi Model Case
Select an index i; then, for each model m = 1..M, update M_m[t+1] = M_m[t] - a_t * g_m[i](M_m[t]). Easy to parallelize across models. [Figure: dataset X (N × D), label matrix Y (N × M), model matrix (D × M weights).]

31 SGD, Multi Model Case
Easy to parallelize: SIMD units work across the static set of models instead of across dynamically sized sample vectors, so each written index in the model is SIMD-friendly. Batching is no longer a viable strategy due to the increased size of the labels and models: they are dense and now two-dimensional, and can no longer be duplicated as thread-private temporary vectors that fit in core cache. HogBatching is therefore not useful here. [Figure: dataset X (N × D), label matrix Y (N × M), model matrix (D × M weights).]
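A rough sketch of this multi-model update in C (illustrative; the layout with the model index as the innermost, contiguous dimension is an assumption consistent with the slide, and NUM_MODELS is a hypothetical constant):

#include <math.h>

#define NUM_MODELS 64

/* Update all NUM_MODELS models with one shared sparse sample (illustrative).
 * model[j][m] = weight j of model m; y[m] = label of this sample for model m. */
void multimodel_update(const int *idx, const float *val, int nnz,
                       const float y[NUM_MODELS],
                       float (*model)[NUM_MODELS], float rate) {
    float dot[NUM_MODELS] = {0};
    for (int k = 0; k < nnz; k++)               /* sparse gather over the sample */
        for (int m = 0; m < NUM_MODELS; m++)    /* contiguous, SIMD-friendly */
            dot[m] += val[k] * model[idx[k]][m];

    float g[NUM_MODELS];
    for (int m = 0; m < NUM_MODELS; m++)
        g[m] = -y[m] / (1.0f + expf(y[m] * dot[m]));

    for (int k = 0; k < nnz; k++)               /* sparse scatter */
        for (int m = 0; m < NUM_MODELS; m++)    /* contiguous, SIMD-friendly */
            model[idx[k]][m] -= rate * g[m] * val[k];
}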

32 Multi Model Scaling
The increase in parallelism offers near-linear scaling until saturation at around 16 models. The ADAGRAD update is also shown; convergence comparisons and arguments about which update is better are best left for another time. (Note: log-log axes; single precision.)
Dataset RCV1-V2: 781,265 examples; 276,544 features; 60,534,218 non-zeros (0.028% sparsity); 4 to 1,585 non-zeros per row (average 77.482).

33 Calculating error: Logistic Regression
A common model used to perform regression. Logistic function: 1 / (1 + e^(-t)).
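For reference, a one-line C version of the logistic function (a plain sketch, not code from the paper):

#include <math.h>

/* logistic (sigmoid) function: 1 / (1 + e^(-t)) */
static inline float logistic(float t) {
    return 1.0f / (1.0f + expf(-t));
}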

34 Many faces of SGD: Update Model
How the gradient is applied to the model weights. Options include a plain learning rate (SGD), ADAGRAD, Average Gradient (SAG), and Momentum. (Pipeline: Select Sample(s) → Compute Gradient → Update Model.)
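As one example of these update rules, here is a minimal ADAGRAD-style update in C (a generic sketch of the technique, not the paper's implementation; the names and the eps term are illustrative): each weight keeps a running sum of its squared gradients and scales its own step size by the inverse square root of that sum.

#include <math.h>

/* Per-coordinate ADAGRAD-style update (illustrative).
 * gsq[j] accumulates the squared gradients seen so far for weight j. */
void adagrad_update(float *model, float *gsq, const float *grad,
                    int num_features, float a, float eps) {
    for (int j = 0; j < num_features; j++) {
        gsq[j] += grad[j] * grad[j];
        model[j] -= a * grad[j] / (sqrtf(gsq[j]) + eps);
    }
}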

35 Many faces of SGD: Compute Gradient
How the gradient is computed. Options include a linear function, the logistic function, support vector machines, and least squares. (Pipeline: Select Sample(s) → Compute Gradient → Update Model.)

36 Many faces of SGD: Select Sample(s)
How we choose samples for an update: all samples (GD), 1 sample (SGD), B samples (batch SGD), or T samples (Hogwild). (Pipeline: Select Sample(s) → Compute Gradient → Update Model.)

37 Single Model Regression
Deterministic (full) Gradient Method [Cauchy, 1847]: M_{t+1} = M_t - (a_t / N) * Σ_{i=1}^{N} g_i(M_t). The iteration cost is linear in N × D. [Figure: dataset X (N × D), labels Y, model M (weights).]

38 Serial SGD: Application Pattern
Semi-expanded algorithm (logistic regression); regularization is not shown for simplicity.

for (index = 0; index < Samples; index++) {
  // Gather: sparse dot product of the sample with the model.
  for (non-zero indices j of X[index]) {
    dotProduct += X[index][j] * model[j]
  }
  // Apply: gradient scale and learning rate.
  g = y[index] / (1 + exp(y[index] * dotProduct))
  update = a/sqrt(index) * g
  // Scatter: sparse update of the model.
  for (non-zero indices j of X[index]) {
    model[j] = model[j] - update
  }
}

39 Serial SGD: Dependency Pattern
Expanded logistic regression (regularization is not shown for simplicity):

for (index = 0; index < Samples; index++) {
  for (non-zero indices j of X[index]) {
    dotProd += X[index][j] * model[j]
  }
  g = y[index] / (1 + exp(y[index] * dotProd))
  update = a/sqrt(index) * g
  for (non-zero indices j of X[index]) {
    model[j] = model[j] - update
  }
}

The update only touches the sparse indices of the model vector. [Figure: dataset X, labels Y, model M (weights).]

40 Batching
The model is read-only during the batch and is written once at the end of the batch, by all threads.

for (st = 0; st < num_samples; st += SIZE) {
  #pragma omp parallel for
  for (index = st; index < st + SIZE; index++) {
    // Sparse vector operation: accumulate into this thread's private gradient vector.
    g_tid[TID] += a * Gradient(model, index)
  }
  // (implicit thread barrier)
  // Reduction: all threads' private gradients are applied to the model.
  for (f = 0; f < num_features; f++) {
    for (t = 0; t < NUM_THREADS; t++)
      model[f] = model[f] - g_tid[t][f]
  }
}

41 Introducing: HogBatching
The model is read-only (to the current thread) during the batch and is written once at the end of the batch, by that thread.

#pragma omp parallel for schedule(dynamic)
for (st = 0; st < num_samples; st += SIZE) {
  for (index = st; index < st + SIZE; index++) {
    // Sparse vector operation: accumulate into this thread's private gradient vector.
    g_tid[TID] += a * Gradient(model, index)
  }
  // This thread alone applies its batch's update to the model.
  for (f = 0; f < num_features; f++)
    model[f] = model[f] - g_tid[TID][f]
}

42 Across datasets
HogBatch offers a speedup on most datasets we tested, across sparsities spanning three orders of magnitude. The speedup trend appears correlated with smaller feature sizes and denser problems. No free lunch: pure Hogwild can be a better solution for massively sparse problems with an obnoxiously large feature size.

Dataset / # features / sparsity (%) / speedup vs. best:
news20: 1,355,191 / 0.034 / 0.86x
rcv1-v2: 276,544 / 0.028 / 1.87x
rcv1-test: 47,236 / 0.155 / 2.43x
real-sim: 20,958 / 0.245 / 3.85x
w8a: 300 / 3.884 / 8.97x
connect4: 126 / 33.333 / 5.81x
covtype: 54 / 22.121 / 20.16x

Speedup of HogBatch over the best alternative solution (serial, SGD Batch, or Hogwild), run with all 28 threads on 14 cores.

43 Staleness of Updates
For best convergence, we want each update to the model to be based on the most recent model possible (as in serial SGD). Staleness is our measure of a thread's unseen model updates between computing and applying its own update. With T = #threads, S = #samples per SGD batch, HS = #samples per HogBatch, and the example T = 8, S = 1024, HS = S/T:

Method / min stale (for the final sample in a batch; best case, dividing out asynchronicity) / max stale (worst case) / example <min, max>:
Hogwild: T-1 → <0, 7>
SGD Batch: S → <1024, 1024>
HogBatch: HS, T*HS → <0, 1024>

44 Improvements to Staleness
This improves staleness for batches; experimentally, up to a 30% improvement in convergence per unit time. With T = #threads, S = #samples per SGD batch, HS = #samples per HogBatch, and the example T = 8, S = 1024, HS = S/T:

Method / min stale (for the final sample in a batch; subtract the staleness of the thread's own samples) / max stale / example <min, max>:
Hogwild: T-1 → <0, 7>
SGDBatch: S - (S/T), S - (S/T) → <896, 896>
HogBatch: HS - HS, (T*HS) - HS → <0, 896>

45 Improvements to Staleness
Batch version (the gradient is evaluated at the thread's locally updated model, w + p_tid):

for (st = 0; st < Samples; st += SIZE) {
  #pragma omp parallel for
  for (index = st; index < st + SIZE; index++) {
    p_tid += w - a * g(index, [w + p_tid])
  }
  // synchronize
  #pragma omp parallel for **
  for (t = 0; t < numThreads; t++) {
    model = model + p_tid
  }
  // ** actual reduction done without conflict
}

HogBatch version:

#pragma omp parallel for schedule(dynamic)
for (st = 0; st < Samples; st += SIZE) {
  for (index = st; index < st + SIZE; index++) {
    p_tid += w - a * g(index, [w + p_tid])
  }
  model = model + p_tid
}

This reduces staleness by a factor of [SIZE / numThreads], since each thread aggregates its local update.

46 Improvements to Staleness
(Same code as the previous slide.) The key point highlighted here: keep w read-only during the batch, and write w once at the end of the batch.

47 HogBatch: Note On Batch Size
Mathematically, the HogBatch size should be the regular mini-batch size divided by the number of threads, to achieve the same number of "samples in flight": e.g., 4 threads working together on one batch of 100, versus 4 threads working individually on batches of 25 each. Due to the reduction in average staleness, this is not necessarily true; in practice we can choose a larger batch size.

48 Parallel scaling of Strategies: Sparse Problem
RCV1-test dataset, frequency scaling. Hogwild: it is better to double the frequency than to double the core count; a core-to-core interconnect speedup is needed. Batching: doubling the frequency or the core count is nearly equivalent, because the private vectors cause no core-to-core traffic. (Measured as time, in seconds, to 99.5% of optimal.)
Dataset rcv1-test: 677,399 examples; 47,236 features; 49,556,258 non-zeros (0.155% sparsity); 4 to 1,224 non-zeros per row (average 73.157).

49 Raw time data
[Figures: raw time data for RCV1 and Covtype.]

