
1 Integrating and Optimizing Transactional Memory in a Data Mining Middleware
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engg.
The Ohio State University
Columbus, Ohio - 43210

2 Outline
Motivation
Software Transactional Memory
Shared Memory Parallelization Schemes
Transactional Locking II (TL2)
Hybrid Replicated STM scheme
FREERIDE Processing Structure
Experimental Results
Conclusions
October 24, 2015

3 Motivation
Availability of large data for analysis
–On the scale of terabytes and petabytes
Advent of multi-core and many-core architectures
–Intel’s Polaris 80-core chip
–Larrabee many-core architecture
Programmability challenge
–Coarse-grained: performance not sufficient
–Fine-grained: better left to experts
Need for a transparent, scalable shared-memory parallelization technique

4 Software Transactional Memory (STM)
Maps concurrent transactions in databases to concurrent thread operations
Programmer
–Identifies critical sections
–Tags them as transactions
–Launches multiple threads
Transactions run as atomic and isolated operations
Data races are handled automatically
Guarantees absence of deadlock!

5 Contributions

6 FREERIDE Processing Structure (Framework for Rapid Implementation of Datamining Engines)
{* Outer sequential loop *}
While( ) {
    {* Reduction loop *}
    Foreach( element e ) {
        (i, val) = compute(e)
        RObj(i) = Reduc(RObj(i), val)
    }
}
Map-reduce
–Two-stage
FREERIDE
–One-stage
–Intermediate structure (the reduction object) exposed
–Better performance than map-reduce [Cluster ‘09]
Middleware API
–Process each data instance
–Reduce the result into the reduction object
–Local combination from all threads if needed
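The reduction loop on this slide can be sketched in a few lines of Python. This is only an illustration of the generalized-reduction pattern, not the FREERIDE API: `compute`, `reduc` and `reduction_pass` are hypothetical names, and the parity-bucketing workload is made up for the example.

```python
from collections import defaultdict


def compute(e):
    """Hypothetical per-element step: map an integer to (index, value)."""
    return e % 2, e


def reduc(acc, val):
    """Hypothetical associative, commutative reduction (here: a sum)."""
    return acc + val


def reduction_pass(data):
    """One pass of the reduction loop over all elements."""
    robj = defaultdict(int)  # the reduction object, indexed by i
    for e in data:
        i, val = compute(e)
        robj[i] = reduc(robj[i], val)
    return dict(robj)


sums_by_parity = reduction_pass(range(6))
```

Because `reduc` is associative and commutative, elements can be processed in any order, which is what makes the loop a candidate for shared-memory parallelization.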

7 Shared-Memory Parallelization Techniques
Context: FREERIDE (Framework for Rapid Implementation of Datamining Engines)
Replication-based (lock-free)
–Full-replication (f-r)
Lock-based
–Full-locking
–Cache-sensitive locking (cs-l)
[Figure: memory layouts for full locking and cache-sensitive locking, showing locks and reduction elements]
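A minimal Python sketch of two of the schemes listed above, assuming a reduction object that is a flat list of sums. The function names and the `(index, value)` chunk format are assumptions made for illustration. Cache-sensitive locking is a matter of memory layout and is not modelled here.

```python
import threading


def _run(worker, n_threads):
    """Start one thread per chunk and wait for all of them."""
    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def full_replication(chunks, size):
    """Each thread updates a private copy; copies are merged at the end."""
    copies = [[0] * size for _ in chunks]

    def worker(tid):
        local = copies[tid]
        for i, val in chunks[tid]:
            local[i] += val  # no lock needed: the copy is private

    _run(worker, len(chunks))
    merged = [0] * size
    for copy in copies:  # merge step, cost grows with the number of copies
        for i, val in enumerate(copy):
            merged[i] += val
    return merged


def full_locking(chunks, size):
    """One shared copy with one lock per reduction element."""
    robj = [0] * size
    locks = [threading.Lock() for _ in range(size)]

    def worker(tid):
        for i, val in chunks[tid]:
            with locks[i]:
                robj[i] += val

    _run(worker, len(chunks))
    return robj
```

The trade-off is visible in the data structures: full replication allocates one copy of the reduction object per thread, while full locking allocates one lock per element.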

8 Motivation for STM Integration
Potential downsides of existing schemes [CCGRID ‘09]
Full-replication
–Very high memory requirements
Cache-sensitive locking
–Tuned for a specific cache architecture
–Risk of introducing bugs and deadlocks when porting
Advantages of STM
Leverages a large body of STM work
–Easier programmability
–No deadlocks!
Provides transparent integration
–Programmer need not deal with STM details
What do we need?
Use the easy programmability of STM
Achieve competitive performance

9 Transactional Locking II (TL2)
Word-based, lock-based algorithm
Faster than non-blocking STM techniques
API
–STMBeginTransaction()
–STMWrite()
–STMRead()
–STMCommit()
We used Rochester STM (RSTM-TL2)
Downside of STM
–A large number of conflicts leads to a large number of aborts
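To make the begin/read/write/commit cycle concrete, here is a toy, TL2-style STM in Python. It is a simplified sketch under stated assumptions, not RSTM's implementation or interface: `TVar`, `Transaction`, `VersionClock` and `atomically` are hypothetical names, versioning is per object rather than per word, and a conflict simply aborts and retries.

```python
import threading


class Abort(Exception):
    """Raised when a transaction detects a conflict and must retry."""


class VersionClock:
    """Global version clock shared by all transactions."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def now(self):
        return self._value

    def tick(self):
        with self._lock:
            self._value += 1
            return self._value


CLOCK = VersionClock()


class TVar:
    """A shared location with a version stamp and a write lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


class Transaction:
    def __init__(self):  # plays the role of STMBeginTransaction()
        self.rv = CLOCK.now()  # read version sampled at start
        self.reads = set()
        self.writes = {}

    def read(self, tvar):  # plays the role of STMRead()
        if tvar in self.writes:
            return self.writes[tvar]
        before = tvar.version
        value = tvar.value
        # Abort if a writer holds the location or it changed since we began.
        if tvar.lock.locked() or tvar.version != before or before > self.rv:
            raise Abort
        self.reads.add(tvar)
        return value

    def write(self, tvar, value):  # plays the role of STMWrite()
        self.writes[tvar] = value  # buffered until commit

    def commit(self):  # plays the role of STMCommit()
        acquired = []
        try:
            for tvar in self.writes:
                if not tvar.lock.acquire(blocking=False):
                    raise Abort  # never wait on a lock, so no deadlock
                acquired.append(tvar)
            wv = CLOCK.tick()
            for tvar in self.reads:  # revalidate the read set
                locked_by_other = tvar.lock.locked() and tvar not in self.writes
                if tvar.version > self.rv or locked_by_other:
                    raise Abort
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version = wv
        finally:
            for tvar in acquired:
                tvar.lock.release()


def atomically(fn):
    """Run fn(tx) as a transaction, retrying on every abort."""
    while True:
        tx = Transaction()
        try:
            result = fn(tx)
            tx.commit()
            return result
        except Abort:
            continue
```

The retry loop in `atomically` is where the downside noted on the slide shows up: when many threads update the same location, most attempts fail validation and are rerun.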

10 Optimization – Hybrid Replicated STM (rep-stm)
Best of two worlds
–Replication
–STM
Replicated STM
–Group ‘n’ threads into ‘m’ groups
–‘m’ copies of the reduction object
–Each group of threads has a private copy
–The n/m threads within a group share that copy using STM
Advantages of replicated STM
–Fewer reduction object copies than full replication
–Lower merge overhead
–Fewer conflicts than plain STM
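The grouping idea can be sketched as follows. This is an illustration under assumptions, not the authors' code: `rep_stm_reduce` is a hypothetical name, threads are assigned to groups round-robin (the slide does not say how), and a plain per-copy lock stands in for the transactional update that rep-stm would perform with STM.

```python
import threading


def rep_stm_reduce(chunks, size, m):
    """Reduce with n = len(chunks) threads sharing m copies of the reduction object."""
    n = len(chunks)
    copies = [[0] * size for _ in range(m)]
    # Stand-in for STM: one guard per copy, shared by that group's threads.
    guards = [threading.Lock() for _ in range(m)]

    def worker(tid):
        group = tid % m  # assumed round-robin thread-to-group mapping
        for i, val in chunks[tid]:
            with guards[group]:  # only threads of the same group can conflict
                copies[group][i] += val

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    merged = [0] * size
    for copy in copies:  # merge m copies instead of n
        for i, val in enumerate(copy):
            merged[i] += val
    return merged
```

With `m = 1` every thread shares a single copy, as in the plain STM scheme; with `m = n` each thread has its own copy, as in full replication. Values in between trade memory and merge cost against the number of threads that can conflict on one copy.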

11 Experimental Goals
Setup
–Intel Xeon E5345 processors
–Two quad-core processors (8 cores), each core at 2.33 GHz
–6 GB main memory
–8 MB L2 cache
Goals
–Compare f-r, cs-l, TL2 and rep-stm for three data mining algorithms: K-means, Expectation Maximization (E-M) and Principal Component Analysis (PCA)
–Evaluate different read-write mixes
–Evaluate conflicts and aborts

12 Parallel Efficiency of PCA
Principal Component Analysis
–8.5 GB data
–Best result: rep-stm (6.1x)
Observations
–All techniques are competitive
PCA specific
–Computation is high for finding the covariance matrix
–Amortizes the revalidation and the acquire/release of locks
–STM overheads: 2.3%
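As a worked example of the arithmetic, the helper below converts a speedup into parallel efficiency. It assumes that the 6.1x figure is a speedup over a single thread and that it was measured on all 8 cores of the test machine; the slide states neither explicitly, and `parallel_efficiency` is a hypothetical helper name.

```python
def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear speedup actually achieved."""
    if threads <= 0:
        raise ValueError("threads must be positive")
    return speedup / threads


# Assumption: 6.1x speedup measured with 8 threads.
pca_efficiency = parallel_efficiency(6.1, 8)
```

Under those assumptions the best PCA result corresponds to roughly 76% of ideal linear scaling.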

13 Parallel Efficiency of EM
Expectation-Maximization (EM)
–6.4 GB data
–Best result: cs-l (~5x)
Observations
–STM schemes are competitive
–STM schemes have better scalability
–Difference between stm-TL2 and rep-stm not observed with 8 cores
EM specific
–Computation between updates is high
–Again, initial overhead is high

14 Canonical Loop – Parallel Efficiency for Read-write Mixes
Canonical loop
–Synthetic computation
–Follows generalized reduction
–Different workloads with varying R/W mix
–All results from 8 threads
Interesting
–Different winner for each workload

15 Evaluation of Conflicts and Aborts
Same canonical loop
Compare rate of aborts for stm-TL2 and rep-stm
Demonstrates the advantage of rep-stm over stm-TL2 for a large number of threads
In all cases, for rep-stm
–Rate of growth of aborts is much slower
–Reduces aborts by 40-55%

16 Conclusions
Transparent use of STM schemes
Developed Hybrid Replicated-STM to reduce
–Memory requirements
–Conflicts/aborts
TL2 and rep-stm are competitive with the highly tuned locking scheme
rep-stm significantly reduces the number of aborts compared with TL2

17 Thank You! Questions?
Contacts:
Vignesh Ravi - raviv@cse.ohio-state.edu
Gagan Agrawal - agrawal@cse.ohio-state.edu

18 Parallel Efficiency of K-means
K-means clustering
–6 GB data, k=250
–Best result: f-r (6.57x)
STM overheads: 15.3%
–Revalidate R/W
–Acquire/release locks
K-means specific
–Computation between updates is quite low

