
1 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures
M. Aater Suleman*, Onur Mutlu†, Moinuddin K. Qureshi‡, Yale N. Patt*
*The University of Texas at Austin, †Carnegie Mellon University, ‡IBM Research

2 Background
To leverage CMPs:
– Programs must be split into threads
Mutual Exclusion:
– Threads are not allowed to update shared data concurrently
Accesses to shared data are encapsulated inside critical sections
Only one thread can execute a critical section at a given time
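As a minimal sketch of the mutual exclusion rule on this slide, the Python snippet below wraps the update to a shared counter in a critical section guarded by a lock. The names `SharedCounter` and `run_threads` are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
import threading


class SharedCounter:
    """Shared data whose updates are encapsulated in a critical section."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Critical section: only one thread at a time may hold the lock.
        with self._lock:
            self.value += 1


def run_threads(num_threads, increments_per_thread):
    """Start num_threads workers that each increment the shared counter."""
    counter = SharedCounter()

    def worker():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

Because every update goes through the lock, the final count equals the number of threads times the increments per thread, at the cost of serializing the updates.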

3 Example of Critical Section from MySQL
[Figure: list of open tables A to E shared by Thread 0 to Thread 3. Thread 3: OpenTables(D, E). Thread 2: CloseAllTables()]

4 Example Critical Section from MySQL
[Figure: list of open tables A to E with per-table numeric labels]

5 End of Transaction:
LOCK_open→Acquire()
foreach (table opened by thread)
    if (table.temporary)
        table.close()
LOCK_open→Release()

6 Contention for Critical Sections
[Figure: timelines of Thread 1 to Thread 4 over t1 to t7, marking Critical Section, Parallel, and Idle time, shown for the baseline and for critical sections executing 2x faster]
Accelerating critical sections helps not only the thread executing the critical section but also the waiting threads
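The effect on waiting threads can be sketched with a simple back-of-the-envelope model. It assumes that all threads finish their parallel work at the same moment and then enter the critical section one after another; the function names and the sample numbers (4 threads, 10 time units of parallel work, 4 time units per critical section) are hypothetical and do not reproduce the slide's figure.

```python
def finish_times(num_threads, parallel_time, cs_time):
    """Time at which each thread leaves the critical section.

    Assumes every thread reaches the lock at parallel_time and the
    threads are then served strictly one after another.
    """
    return [parallel_time + (i + 1) * cs_time for i in range(num_threads)]


def total_idle(num_threads, cs_time):
    """Total time spent waiting for the lock, summed over all threads."""
    return sum(i * cs_time for i in range(num_threads))


baseline = finish_times(4, 10, 4)      # [14, 18, 22, 26]
accelerated = finish_times(4, 10, 2)   # [12, 14, 16, 18]
```

In this model, halving the critical section time also halves the total idle time (from 24 to 12 units), so the last thread finishes at 18 instead of 26 even though its own critical section shrank by only 2 units.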

7 Impact of Critical Sections on Scalability
Contention for critical sections increases with the number of threads and limits scalability
[Figure: MySQL (oltp-1), speedup vs. chip area (cores)]

8 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

9 The Asymmetric Chip Multiprocessor (ACMP)
Provide one large core and many small cores
Execute parallel part on small cores for high throughput
Accelerate serial part using the large core
[Figure: ACMP approach, one large core alongside Niagara-like small cores]

10 Conventional ACMP
EnterCS()
PriorityQ.insert(…)
LeaveCS()
1. P2 encounters a critical section
2. Sends a request for the lock
3. Acquires the lock
4. Executes the critical section
5. Releases the lock
[Figure: cores P1 to P4 on the on-chip interconnect, with the core executing the critical section highlighted]

11 Accelerating Critical Sections (ACS)
Accelerate Amdahl’s serial part and critical sections using the large core
[Figure: ACMP approach, Niagara-like small cores plus a large core with a Critical Section Request Buffer (CSRB)]

12 Accelerated Critical Sections (ACS)
EnterCS()
PriorityQ.insert(…)
LeaveCS()
1. P2 encounters a critical section
2. P2 sends CSCALL request to CSRB
3. P1 executes the critical section
4. P1 sends CSDONE signal
[Figure: cores P1 to P4 and the Critical Section Request Buffer (CSRB) on the on-chip interconnect, with the core executing the critical section highlighted]

13 Architecture Overview
ISA extensions
– CSCALL LOCK_ADDR, TARGET_PC
– CSRET LOCK_ADDR
Compiler/Library inserts CSCALL/CSRET
On a CSCALL, the small core:
– Sends a CSCALL request to the large core
  Arguments: Lock address, Target PC, Stack Pointer, Core ID
– Stalls and waits for CSDONE
Large Core
– Critical Section Request Buffer (CSRB)
– Executes the critical section and sends CSDONE to the requesting core
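The CSCALL/CSDONE handshake can be loosely illustrated in software. The sketch below is only an analogy, not the hardware mechanism: a dedicated server thread stands in for the large core, a `queue.Queue` stands in for the CSRB, and a `threading.Event` stands in for the CSDONE signal. The class name `CriticalSectionServer` and its methods are hypothetical.

```python
import queue
import threading


class CriticalSectionServer:
    """Software stand-in for the large core and its request buffer."""

    def __init__(self):
        self._requests = queue.Queue()  # plays the role of the CSRB
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def _serve(self):
        while True:
            item = self._requests.get()
            if item is None:            # shutdown sentinel
                break
            func, args, done = item
            try:
                func(*args)             # execute the critical section
            finally:
                done.set()              # analogous to sending CSDONE

    def cscall(self, func, *args):
        """Ship a critical section to the server and stall until it is done."""
        done = threading.Event()
        self._requests.put((func, args, done))
        done.wait()                     # the requester stalls, like the small core

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()
```

A worker would call, for example, `server.cscall(shared_list.append, item)` instead of taking a lock itself. Since only the server thread ever touches the shared data, the data stays local to one executor, which mirrors the locality argument made for ACS. Note that this toy version serializes every request regardless of which lock it would have used.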

14 False Serialization
ACS can serialize independent critical sections
Selective Acceleration of Critical Sections (SEL)
– Saturating counters to track false serialization
[Figure: CSCALL (A) and CSCALL (B) from the small cores queued in the Critical Section Request Buffer (CSRB) on their way to the large core]
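The slide names saturating counters but does not give the update rule, so the following is only a sketch of how such a policy might look. Everything specific here is an assumption made for illustration: the 4-bit counter width, the threshold of 8, the rule of adding one per pending request to a different lock, and the names `SaturatingCounter` and `SelectiveAcceleration`.

```python
class SaturatingCounter:
    """Counter that clamps at 0 and at 2**bits - 1 instead of wrapping."""

    def __init__(self, bits=4):
        self.max_value = (1 << bits) - 1
        self.value = 0

    def increment(self, amount=1):
        self.value = min(self.max_value, self.value + amount)

    def decrement(self, amount=1):
        self.value = max(0, self.value - amount)


class SelectiveAcceleration:
    """Hypothetical policy: stop accelerating a lock that often waits
    behind requests for other, independent locks."""

    def __init__(self, bits=4, threshold=8):
        self.bits = bits
        self.threshold = threshold
        self.counters = {}

    def on_request(self, lock_addr, pending_lock_addrs):
        """Update the counter for lock_addr given the locks already queued."""
        counter = self.counters.setdefault(lock_addr, SaturatingCounter(self.bits))
        others = sum(1 for addr in pending_lock_addrs if addr != lock_addr)
        if others:
            counter.increment(others)   # waited behind unrelated critical sections
        else:
            counter.decrement()         # no false serialization observed

    def should_accelerate(self, lock_addr):
        counter = self.counters.get(lock_addr)
        return counter is None or counter.value < self.threshold
```

Under these assumptions, a critical section guarded by lock A that repeatedly finds requests for lock B queued ahead of it eventually crosses the threshold and would fall back to conventional locking on its own core.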

15 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

16 Performance Tradeoffs
Fewer threads vs. accelerated critical sections
– Accelerating critical sections offsets loss in throughput
– As the number of cores (threads) on chip increases:
  – Fractional loss in parallel performance decreases
  – Increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality
– ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data
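The claim that the fractional loss in parallel performance shrinks with chip size can be checked with simple arithmetic. The sketch below uses the area ratio stated in the methodology slides (one large core occupies the area of 4 small cores) and assumes, as a simplification, that every small core contributes equal parallel throughput and that the large core contributes none while it serves critical sections. The function name is hypothetical.

```python
def parallel_throughput_loss(chip_area, large_core_area=4):
    """Fraction of small-core throughput given up to fit one large core.

    chip_area is measured in small-core equivalents.
    """
    small_cores_scmp = chip_area
    small_cores_acmp = chip_area - large_core_area
    return (small_cores_scmp - small_cores_acmp) / small_cores_scmp


loss_16 = parallel_throughput_loss(16)   # 12 small cores instead of 16 -> 0.25
loss_32 = parallel_throughput_loss(32)   # 28 small cores instead of 32 -> 0.125
```

Doubling the chip area from 16 to 32 small-core equivalents halves the fractional loss, from 25% to 12.5%, while more threads also mean more contention for the accelerated critical sections.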

17 Cache misses for private data
Puzzle Benchmark: PriorityHeap.insert(NewSubProblems)
Private Data: NewSubProblems
Shared Data: The priority heap

18 Performance Tradeoffs
Fewer threads vs. accelerated critical sections
– Accelerating critical sections offsets loss in throughput
– As the number of cores (threads) on chip increases:
  – Fractional loss in parallel performance decreases
  – Increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality
– ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data
– Cache misses reduce if shared data > private data
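The last bullet can be read as a first-order miss count. The sketch below assumes a deliberately crude model: under conventional locking, each shared cache line touched in the critical section was last used by another core and misses, while the thread's private lines hit; under ACS the situation is reversed, because shared lines stay at the large core and private lines must be fetched from the requesting small core. The function names and line counts are hypothetical.

```python
def misses_per_critical_section(shared_lines, private_lines, accelerated):
    """First-order cache miss count for one critical section execution."""
    if accelerated:
        return private_lines   # shared data is already at the large core
    return shared_lines        # private data is already at the small core


def acs_reduces_misses(shared_lines, private_lines):
    """True when ACS incurs fewer misses than conventional locking."""
    return (misses_per_critical_section(shared_lines, private_lines, True)
            < misses_per_critical_section(shared_lines, private_lines, False))
```

Under this model, a critical section that touches 8 shared lines and 2 private lines incurs fewer misses with ACS, whereas one that touches 2 shared lines and 8 private lines incurs more.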

19 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

20 Experimental Methodology
SCMP
– All small cores (Niagara-like)
– Conventional locking
ACMP
– One large core (equal in area to 4 small cores) plus Niagara-like small cores
– Conventional locking
ACS
– ACMP with a CSRB
– Accelerates critical sections

21 Experimental Methodology
Workloads
– 12 critical section intensive applications from various domains
– 7 use coarse-grain locks and 5 use fine-grain locks
Simulation parameters:
– x86 cycle accurate processor simulator
– Large core: Similar to Pentium-M with 2-way SMT. 2 GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage
– Small core: Similar to Pentium 1, 2 GHz, in-order, 2-wide issue, 5-stage
– Private 32 KB L1, private 256 KB L2, 8 MB shared L3
– On-chip interconnect: Bi-directional ring

22 Workloads with Coarse-Grain Locks
Equal-area comparison
Number of threads = Best threads
Chip Area = 16 cores
– SCMP = 16 small cores
– ACMP/ACS = 1 large and 12 small cores
Chip Area = 32 small cores
– SCMP = 32 small cores
– ACMP/ACS = 1 large and 28 small cores
[Figure: bar charts for the two chip areas; the value labels 210, 150, 210, 150 survive from the charts]

23 Workloads with Fine-Grain Locks
Equal-area comparison
Number of threads = Best threads
Chip Area = 16 cores
– SCMP = 16 small cores
– ACMP/ACS = 1 large and 12 small cores
Chip Area = 32 small cores
– SCMP = 32 small cores
– ACMP/ACS = 1 large and 28 small cores

24 Equal-Area Comparisons
Number of threads = No. of cores
[Figure: speedup over a small core vs. chip area (small cores) for SCMP, ACMP, and ACS. Panels: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache]

25 ACS on Symmetric CMP
Majority of benefit is from large core

26 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

27 Related Work
Improving locality of shared data by thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
– ACS not only improves locality but also uses a large core to accelerate critical section execution
Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
– ACS accelerates not only Amdahl’s bottleneck but also critical sections
Remote procedure calls (Birrell+)
– ACS is for critical sections among shared memory cores

28 Hiding Latency of Critical Sections
Transactional memory (Herlihy+)
ACS does not require code modification
Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+)
– Hide critical section latency by increasing concurrency
  ACS reduces latency of each critical section
– Overlaps execution of critical sections with no data conflicts
  ACS accelerates ALL critical sections
– Does not improve locality of shared data
  ACS improves locality of shared data
→ ACS outperforms TLR (Rajwar+) by 18% (details in paper)

29 Conclusion
Critical sections reduce performance and limit scalability
Accelerate critical sections by executing them on a powerful core
ACS reduces average execution time by:
– 34% compared to an equal-area SCMP
– 23% compared to an equal-area ACMP
ACS improves scalability of 7 of the 12 workloads


