
1 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures M. Aater Suleman* Onur Mutlu† Moinuddin K. Qureshi‡ Yale N. Patt* *The University of Texas at Austin †Carnegie Mellon University ‡IBM Research

2 Background
To leverage CMPs:
– Programs must be split into threads
Mutual Exclusion:
– Threads are not allowed to update shared data concurrently
– Accesses to shared data are encapsulated inside critical sections
– Only one thread can execute a critical section at a given time
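The idea above can be sketched with a minimal example (the names are illustrative, not from MySQL): a lock turns the update of shared data into a critical section, so only one thread executes it at a time.

```python
import threading

# Shared data protected by a critical section (illustrative names).
open_tables = []
lock = threading.Lock()

def open_table(name):
    # Only one thread may update the shared list at a time.
    with lock:                       # enter critical section
        open_tables.append(name)     # access to shared data
    # leaving the 'with' block releases the lock

threads = [threading.Thread(target=open_table, args=(f"t{i}",))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(open_tables))  # ['t0', 't1', 't2', 't3']
```

Without the lock, concurrent appends to shared state could interleave; with it, each update runs to completion before the next begins.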

3 Example of Critical Section from MySQL
[Figure: four threads (0–3) operating on the shared list of open tables (A–E); Thread 3 calls OpenTables(D, E) while Thread 2 calls CloseAllTables()]

4 Example Critical Section from MySQL
[Figure: the shared list of open tables (A–E) after the operations above]

5 End of Transaction:
LOCK_open→Acquire()
foreach (table opened by thread)
    if (table.temporary)
        table.close()
LOCK_open→Release()

6 Contention for Critical Sections
[Figure: execution timelines t1–t7 for four threads, each divided into critical-section, parallel, and idle time; when critical sections execute 2x faster, total execution time shrinks]
Accelerating critical sections not only helps the thread executing the critical sections, but also the waiting threads

7 Impact of Critical Sections on Scalability
Contention for critical sections increases with the number of threads and limits scalability
[Figure: speedup vs. chip area (cores) for MySQL (oltp-1)]

8 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

9 The Asymmetric Chip Multiprocessor (ACMP)
– Provide one large core and many small cores
– Execute the parallel part on small cores for high throughput
– Accelerate the serial part using the large core
[Figure: ACMP approach — one large core alongside many Niagara-like small cores]

10 Conventional ACMP
1. P2 encounters a Critical Section
2. Sends a request for the lock
3. Acquires the lock
4. Executes Critical Section
5. Releases the lock
[Figure: cores P1–P4 on the on-chip interconnect; P2 executes EnterCS(), PriorityQ.insert(…), LeaveCS()]

11 Accelerating Critical Sections (ACS)
Accelerate Amdahl’s serial part and critical sections using the large core
[Figure: ACMP with a Critical Section Request Buffer (CSRB) attached to the large core]

12 Accelerated Critical Sections (ACS)
1. P2 encounters a Critical Section
2. P2 sends CSCALL Request to CSRB
3. P1 executes Critical Section
4. P1 sends CSDONE signal
[Figure: small cores P2–P4 and large core P1 on the on-chip interconnect; the critical section EnterCS(), PriorityQ.insert(…), LeaveCS() runs on P1 via the CSRB]

13 Architecture Overview
ISA extensions:
– CSCALL LOCK_ADDR, TARGET_PC
– CSRET LOCK_ADDR
Compiler/Library inserts CSCALL/CSRET
On a CSCALL, the small core:
– Sends a CSCALL request to the large core (arguments: lock address, target PC, stack pointer, core ID)
– Stalls and waits for CSDONE
Large core:
– Critical Section Request Buffer (CSRB)
– Executes the critical section and sends CSDONE to the requesting core
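As a software analogy only (ACS is a hardware mechanism; the queue, thread, and function names below are invented for illustration), the CSCALL/CSDONE protocol resembles shipping each critical section to a dedicated server thread that plays the role of the large core and its CSRB:

```python
import threading
import queue

# Software stand-in for the Critical Section Request Buffer (CSRB).
csrb = queue.Queue()

def large_core():
    # The "large core" drains the CSRB, executing one critical
    # section at a time and signaling completion (CSDONE).
    while True:
        func, args, done = csrb.get()
        if func is None:            # shutdown sentinel
            break
        func(*args)                 # execute the critical section
        done.set()                  # CSDONE signal

def cscall(func, *args):
    # The "small core" sends a CSCALL request, then stalls
    # until the CSDONE signal arrives.
    done = threading.Event()
    csrb.put((func, args, done))
    done.wait()

counter = [0]
def increment():
    counter[0] += 1                 # critical section body

server = threading.Thread(target=large_core)
server.start()
for _ in range(100):
    cscall(increment)
csrb.put((None, None, None))        # stop the server thread
server.join()
print(counter[0])                   # 100
```

Note that the critical section body needs no lock here: because every request funnels through one thread, executions are serialized by construction, much as ACS serializes critical sections at the large core (which also keeps lock and shared data hot in one cache).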

14 False Serialization
ACS can serialize independent critical sections
Selective Acceleration of Critical Sections (SEL):
– Saturating counters to track false serialization
[Figure: CSRB holding CSCALL(A) and CSCALL(B) requests from small cores, with per-lock saturating counters deciding which requests go to the large core]
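SEL's decision logic can be caricatured in software roughly as follows (the counter width, threshold, and update policy here are assumptions for illustration, not the paper's exact hardware):

```python
# Sketch of SEL's idea: a saturating counter per lock estimates how
# often that lock's critical sections suffer false serialization
# (waiting behind unrelated critical sections in the CSRB).
class SaturatingCounter:
    def __init__(self, bits=4):
        self.value = 0
        self.max = (1 << bits) - 1

    def inc(self, n=1):
        self.value = min(self.value + n, self.max)

    def dec(self, n=1):
        self.value = max(self.value - n, 0)

THRESHOLD = 8      # assumed threshold; above it, stop accelerating
counters = {}      # one saturating counter per lock address

def should_accelerate(lock_addr, waiters_for_other_locks):
    c = counters.setdefault(lock_addr, SaturatingCounter())
    if waiters_for_other_locks:
        # Penalize the lock when its CSCALL queues behind unrelated
        # critical sections: evidence of false serialization.
        c.inc(waiters_for_other_locks)
    else:
        c.dec()
    # Saturated counter => execute this critical section locally
    # on the small core instead of sending a CSCALL.
    return c.value < THRESHOLD
```

A lock that repeatedly waits behind independent critical sections drives its counter up and falls back to conventional local execution; a well-behaved lock decays back toward acceleration.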

15 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

16 Performance Tradeoffs
Fewer threads vs. accelerated critical sections:
– Accelerating critical sections offsets the loss in throughput
– As the number of cores (threads) on chip increases:
  - Fractional loss in parallel performance decreases
  - Increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality:
– ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data

17 Cache Misses for Private Data
PriorityHeap.insert(NewSubProblems)
– Private Data: NewSubProblems
– Shared Data: the priority heap
(Puzzle benchmark)

18 Performance Tradeoffs
Fewer threads vs. accelerated critical sections:
– Accelerating critical sections offsets the loss in throughput
– As the number of cores (threads) on chip increases:
  - Fractional loss in parallel performance decreases
  - Increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality:
– ACS avoids “ping-ponging” of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data:
– Cache misses decrease if shared data > private data

19 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

20 Experimental Methodology
SCMP: all small cores; conventional locking
ACMP: one large core (area-equal to 4 small cores) plus small cores; conventional locking
ACS: ACMP with a CSRB; accelerates critical sections
[Figure: chip layouts of SCMP, ACMP, and ACS built from Niagara-like small cores and one large core]

21 Experimental Methodology
Workloads:
– 12 critical-section-intensive applications from various domains
– 7 use coarse-grain locks and 5 use fine-grain locks
Simulation parameters:
– x86 cycle-accurate processor simulator
– Large core: similar to Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage pipeline
– Small core: similar to Pentium 1; 2 GHz, in-order, 2-wide issue, 5-stage pipeline
– Private 32 KB L1, private 256 KB L2, 8 MB shared L3
– On-chip interconnect: bi-directional ring

22 Workloads with Coarse-Grain Locks
Equal-area comparison; number of threads = best threads
[Figure: speedups for two configurations —
Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores]

23 Workloads with Fine-Grain Locks
Equal-area comparison; number of threads = best threads
[Figure: speedups for two configurations —
Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores]

24 Equal-Area Comparisons
Number of threads = number of cores
[Figure: speedup over a single small core vs. chip area (in small cores) for SCMP, ACMP, and ACS on 12 workloads: (a) ep, (b) is, (c) pagemine, (d) puzzle, (e) qsort, (f) tsp, (g) sqlite, (h) iplookup, (i) oltp-1, (j) oltp-2, (k) specjbb, (l) webcache]

25 ACS on Symmetric CMP
The majority of the benefit comes from the large core

26 Outline
Background
Mechanism
Performance Trade-Offs
Evaluation
Related Work and Summary

27 Related Work
Improving locality of shared data by thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
– ACS not only improves locality but also uses a large core to accelerate critical section execution
Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
– ACS accelerates not only the serial (Amdahl’s) bottleneck but also critical sections
Remote procedure calls (Birrell+)
– ACS is for critical sections among shared-memory cores

28 Hiding Latency of Critical Sections
Transactional memory (Herlihy+)
– ACS does not require code modification
Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+):
– Hide critical section latency by increasing concurrency; ACS instead reduces the latency of each critical section
– Overlap execution of critical sections with no data conflicts; ACS accelerates ALL critical sections
– Do not improve locality of shared data; ACS improves locality of shared data
⇒ ACS outperforms TLR (Rajwar+) by 18% (details in paper)

29 Conclusion
Critical sections reduce performance and limit scalability
Accelerate critical sections by executing them on a powerful core
ACS reduces average execution time by:
– 34% compared to an equal-area SCMP
– 23% compared to an equal-area ACMP
ACS improves scalability of 7 of the 12 workloads

30 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures M. Aater Suleman* Onur Mutlu† Moinuddin K. Qureshi‡ Yale N. Patt* *The University of Texas at Austin †Carnegie Mellon University ‡IBM Research