Thread Criticality for Power Efficiency in CMPs
Khairul Kabir
Nov. 3rd, 2009
ECE 692 Topic Presentation

Why Thread Criticality Prediction?
 Critical thread
−The one with the longest completion time in the parallel region
−(Figure: instructions executed by threads T0–T3, with D-cache misses, I-cache misses, and stalls causing the imbalance)
 Problems
−Performance degradation
−Energy inefficiency
 Sources of variability
−Algorithm, process variation, thermal emergencies, etc.
 Purpose
−Load balancing for performance improvement
−Energy optimization using DVFS
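The definition above can be sketched in a few lines. This is an illustrative toy, not the papers' hardware: the per-thread completion times are hypothetical numbers chosen to show how the critical thread and the other threads' barrier stall (slack) fall out of the definition.

```python
# Sketch: the critical thread is the one with the longest completion time
# in a parallel region (hypothetical per-thread times, in cycles).
completion_cycles = {"T0": 900, "T1": 1400, "T2": 1100, "T3": 950}

critical = max(completion_cycles, key=completion_cycles.get)
barrier_time = completion_cycles[critical]

# Every other thread stalls at the barrier for the difference (its slack).
slack = {t: barrier_time - c for t, c in completion_cycles.items()}
print(critical)     # T1
print(slack["T0"])  # 500
```

The slack values are exactly the stall times that load balancing tries to fill with useful work, or that DVFS tries to eliminate by slowing the non-critical threads down.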

Related Work
 Instruction criticality [Fields et al., Tune et al., etc.]
−Identify critical instructions
 Thrifty barrier [Li et al. 2005]
−Faster cores are transitioned into a low-power mode based on a prediction of barrier stall time
 DVFS for energy efficiency at barriers [Liu et al. 2005]
−The faster core tracks its waiting time and predicts the DVFS setting for the next execution of the same parallel loop
 Meeting points [Cai et al. 2008]
−DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops)

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors
Abhishek Bhattacharjee, Margaret Martonosi
Dept. of Electrical Engineering, Princeton University

What is This Paper About?
 Thread criticality predictor (TCP) design
−Methodology
−Identify architectural events impacting thread criticality
−Introduce basic TCP hardware
 Thread criticality predictor uses
−Apply to Intel's Threading Building Blocks (TBB)
−Apply to energy efficiency in barrier-based programs

Thread Criticality Prediction Goals
 Design goals
1. Accuracy
2. Low-overhead implementation: simple hardware (allow software policies to be built on top)
3. One predictor, many uses
 Design decisions
1. Find a suitable architectural metric
2. History-based local approach versus thread-comparative approach
3. This paper: TBB and DVFS; other uses: shared last-level cache management, SMT and memory priority, …

Methodology
 Evaluations on a range of architectures: high-performance and embedded domains
−GEMS simulator: evaluates performance on architectures representative of the high-performance domain
−ARM simulator: evaluates the performance benefits of TCP-guided task stealing in Intel's TBB
−FPGA-based emulator: assesses energy savings from TCP-guided DVFS

Infrastructure   Domain                                      System                       Cores          Caches
GEMS simulator   High-performance, wide-issue, out-of-order  16-core CMP with Solaris 10  4-issue SPARC  32KB L1, 4MB L2
ARM simulator    Embedded, in-order                          4-32 core CMP                2-issue ARM    32KB L1, 4MB L2
FPGA emulator    Embedded, in-order                          4-core CMP with Linux        SPARC          4KB I-cache, 8KB D-cache

Architectural Metrics
 History-based TCP
−Requires repetitive barrier behavior
−Information local to the core: no communication
−Problem for in-order pipelines: variable IPCs
 Inter-core (thread-comparative) TCP metrics
−Instruction count
−Cache misses
−Control flow changes
−Translation lookaside buffer (TLB) misses

Thread-Comparative Metrics for TCP: Instruction Counts

Thread-Comparative Metrics for TCP: L1 D-Cache Misses

Thread-Comparative Metrics for TCP: L1 I- and D-Cache Misses

Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

Basic TCP Hardware
 TCP hardware components
−Per-core criticality counters
−Interval bound register

Basic TCP Hardware (diagram)
−Four cores, each with private L1 I- and D-caches, share an L2 cache; the TCP hardware and its per-core criticality counters sit at the L2 controller, where cache-miss information is already visible
−Per-core criticality counters track poorly cached, slow threads: a core's counter advances while it stalls on L1 or L2 misses and the other cores continue retiring instructions
−Criticality counters are periodically refreshed using the interval bound register
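A minimal software sketch of the counter mechanism described above. The penalty weights and the interval length are illustrative assumptions, not the paper's calibrated values; the point is only the shape of the logic: misses bump the stalled core's counter by a cost proportional to miss latency, and the interval bound register periodically clears stale history.

```python
# Sketch of per-core criticality counters (weights and interval are assumed).
L1_PENALTY, L2_PENALTY = 1, 10   # illustrative miss-cost weights
INTERVAL_BOUND = 1000            # instructions between counter refreshes

class TCPCounters:
    def __init__(self, n_cores):
        self.counters = [0] * n_cores
        self.insts = 0

    def record_miss(self, core, level):
        # A miss increments the stalled core's counter by its penalty weight.
        self.counters[core] += L2_PENALTY if level == 2 else L1_PENALTY

    def retire(self, n):
        # Interval bound register: refresh all counters periodically.
        self.insts += n
        if self.insts >= INTERVAL_BOUND:
            self.counters = [0] * len(self.counters)
            self.insts = 0

    def most_critical(self):
        # The core with the highest counter is the poorly cached, slow one.
        return max(range(len(self.counters)), key=self.counters.__getitem__)

tcp = TCPCounters(4)
tcp.record_miss(3, level=2)
tcp.record_miss(1, level=1)
print(tcp.most_critical())  # 3
```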

TBB Task Stealing & Thread Criticality
 The TBB dynamic scheduler distributes tasks
 Each thread maintains a software queue filled with tasks
−Empty queue: the thread "steals" a task from another thread's queue
 Approach 1: default TBB uses random task stealing
−More failed steals at higher core counts  poor performance
 Approach 2: occupancy-based task stealing [Contreras, Martonosi, 2008]
−Steal based on the number of items in the software queue
−Must track and compare maximum occupancy counts

TCP-Guided TBB Task Stealing (diagram)
−On a steal request, the TCP control logic scans the criticality counters for the maximum value and initiates the steal from the most critical thread's queue
−Modest message overhead: one L2 access latency
−Scalable: 14-bit criticality counters  114 bytes for 64 cores
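The steal-victim selection above reduces to a max-scan over the criticality counters. A toy sketch, with hypothetical queues and counter values (the real mechanism lives in TCP hardware at the L2 controller, not in software):

```python
# Sketch of TCP-guided task stealing: an idle thief asks the TCP, which
# scans the criticality counters and directs the steal at the most
# critical (slowest) core's queue. Queues and counter values are made up.
from collections import deque

queues = [deque(["t0"]), deque(), deque(["t2"]), deque(["t6", "t7"])]
criticality = [3, 0, 5, 11]   # core 3 has the highest counter

def tcp_steal(thief):
    # Scan for the maximum counter among other cores that still have work.
    victims = [c for c in range(len(queues)) if c != thief and queues[c]]
    victim = max(victims, key=lambda c: criticality[c])
    return victim, queues[victim].pop()

victim, task = tcp_steal(thief=1)
print(victim, task)  # 3 t7
```

Stealing from the most critical thread offloads exactly the queue whose owner would otherwise finish last, which is why this beats both random and occupancy-based victim selection in the paper's results.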

TCP-Guided TBB Task Stealing
 TBB with random task stealing
 TBB with TCP-guided task stealing

TCP-Guided TBB Performance
 % performance improvement versus random task stealing
−Avg. improvement over random (32 cores) = 21.6%
−Avg. improvement over occupancy (32 cores) = 13.8%

Adapting TCP for Energy Efficiency in Barrier-Based Programs
 Approach: DVFS non-critical threads to eliminate barrier stall time
−(Figure: an L2 D-cache miss makes T1 critical, so T0, T2, and T3 are scaled down via DVFS)
 Challenges
−Relative criticalities
−Misprediction costs
−DVFS overheads

Hardware and Algorithm for TCP-Guided DVFS
 TCP hardware components
−Criticality counters
−SST: Switching Suggestion Table
−SCT: Suggestion Confidence Table
−Interval bound register
 TCP-guided DVFS algorithm: two key steps
1. Use the SST to translate criticality counter values into thread criticalities
−A core is considered when its criticality counter value is above a predefined threshold T and it is running at the nominal frequency
−The counter value is matched against SST entries; a frequency switch is suggested if the matching SST entry differs from the current frequency
2. Feed the suggested target frequency from the SST into the SCT
−The SCT assesses confidence in the SST's DVFS suggestion
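The two steps above can be sketched as a lookup followed by a confidence gate. The SST contents, the threshold T, and the confidence rule here are illustrative assumptions (the paper pre-computes SST values per platform and the slides do not give the exact SCT policy); only the control flow mirrors the description:

```python
# Sketch of TCP-guided DVFS: SST lookup, then SCT confidence check.
# SST entries, threshold, and confidence rule are assumed for illustration.
T = 50                                     # criticality counter threshold
SST = [(0, 1.0), (100, 0.8), (200, 0.6)]   # (counter floor, suggested freq ratio)
confidence = {}                            # SCT: (core, target) -> count

def sst_lookup(counter):
    # Step 1: match the counter against SST entries (highest floor <= counter).
    freq = SST[0][1]
    for floor, f in SST:
        if counter >= floor:
            freq = f
    return freq

def suggest(core, counter, current_freq):
    if counter < T or current_freq != 1.0:
        return current_freq        # only nominal-frequency cores above T qualify
    target = sst_lookup(counter)
    if target == current_freq:
        return current_freq
    # Step 2: SCT gate — act only after the same suggestion repeats.
    confidence[(core, target)] = confidence.get((core, target), 0) + 1
    return target if confidence[(core, target)] >= 2 else current_freq

print(suggest(0, 120, 1.0))  # 1.0  (first suggestion: confidence not yet built)
print(suggest(0, 120, 1.0))  # 0.8  (repeated suggestion: switch accepted)
```

The confidence gate is what suppresses the "prediction noise" the next slide quantifies: a single noisy counter reading cannot trigger a costly frequency switch.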

TCP-Guided DVFS: Effect of the Criticality Counter Threshold
−Lowest bar: time in the pre-calculated, correct DVFS state, averaged across all barrier instances
−Central bar: learning time taken until the correct DVFS state is first reached
−Upper bar: prediction noise, i.e., time spent in erroneous DVFS states after having first arrived at the correct one
 A low T increases susceptibility to temporal noise
 Without good suggestion confidence, too many frequency changes occur and performance overhead results

TCP for DVFS: Results
 Average 15% energy savings
 Benchmarks with more load imbalance generally save more energy

Conclusions
 Goal 1: Accuracy
−Accurate TCPs based on simple cache statistics
 Goal 2: Low-overhead hardware
−Scalable per-core criticality counters used
−TCP placed in a central location where cache information is already available
 Goal 3: Versatility
−TBB improved by 13.8% over the best known approach at 32 cores
−DVFS used to achieve 15% energy savings
−Two uses shown, many others possible…

Meeting Points: Using Thread Criticality to Adapt Multicore Hardware to Parallel Regions
Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Chaparro, Antonio González

Introduction
 Proposed applications
−Thread delaying for multi-core systems: saves energy by scaling down the frequency and voltage of the cores running non-critical threads
−Thread balancing for simultaneous multithreaded (SMT) cores: improves overall performance by giving higher priority to the critical thread
 Meeting point thread characterization
−Identifies the critical thread of a single multithreaded application
−Estimates the slack of the non-critical threads

How To Find Critical Threads Dynamically?
 Example: a parallelized loop from PageRank (lz77 method)
 Observations
1. The code is already written to achieve workload balance, but imbalance still exists: CPU1 is slower than CPU0
2. Reasons for imbalance: (i) different cache misses, (ii) different control paths

Identification of Critical Threads
 Identification technique
−A thread-private counter is incremented each time the thread passes a meeting point
−The most critical thread is the one with the smallest counter
−A thread's slack is estimated as the difference between its counter and the critical (slowest) thread's counter
 Insertion of meeting points
−Placed at a point in a parallel region that is visited by all threads
−Can be done by the hardware, the compiler, or the programmer
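The identification technique above is a min-scan over per-thread meeting-point counts. A toy sketch with hypothetical counter values:

```python
# Sketch of meeting-point characterization: each thread increments a private
# counter at the meeting point; the smallest count marks the critical thread,
# and each thread's slack is its lead over that thread (counts are made up).
meeting_counts = {"T0": 48, "T1": 40, "T2": 45, "T3": 52}

critical = min(meeting_counts, key=meeting_counts.get)
slack = {t: c - meeting_counts[critical] for t, c in meeting_counts.items()}
print(critical)     # T1
print(slack["T3"])  # 12
```

Note the inversion relative to the first paper's criticality counters: here the *smallest* count wins, because a thread that has completed fewer iterations is the one lagging behind.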

Thread Delaying
 Make non-critical threads run at a lower frequency/voltage level
−All threads then arrive at the barrier at the same time
 Alternatively, the CPUs of non-critical threads can simply be put into deep sleep at the barrier
−Consumes almost zero energy while waiting
−But this is not the most energy-efficient approach to deal with workload imbalance

Thread Delaying (diagrams)
−Threads 1–4 (A, B, C, D) run toward a barrier; the area under each thread (frequency × time) represents the energy needed to execute its instructions
−Energy = Activity × Capacitance × Voltage²
−Proposal: reduce voltage when executing parallel threads; delay threads arriving early to the barrier
−Lowering the frequency and voltage of the early-arriving threads shrinks their energy area; the difference is the energy saved

Implementation of Thread Delaying
 HISTORY-TABLE
−An entry for each possible frequency level
−Each entry is a 2-bit up/down saturating counter
 MP-COUNTER-TABLE
−Contains as many entries as there are cores in the processor
−Each entry is a 32-bit counter
−Kept consistent among all cores
 Implementation
−Each core broadcasts its counter value on every 10th execution of the meeting point instruction
−The thread delaying algorithm is then invoked
−The history table is updated
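The 2-bit up/down saturating counter is the only non-obvious piece of the history table. A minimal sketch; the update policy and frequency levels shown are assumptions for illustration (the slides do not spell out the exact rule for nudging a level's counter):

```python
# Sketch of the thread-delaying history table: one 2-bit saturating counter
# per frequency level. Update policy and levels are illustrative assumptions.
class SatCounter2:
    def __init__(self):
        self.v = 0                    # saturates in the range 0..3
    def up(self):
        self.v = min(3, self.v + 1)
    def down(self):
        self.v = max(0, self.v - 1)

history = {f: SatCounter2() for f in (0.6, 0.8, 1.0)}  # assumed freq. levels

def update(freq, arrived_early):
    # Assumed rule: reward a level if the delayed thread still arrived early
    # (room to slow further), penalize it if the thread became critical.
    history[freq].up() if arrived_early else history[freq].down()

update(0.8, arrived_early=True)
update(0.8, arrived_early=True)
print(history[0.8].v)  # 2
```

Saturating at 2 bits keeps the table cheap and lets the predictor forget stale behavior within a few barrier instances, which matters when the loop's imbalance shifts between invocations.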

Thread Balancing
 Speeding up a parallel application running more than one thread
−Two-way in-order SMT with an issue bandwidth of two instructions per cycle
−If both threads have ready instructions, allow both of them to issue
−If only one thread has ready instructions, it can issue up to two instructions per cycle
−If the threads belong to the same parallel application, prioritize the critical thread instead
 Thread balancing
−Identify the critical thread
−Give the critical thread more priority in the issue logic

Thread Balancing Logic
 Targeted at 2-way SMT
−Imbalance hardware logic: identifies the critical thread
 Issue prioritization logic
−If a thread is critical and has two ready instructions, it is allowed to issue both, regardless of how many ready instructions the non-critical thread has
−Otherwise, the base issue policy is applied
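The prioritization rule above can be sketched as a small issue-slot allocator. Thread names and ready counts are hypothetical, and the base policy shown (one slot per ready thread when both contend) is an assumed simplification of the in-order SMT baseline:

```python
# Sketch of 2-way SMT issue prioritization: the critical thread with two
# ready instructions takes the full issue bandwidth; otherwise the assumed
# base policy shares the two slots between threads with ready instructions.
ISSUE_WIDTH = 2

def issue(ready, critical):
    # ready: {thread: number of ready instructions}; critical: thread id or None
    if critical is not None and ready.get(critical, 0) >= 2:
        return {critical: 2}          # critical thread gets the whole width
    slots, issued = ISSUE_WIDTH, {}
    for t, n in ready.items():
        # With two contending threads each gets at most one slot; a lone
        # thread may take both.
        take = min(n, 1 if len(ready) > 1 else slots, slots)
        if take:
            issued[t] = take
            slots -= take
    return issued

print(issue({"A": 2, "B": 2}, critical="A"))   # {'A': 2}
print(issue({"A": 1, "B": 2}, critical=None))  # {'A': 1, 'B': 1}
```

The point of the rule is that bandwidth taken from the non-critical thread only lengthens slack it already had, while the critical thread's progress directly shortens the parallel region.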

Simulation Framework and Benchmarks
 SoftSDV for Intel64/IA32 processors
−Simulates multithreaded primitives, including locks, synchronization operations, shared memory, and events
 RMS (Recognition, Mining, and Synthesis) benchmarks
−Highly data-intensive and highly parallel (computer vision, data mining, etc.)
−Benchmarks are parallelized with pthreads or OpenMP
−99% of total execution is parallel for all benchmarks except FIMI (28% coverage)

Performance Results for Thread Delaying
 The baseline is aggressive
−Every core runs at full speed and stops when its work is completed; once a core stops, it consumes zero power
 Saves 4% to 44% energy
−Energy savings come from the large frequency decreases on non-critical threads

Performance Results for Thread Balancing
 The baseline is aggressive
−Every core runs at full speed and stops when its work is completed
 Performance benefit ranges from 1% to 20%
−The benefit correlates with the level of imbalance

Conclusions
 Meeting point thread characterization dynamically estimates the criticality of the threads in a parallel execution
 Thread delaying combines per-core DVFS with meeting point thread characterization to reduce energy consumption on non-critical threads
 Thread balancing gives the critical thread higher priority in the issue queue of an SMT core

Comparison of the Two Papers

                                 Thread criticality predictor                               Meeting points
Target                           A range of parallelization schemes beyond parallel loops   Parallel loops
Critical thread identification   Cache behavior (L1 and L2 cache misses)                    Meeting point counts
Performance balancing method     Task stealing from the critical thread                     Prioritizing the critical thread
Extra hardware support needed    Yes                                                        Yes
Energy saving technique          DVFS                                                       DVFS
Benchmarks                       SPLASH-2 and PARSEC                                        RMS
Evaluation                       ARM simulator, GEMS simulator, FPGA emulator               SoftSDV

Critiques
 Paper 1
−It does not explain how the SST values are calculated
−The accuracy of barrier-based DVFS depends on those pre-calculated SST values
 Paper 2
−The total number of times each thread visits the meeting point must be roughly the same, which means meeting point thread characterization cannot handle variable loop iteration sizes
−It works well for parallel loops but fails for large parallel regions that contain no parallel loop
−It might not always be feasible for the hardware to detect the parallel loop and insert the meeting point

Thank you!