
Thread Criticality for Power Efficiency in CMPs. Khairul Kabir, Nov. 3rd, 2009. ECE 692 Topic Presentation.


1 Thread Criticality for Power Efficiency in CMPs
Khairul Kabir
Nov. 3rd, 2009
ECE 692 Topic Presentation

2 Why Thread Criticality Prediction?
• Critical thread
−The one with the longest completion time in the parallel region
(figure: per-thread timelines for T0-T3 showing instructions executed, I-cache and D-cache misses, and stall time at the barrier)
• Problems
−Performance degradation
−Energy inefficiency
• Sources of variability
−Algorithm, process variation, thermal emergencies, etc.
• Purpose
−Load balancing for performance improvement
−Energy optimization using DVFS

3 Related Work
• Instruction criticality [Fields et al., Tune et al. 2001, etc.]
−Identify critical instructions
• Thrifty barrier [Li et al. 2005]
−Faster cores transition into a low-power mode based on a prediction of barrier stall time
• DVFS for energy efficiency at barriers [Liu et al. 2005]
−Each faster core tracks its waiting time and predicts the DVFS setting for the next execution of the same parallel loop
• Meeting points [Cai et al. 2008]
−DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops only)

4 Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors
Abhishek Bhattacharjee, Margaret Martonosi
Dept. of Electrical Engineering, Princeton University

5 What Is This Paper About?
• Thread criticality predictor (TCP) design
−Methodology
−Identify architectural events impacting thread criticality
−Introduce basic TCP hardware
• Thread criticality predictor uses
−Apply to Intel's Threading Building Blocks (TBB)
−Apply to energy efficiency in barrier-based programs

6 Thread Criticality Prediction Goals
Design goals
1. Accuracy
2. Low-overhead implementation: simple hardware (allow software policies to be built on top)
3. One predictor, many uses
Design decisions
1. Find a suitable architectural metric
2. History-based local approach versus thread-comparative approach
3. This paper: TBB and DVFS; other uses: shared last-level cache management, SMT and memory priority, ...

7 Methodology
• Evaluations on a range of architectures: high-performance and embedded domains
−GEMS simulator: evaluates performance on architectures representative of the high-performance domain
−ARM simulator: evaluates the performance benefits of TCP-guided task stealing in Intel's TBB
−FPGA-based emulator: assesses energy savings from TCP-guided DVFS

Infrastructure | Domain | System | Cores | Caches
GEMS simulator | High-performance, wide-issue, out-of-order | 16-core CMP with Solaris 10 | 4-issue SPARC | 32KB L1, 4MB L2
ARM simulator | Embedded, in-order | 4-32 core CMP | 2-issue ARM | 32KB L1, 4MB L2
FPGA emulator | Embedded, in-order | 4-core CMP with Linux 2.6 | 1-issue SPARC | 4KB I-cache, 8KB D-cache

8 Architectural Metrics
• History-based TCP
−Requires repetitive barrier behavior
−Information local to the core: no communication
−Problem for in-order pipelines: variable IPCs
• Inter-core (thread-comparative) TCP metrics
−Instruction count
−Cache misses
−Control flow changes
−Translation lookaside buffer (TLB) misses

9 Thread-Comparative Metrics for TCP: Instruction Counts

10 Thread-Comparative Metrics for TCP: L1 D-Cache Misses

11 Thread-Comparative Metrics for TCP: L1 I- and D-Cache Misses

12 Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

13 Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

14 Basic TCP Hardware
• TCP hardware components
−Per-core criticality counters
−Interval bound register

15 Basic TCP Hardware
(animated figure: a 4-core CMP with per-core L1 I- and D-caches and a shared L2; the TCP hardware sits at the L2 controller. An L1 miss increments the missing core's criticality counter by a small amount, while a costlier L2 miss increments it by a larger amount, so the per-core criticality counters track poorly cached, slow threads. The counters are periodically refreshed using the interval bound register.)
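The counter scheme on this slide can be sketched in Python. This is an illustrative model, not the paper's hardware: the miss weights, the interval bound value, and all names are assumptions.

```python
# Sketch of TCP per-core criticality counters (weights are illustrative).
L1_PENALTY = 1    # assumed increment for an L1 miss
L2_PENALTY = 10   # assumed larger increment for a costlier L2 miss

class CriticalityCounters:
    def __init__(self, num_cores, interval_bound=100_000):
        self.counters = [0] * num_cores
        self.interval_bound = interval_bound  # refresh period (interval bound register)
        self.instructions = 0

    def on_l1_miss(self, core):
        self.counters[core] += L1_PENALTY

    def on_l2_miss(self, core):
        self.counters[core] += L2_PENALTY

    def on_instruction(self):
        # Periodically refresh counters so stale history does not dominate.
        self.instructions += 1
        if self.instructions >= self.interval_bound:
            self.counters = [0] * len(self.counters)
            self.instructions = 0

    def most_critical(self):
        # The poorly cached, slow thread accumulates the largest counter.
        return max(range(len(self.counters)), key=lambda c: self.counters[c])
```

Because the counters live at the shared L2 controller, where miss information already arrives, no extra inter-core messages are needed to maintain them.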

16 TBB Task Stealing & Thread Criticality
• The TBB dynamic scheduler distributes tasks
• Each thread maintains a software queue filled with tasks
−Empty queue: the thread "steals" a task from another thread's queue
• Approach 1: default TBB uses random task stealing
−More failed steals at higher core counts → poor performance
• Approach 2: occupancy-based task stealing [Contreras, Martonosi, 2008]
−Steal based on the number of items in the software queue
−Must track and compare maximum occupancy counts

17 TCP-Guided TBB Task Stealing
(animated figure: four cores with software task queues SW Q0-Q3 and a shared L2; on a steal request, the TCP control logic scans the criticality counters for the maximum value and initiates the steal from the most critical thread's queue)
• TCP initiates steals from the critical thread
• Modest message overhead: an L2 access latency
• Scalable: 14-bit criticality counters → 114 bytes of storage at 64 cores
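The steal-victim choice above can be sketched as follows; the queue and counter representations are hypothetical, and in the paper the scan happens in TCP hardware at the L2 controller rather than in software.

```python
def choose_steal_victim(criticality_counters, task_queues, thief):
    """On a failed local dequeue, steal from the most critical thread's queue.

    Scans for the maximum criticality counter among cores that still have
    queued tasks (illustrative model of the hardware scan).
    """
    candidates = [c for c in range(len(task_queues))
                  if c != thief and task_queues[c]]
    if not candidates:
        return None  # nothing to steal anywhere
    victim = max(candidates, key=lambda c: criticality_counters[c])
    return task_queues[victim].pop()  # take a task from the critical thread
```

Compared with random stealing, this directs work away from the slowest thread, which is why failed steals drop at high core counts.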

18 TCP-Guided TBB Task Stealing
• TBB with random task stealing
• TBB with TCP-guided task stealing

19 TCP-Guided TBB Performance
(figure: % performance improvement versus random task stealing, per benchmark and core count)
• Average improvement over random stealing (32 cores): 21.6%
• Average improvement over occupancy-based stealing (32 cores): 13.8%

20 Adapting TCP for Energy Efficiency in Barrier-Based Programs
(figure: threads T0-T3 run to a barrier; T1 suffers an L2 D-cache miss and finishes last, so T1 is critical and T0, T2, and T3 can be slowed with DVFS)
• Approach: DVFS non-critical threads to eliminate barrier stall time
• Challenges:
−Relative criticalities
−Misprediction costs
−DVFS overheads

21 Hardware and Algorithm for TCP-Guided DVFS
• TCP hardware components
−Criticality counters
−SST: Switching Suggestion Table
−SCT: Suggestion Confidence Table
−Interval bound register
• TCP-guided DVFS algorithm: two key steps
1. Use the SST to translate criticality counter values into thread criticalities
−Triggered when a criticality counter value is above a pre-defined threshold T and the core is running at the nominal frequency
−The counter value is matched against SST entries
−A frequency switch is suggested if the matching SST entry differs from the current frequency
2. Feed the suggested target frequency from the SST to the SCT
−The SCT assesses confidence in the SST's DVFS suggestion
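The two-step algorithm above can be sketched as follows. The SST/SCT contents, the bucketing of counter values, and the confidence bar are all illustrative assumptions; the paper's tables hold pre-calculated values.

```python
def counter_bucket(counter, bucket_size=1000):
    # Hypothetical quantization of counter values into SST indices.
    return counter // bucket_size

def dvfs_suggestion(counter, current_freq, sst, sct, threshold, confidence_bar=2):
    """Two-step TCP-guided DVFS sketch (table contents are illustrative).

    1. If a criticality counter exceeds threshold T, look up a target
       frequency in the Switching Suggestion Table (SST).
    2. Gate the switch on the Suggestion Confidence Table (SCT): only act
       once the same suggestion has recurred often enough.
    """
    if counter <= threshold:
        return current_freq
    target = sst.get(counter_bucket(counter), current_freq)
    if target == current_freq:
        return current_freq
    # SCT: count repeated identical suggestions before trusting them.
    sct[target] = sct.get(target, 0) + 1
    if sct[target] >= confidence_bar:
        sct[target] = 0
        return target
    return current_freq
```

The confidence gate is what suppresses the temporal noise discussed on the next slide: a one-off counter spike suggests a switch once and is then ignored.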

22 TCP-Guided DVFS: Effect of Criticality Counter Threshold
(figure: stacked bars per threshold value)
−Lowest bar: time in the pre-calculated (correct) DVFS state, averaged across all barrier instances
−Central bar: learning time taken until the correct DVFS state is first reached
−Upper bar: prediction noise, i.e., time spent in erroneous DVFS states after the correct one has been reached
• A low T increases susceptibility to temporal noise
• Without good suggestion confidence, too many frequency changes and performance overhead result

23 TCP for DVFS: Results
(figure: per-benchmark energy savings; average 15% energy savings)
• Benchmarks with more load imbalance generally save more energy

24 Conclusions
• Goal 1: Accuracy
−Accurate TCPs based on simple cache statistics
• Goal 2: Low-overhead hardware
−Scalable per-core criticality counters
−TCP placed in a central location where cache information is already available
• Goal 3: Versatility
−TBB improved by 13.8% over the best known approach at 32 cores
−DVFS used to achieve 15% energy savings
−Two uses shown; many others possible

25 Meeting Points: Using Thread Criticality to Adapt Multi-core Hardware to Parallel Regions
Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Chaparro, Antonio González

26 Introduction
• Proposed applications
−Thread delaying for multi-core systems: saves energy by scaling down the frequency and voltage of the cores running non-critical threads
−Thread balancing for simultaneous multi-threaded (SMT) cores: improves overall performance by giving higher priority to the critical thread
• Meeting point thread characterization
−Identifies the critical thread of a single multithreaded application
−Estimates the slack of the non-critical threads

27 How to Find Critical Threads Dynamically?
Example: a parallelized loop from PageRank (lz77 method)
• Observations:
1. The code is already written to achieve workload balance, but imbalance still exists: CPU1 is slower than CPU0.
2. Reasons for imbalance: (i) different cache misses, (ii) different control paths

28 Identification of Critical Threads
• Identification technique
−Each thread increments a thread-private counter when it passes a meeting point
−The most critical thread is the one with the smallest counter
−A thread's slack is estimated as the difference between its counter and the slowest thread's counter
• Insertion of meeting points
−Placed at a point in the parallel region that is visited by all threads
−Can be done by the hardware, the compiler, or the programmer
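The identification technique on this slide amounts to a min-scan over the per-thread meeting-point counters; a minimal sketch (names and list representation are assumptions):

```python
def criticality_and_slack(meeting_counts):
    """Meeting-point characterization sketch.

    Each thread increments a private counter when it passes the meeting
    point; the thread with the smallest count is the most critical, and a
    thread's slack is its count minus the critical thread's count.
    """
    critical = min(range(len(meeting_counts)), key=lambda t: meeting_counts[t])
    slack = [count - meeting_counts[critical] for count in meeting_counts]
    return critical, slack
```

For example, with counts [10, 7, 12, 9], thread 1 is critical and the other threads have slacks of 3, 5, and 2 iterations that thread delaying can convert into voltage/frequency reductions.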

29 Thread Delaying
• Make non-critical threads run at a lower frequency/voltage level
−All threads then arrive at the barrier at the same time
• Alternatively, the CPUs of non-critical threads can be put into deep sleep at the barrier
−Consumes almost zero energy while asleep
−But this is not the most energy-efficient way to deal with workload imbalance

30 Thread Delaying
(figure: threads 1-4 run to the barrier at full frequency; the area under each thread, labeled A-D, represents the energy needed to execute its instructions, and early-arriving threads stall at the barrier)
• Proposal (dynamic energy = Activity × Capacitance × Voltage²):
−Reduce voltage when executing parallel threads
−Delay threads arriving early at the barrier

31 Thread Delaying
(figure: the same threads 1-4 run at reduced frequencies so that all arrive at the barrier together)

32 Thread Delaying
(figure: the energy areas A-D before and after delaying, highlighting the energy saved by the reduced-voltage execution)
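The energy argument on slide 30 follows directly from the quadratic voltage term: the delayed thread does the same work (same activity and capacitance) at a lower voltage, so its dynamic energy shrinks even though it runs longer. A worked sketch, with hypothetical operating points:

```python
def dynamic_energy(activity, capacitance, voltage):
    # E = A * C * V^2 (dynamic energy only; leakage ignored for simplicity)
    return activity * capacitance * voltage ** 2

# Hypothetical operating points: same work, lower voltage for a non-critical thread.
e_full = dynamic_energy(activity=1e9, capacitance=1e-12, voltage=1.2)
e_scaled = dynamic_energy(activity=1e9, capacitance=1e-12, voltage=0.9)
saving = 1 - e_scaled / e_full  # (0.9 / 1.2)^2 = 0.5625, i.e. 43.75% saved
```

With these illustrative numbers, a 25% voltage reduction saves about 44% of the dynamic energy, which is why filling slack with lower-voltage execution beats racing to the barrier and sleeping.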

33 Implementation of Thread Delaying
• HISTORY TABLE
−An entry for each possible frequency level
−Each entry is a 2-bit up-down saturating counter
• MP-COUNTER TABLE
−Contains as many entries as there are cores in the processor
−Each entry is a 32-bit counter
−Kept consistent among all cores
• Implementation
−Each core broadcasts its counter value on every 10th execution of the meeting point instruction
−The thread delaying algorithm is then invoked
−The history table is updated
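The per-frequency-level entries are 2-bit up-down saturating counters; a minimal sketch of such a counter (the exact update and decision policy is an assumption, not the paper's):

```python
class SaturatingCounter2Bit:
    """2-bit up-down saturating counter, one per frequency level in the
    HISTORY TABLE (sketch; the confidence policy here is assumed)."""

    def __init__(self):
        self.value = 0  # saturates within 0..3

    def up(self):
        self.value = min(self.value + 1, 3)

    def down(self):
        self.value = max(self.value - 1, 0)

    def confident(self):
        # Treat the upper half of the range (2 or 3) as a confident entry.
        return self.value >= 2
```

Saturating counters like this give cheap hysteresis: a single contrary observation nudges the value but cannot flip a confident entry outright.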

34 Thread Balancing
• Speeding up a parallel application running more than one thread
−Two-way in-order SMT with an issue bandwidth of two instructions per cycle
−If both threads have ready instructions, both are allowed to issue
−If only one thread has ready instructions, it can issue up to two instructions per cycle
−If the threads belong to the same parallel application, prioritize the critical thread
• Thread balancing
−Identify the critical thread
−Give the critical thread more priority in the issue logic

35 Thread Balancing Logic
• Targeted at 2-way SMT:
−Imbalance hardware logic: identifies the critical thread
• Issue prioritization logic
−If a thread is critical and has two ready instructions, it is allowed to issue both, regardless of the number of ready instructions the non-critical thread has
−Otherwise, the base issue policy is applied
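The prioritization rule above can be sketched per cycle as follows. The fallback sharing arithmetic is an assumption standing in for the unspecified base issue policy:

```python
def issue_slots(critical_ready, noncrit_ready, bandwidth=2):
    """Per-cycle issue sketch for a 2-way in-order SMT.

    Returns (critical_issues, noncritical_issues). A critical thread with a
    full bandwidth's worth of ready instructions issues them all; otherwise
    a simple shared base policy (assumed) fills the remaining slots.
    """
    if critical_ready >= bandwidth:
        return bandwidth, 0  # critical thread takes the whole issue width
    crit = critical_ready
    noncrit = min(noncrit_ready, bandwidth - crit)
    return crit, noncrit
```

Note the rule only bites when both threads could fill the pipe; when the critical thread has fewer ready instructions, the non-critical thread still uses the leftover slots, so bandwidth is never wasted.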

36 Simulation Framework and Benchmarks
• SoftSDV for Intel64/IA32 processors
−Simulates multithreaded primitives, including locks, synchronization operations, shared memory, and events
• RMS (Recognition, Mining, and Synthesis) benchmarks
−Highly data-intensive and highly parallel (computer vision, data mining, etc.)
−Parallelized with pthreads or OpenMP
−99% of total execution is parallel for all except FIMI (28% coverage)

37 Performance Results for Thread Delaying
• The baseline is aggressive
−Every core runs at full speed and stops when its work is completed; once a core stops, it consumes zero power
• 4%-44% energy savings
−Savings come from the large frequency decreases on non-critical threads

38 Performance Results for Thread Balancing
• The baseline is aggressive
−Every core runs at full speed and stops when its work is completed
• Performance benefit ranges from 1% to 20%
−The benefit correlates with the level of imbalance

39 Conclusions
• Meeting point thread characterization dynamically estimates the criticality of the threads in a parallel execution
• Thread delaying combines per-core DVFS with meeting point thread characterization to reduce energy consumption on non-critical threads
• Thread balancing gives the critical thread higher priority in the issue queue of an SMT core

40 Comparison of the Two Papers

Criterion | Thread criticality predictor | Meeting points
Target | A range of parallelization schemes beyond parallel loops | Parallel loops
Critical thread identification | Cache behavior (L1 and L2 cache misses) | Meeting point counts
Performance balancing method | Task stealing from the critical thread | Prioritizing the critical thread
Needs extra hardware support | Yes | Yes
Energy saving technique | DVFS | DVFS
Benchmarks | SPLASH-2 and PARSEC | RMS
Evaluation | ARM simulator, GEMS simulator, FPGA-based emulator | SoftSDV

41 Critiques
• Paper 1
−It does not explain how the SST values are calculated
−The accuracy of barrier-based DVFS depends on those pre-calculated SST values
• Paper 2
−The number of times each thread visits the meeting point must be roughly the same, so meeting point thread characterization cannot handle variable loop iteration sizes
−It works well only for parallel loops and fails for large parallel regions that contain no parallel loop
−It may not always be feasible for hardware to detect a parallel loop and insert the meeting point

42 Thank you!

