1 The Thrifty Barrier: Energy-Aware Synchronization in Shared-Memory Multiprocessors
Jian Li and José F. Martínez, Computer Systems Laboratory
Michael C. Huang, Electrical & Computer Engineering

2 The Thrifty Barrier – Li, Martínez, and Huang
Motivation
• Multiprocessor architectures sprouting everywhere
  - large compute servers
  - small servers, desktops
  - chip multiprocessors
• High energy consumption a problem, more so in MPs
• Most power-aware techniques tailored to uniprocessors
• Multiprocessors present unique challenges
  - processor co-ordination, synchronization

3 Case: Barrier Synchronization
• Fast threads spin-wait for slower ones
• Spin-wait wasteful by definition
  - quick reaction, but only last iteration useful
[Figure legend: spin-wait, compute]

4 Proposal: Thrifty Barrier
• Reduce spin-wait energy waste in barriers
  - leverage existing processor sleep states (e.g. ACPI)
• Minimize impact on execution time
  - achieve timely wake-up
[Figure labels: conventional, thrifty]

5 Challenges
• Should sleep?
  - transition times (sleep + wake-up) non-negligible
• What sleep state?
  - more energy savings → longer transition times
• When to wake up?
  - early w.r.t. barrier release → may hurt energy savings
  - late w.r.t. barrier release → may hurt performance
Must predict barrier stall time accurately

6 Findings
• Many barrier stall times large enough to leverage sleep states
• Stall times predictable
  - discriminate through PC indexing
  - predict indirectly using barrier interval times
• Timely wake-up: combination of two mechanisms
  - coherence message bounds wake-up latency
  - watchdog timer anticipates wake-up

7 Thrifty Barrier Mechanism
[Flow diagram labels: barrier arrival, stall time prediction, sleep? (no), sleep states S1/S2/S3, wake-up signal, residual spin, barrier departure]

8 Sleep Mechanism
[Flow diagram labels: barrier arrival, stall time prediction, sleep? (no), sleep states S1/S2/S3, wake-up signal, residual spin, barrier departure]

9 Predicting Stall Time
• Splash-2’s FMM example: 3 important barriers, 4 iterations
  - randomly picked thread (always the same)
• PC indexing reduces variability
• Interval time (BIT) more stable metric than stall time (BST)

10 Stall Time vs. Interval Time
• Barriers separate computation phases
  - PC indexing reduces variability
• Barrier stall time (BST) varies considerably even with PC indexing
  - barrier-, but also thread-dependent
    - computation shifts among threads across invocations
• Barrier interval time (BIT) varies much less
  - quite stable if PC indexing used
  - barrier-, but not thread-dependent
  - last-value prediction ok for most applications
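The PC-indexed, last-value prediction of BIT mentioned on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `LastValueBITPredictor`, its method names, and the use of a dictionary as the predictor table are all assumptions made for the example.

```python
class LastValueBITPredictor:
    """Last-value predictor for barrier interval time (BIT).

    Hypothetical sketch: one entry per barrier, indexed by the barrier's PC,
    holding the most recently observed BIT for that barrier.
    """

    def __init__(self):
        # barrier PC -> last observed BIT (any consistent time unit, e.g. cycles)
        self._last_bit = {}

    def predict(self, barrier_pc):
        """Return the predicted BIT, or None if this barrier has not been seen."""
        return self._last_bit.get(barrier_pc)

    def update(self, barrier_pc, observed_bit):
        """Record the actual BIT once the barrier instance completes."""
        self._last_bit[barrier_pc] = observed_bit
```

Indexing by PC keeps the history of one barrier from polluting the prediction for another, which is the role the slides assign to PC indexing.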

11 Predicting Stall Time Indirectly
• Can use BIT to predict BST indirectly
  - compute time measurable upon arrival to barrier
  - subtract from predicted BIT to derive predicted BST
• How to manage time info?
[Figure labels: BIT, BST_t, Compute_t]

12 Managing Time Info
• Threads depart from barrier instance b-1 toward instance b
• Each thread t has local record of release timestamp BRTS_{t,b-1}
• Assumptions:
  - no global clock
  - local wallclock active even if CPU sleeps
    - all CPUs same nominal clock frequency
[Figure labels: b-1, b, BRTS_{t,b-1}]

13 Managing Time Info
• Thread t arrives, knowing BRTS_{t,b-1}, Compute_{t,b}
  - make prediction pBIT_b
  - derive pBST_{t,b} = pBIT_b - Compute_{t,b}
  - use pBST_{t,b} to pick sleep state (if warranted)
    - best fit based on transition time
[Figure labels: b-1, b, pBIT_b, pBST_{t,b}, Compute_{t,b}, BRTS_{t,b-1}]
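The arrival step on this slide (derive the predicted stall time, then pick a sleep state by transition time) might look like the sketch below. It rests on stated assumptions: the `SleepState` type, the function names, and any transition-time values are hypothetical, and "best fit" is interpreted here as the deepest state whose total transition time still fits within the predicted stall time. The slides do not spell out the exact rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SleepState:
    name: str
    # Assumed: total sleep + wake-up transition time, in the same unit as BIT.
    # Deeper states are assumed to have longer transition times.
    transition_time: int


def predict_stall_time(p_bit: int, compute: int) -> int:
    """pBST = pBIT - Compute, as stated on the slide."""
    return p_bit - compute


def pick_sleep_state(p_bst: int,
                     states: Sequence[SleepState]) -> Optional[SleepState]:
    """Return the deepest state that fits in p_bst, or None to keep spinning."""
    fitting = [s for s in states if s.transition_time <= p_bst]
    if not fitting:
        return None  # predicted stall too short (or negative): sleeping not warranted
    return max(fitting, key=lambda s: s.transition_time)
```

If the thread arrives later than the predicted interval, the derived stall time is negative and the function returns `None`, which corresponds to the "No" branch of the sleep decision.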

14 Managing Time Info
• Last thread u arrives, knowing BRTS_{u,b-1}
  - derive actual BIT_b = time() - BRTS_{u,b-1}
  - update (shared) predictor with BIT_b
  - release barrier
[Figure labels: b-1, b, BIT_b, BRTS_{u,b-1}]

15 Managing Time Info
• Every thread t (possibly after waking up late)
  - read BIT_b from updated predictor
  - compute actual BRTS_{t,b} = BRTS_{t,b-1} + BIT_b
• Threads never use timestamps (BRTS) from other threads
  - no global clock is needed
[Figure labels: b-1, b, BIT_b, BRTS_{t,b-1}, BRTS_{t,b} *]
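The timestamp bookkeeping across the "Managing Time Info" slides can be illustrated with a small sketch. It is only a model under assumptions: the class `ThreadTimeRecord` and its method names are hypothetical, and local clocks are represented as plain integers that differ between threads by a constant offset (same nominal frequency, no global clock).

```python
class ThreadTimeRecord:
    """Per-thread record of the last barrier release timestamp (BRTS).

    All timestamps are readings of this thread's own local clock.
    """

    def __init__(self, initial_brts):
        self.brts = initial_brts  # BRTS for the previous barrier instance (b-1)

    def compute_time(self, local_now):
        """Compute time measured on arrival at barrier b."""
        return local_now - self.brts

    def measure_bit(self, local_now):
        """Called by the last-arriving thread: BIT = time() - BRTS of b-1."""
        return local_now - self.brts

    def advance(self, bit):
        """After release: BRTS of b = BRTS of b-1 + BIT, even if woken late."""
        self.brts += bit
        return self.brts
```

Because each thread only adds the shared interval `bit` to its own previous timestamp, a thread that wakes up late still obtains the correct release time in its own clock domain, without reading any other thread's timestamp.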

16 Thrifty Barrier Mechanism
[Flow diagram labels: barrier arrival, stall time prediction, sleep? (no), sleep states S1/S2/S3, wake-up signal, residual spin, barrier departure]

17 Wake-up Mechanism
[Flow diagram labels: barrier arrival, stall time prediction, sleep? (no), sleep states S1/S2/S3, wake-up signal, residual spin, barrier departure]

18 Wake-up Mechanism
• Communicate barrier completion to sleeping CPUs
  - signal sent to CPU pin
  - options: external vs. internal wake-up
• External (passive): initiated by processor that releases barrier
  - leverage coherence protocol
    - invalidation to spinlock
  - must supply spinlock address to cache controller
• Internal (active): triggered by watchdog timer
  - program with predicted BST before going to sleep

19 Early vs. Late Wake-up
• Early wake-up (underprediction)
  - energy waste: residual spin
• Late wake-up (overprediction)
  - possible impact on execution time
• External wake-up guarantees late wake-up (but bounded)
• Internal wake-up can lead to both (late not bounded)
• Our approach: hybrid wake-up
  - external provides upper bound
  - internal strives for timely wake-up using prediction
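A simple timing model may help show why the hybrid scheme bounds late wake-up. The sketch below is an assumption-laden illustration rather than the hardware mechanism itself: the function `hybrid_wakeup` and its parameters are hypothetical, and it assumes the watchdog timer is programmed to fire one wake-up latency before the predicted release, which the slides do not state explicitly.

```python
def hybrid_wakeup(arrival, predicted_bst, actual_release, wake_latency):
    """Model when a sleeping CPU is awake under hybrid wake-up.

    Returns (awake_at, residual_spin, lateness), all in the caller's time unit.
    """
    # Internal wake-up: watchdog timer based on the predicted stall time.
    # Assumption: the timer fires early enough to hide the wake-up latency.
    timer_fire = arrival + max(predicted_bst - wake_latency, 0)

    # External wake-up: the coherence invalidation arrives at barrier release.
    # Whichever signal comes first starts the wake-up transition.
    wake_start = min(timer_fire, actual_release)
    awake_at = wake_start + wake_latency

    residual_spin = max(actual_release - awake_at, 0)  # early: energy waste
    lateness = max(awake_at - actual_release, 0)       # late: performance impact
    return awake_at, residual_spin, lateness
```

In this model an underprediction shows up as residual spin, while an overprediction is cut short by the external signal, so lateness never exceeds `wake_latency`.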

20 Other Considerations (see paper)
• Sleep states that do not snoop for coherence requests
  - flush dirty data before sleeping
  - defer invalidations to clean data
• Overprediction threshold
  - case of frequent, swinging BITs of modest size
  - turn off prediction if overpredict beyond threshold
• Interaction with context switching and I/O
  - underprediction threshold
• Time sharing issues: multiprogramming, overthreading
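The overprediction threshold is only named on this slide; the details are deferred to the paper. As a rough idea of what "turn off prediction if overpredict beyond threshold" could mean, here is a hypothetical sketch. The class `OverpredictionGate`, the absolute threshold, and the choice never to re-enable prediction are all assumptions of this example, not details taken from the source.

```python
class OverpredictionGate:
    """Disable prediction-driven sleep after a large overprediction.

    Hypothetical sketch: compares predicted and actual BIT for one barrier.
    """

    def __init__(self, threshold):
        self.threshold = threshold  # assumed: maximum tolerated overprediction
        self.enabled = True

    def record(self, predicted_bit, actual_bit):
        """Call after each barrier instance with the predicted and actual BIT."""
        if predicted_bit - actual_bit > self.threshold:
            # Overpredicted beyond the threshold: fall back to plain spinning.
            self.enabled = False
        return self.enabled
```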

21 Experimental Setup
• Simulated system: 64-node CC-NUMA
  - 6-way dynamic superscalar
  - L1 16KB 64B 2-way 2clk; L2 64KB 64B 8-way 12clk
  - 16B/4clk memory bus, 60ns SDRAM
  - hypercube, wormhole, 4clk pipelined routers
    - 16clk pin to pin
• Energy modeling: Wattch (CPU + L1 + L2)
  - sleep states along lines of Pentium family

22 Experimental Setup
• All Splash-2 applications except:
  - Raytrace: no barriers
  - LU: better version w/o barriers widely available
• Efficiency (64p) 40-82%, avg. 58%
[Figure label: Target Group ≥ 10%]

23 Energy Savings

24 Performance Impact

25 Related Work Highlights
• Quite a bit of work in uniprocessor domain
• Elnozahy et al.
  - server farms, clusters
    - thrifty barrier targets shared memory, parallel apps.
• Moshovos et al., Saldanha and Lipasti
  - energy-aware cache coherence
    - prob. compatible with and complementary to thrifty barrier

26 Conclusions
• Energy-aware MP mechanisms can and should be pursued
• Case of energy-aware barrier synchronization
  - simple indirect prediction of barrier stall time
  - hybrid wake-up scheme to minimize impact on exec. time
• Encouraging results; target applications
  - 17% avg. energy savings
  - 2% avg. performance impact
