Presentation transcript: "Transparent Threads"

1 TRANSPARENT THREADS
Gautham K. Dorai and Dr. Donald Yeung
ECE Dept., Univ. Maryland, College Park

2 SMT Processors
SMT processors pipeline multiple threads simultaneously, choosing among them with fetch-priority mechanisms: ICOUNT, IQPOSN, MISSCOUNT, BRCOUNT [Tullsen ISCA'96].

3 Individual Threads Run Slower!
When threads share the machine with equal priority, individual thread performance degrades by 31%.

4 Single-Thread Performance
Workloads where single-thread performance matters:
- Multiprogramming (process scheduling)
- Subordinate threading (prefetching/pre-execution, cache management, branch prediction, etc.)
- Performance monitoring (dynamic profiling)

5 Transparent Threads
The foreground thread runs at full speed; the background thread runs transparently, causing 0% slowdown to the foreground thread.

6 Single-Thread Performance
- Multiprogramming: latency of a critical high-priority process
- Subordinate threading and performance monitoring: benefit vs. cost (overhead) tradeoff

7 Road Map
- Motivation
- Transparent Threads
- Experimental Evaluation
- Transparent Software Prefetching
- Conclusion

8 Shared vs. Private
Transparency – no stealing of shared resources.
[Pipeline diagram: PC, fetch queue, I-cache, branch predictor, register map, issue queues, ROB, register file, functional units, D-cache]

9 Slots, Buffers, and Memories
- SLOTS – allocation based on the current cycle only
- BUFFERS – allocation based on future cycles
- MEMORIES – allocation based on future cycles
[Pipeline diagram as on slide 8, with the resources grouped into these three classes]

10 Slot Prioritization
Fetch slots are assigned by ICOUNT 2.N, with the background thread's counter biased so it wins only slots the foreground thread leaves unused:
ICOUNT(background) = ICOUNT(background) + instruction-window size
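A minimal C sketch of this bias, under assumed structures (thread_t, INST_WINDOW_SIZE, and pick_fetch_thread are illustrative names, not from the slides):

  #define INST_WINDOW_SIZE 128  /* ROB size in the simulated configuration */

  typedef struct {
      int icount;         /* this thread's in-flight instruction count */
      int is_background;  /* 1 if the thread must stay transparent */
  } thread_t;

  /* Background threads see their ICOUNT inflated by the window size, so
     they lose every fetch slot the foreground thread actually wants. */
  static int effective_icount(const thread_t *t) {
      return t->icount + (t->is_background ? INST_WINDOW_SIZE : 0);
  }

  /* ICOUNT fetches from the thread with the fewest in-flight instructions. */
  static int pick_fetch_thread(const thread_t *threads, int n) {
      int best = 0;
      for (int i = 1; i < n; i++)
          if (effective_icount(&threads[i]) < effective_icount(&threads[best]))
              best = i;
      return best;
  }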

13 Buffer Transparency
[Diagram: fetch hardware with per-thread PCs (PC1, PC2) feeding a shared fetch queue, issue queue, and ROB holding both foreground and background entries]

14 Background Thread Window Partitioning
A limit is placed on the number of background thread instructions in the window; background fetch stops when its ICOUNT reaches the limit.

15 Background Thread Window Partitioning
The foreground thread is not partitioned and can occupy all available entries.
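A sketch of the fetch gate this limit implies (names are illustrative; 32 is the partition size used later in the evaluation):

  #define BG_PARTITION 32  /* background-thread instruction limit */

  typedef struct {
      int icount;
      int is_background;
  } thread_t;

  /* Background fetch stops at the partition limit; the foreground thread
     is never gated, so it can claim every remaining window entry. */
  static int may_fetch(const thread_t *t) {
      return !(t->is_background && t->icount >= BG_PARTITION);
  }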

16 Background Thread Flushing
An alternative to partitioning: place no limit on the background thread's instructions.

17 Background Thread Flushing
No limit on the background thread; instead, a flush is triggered when its instructions conflict with the foreground thread's demand for entries.

18 Background Thread Flushing
On a conflict, the background thread's instructions are flushed from the tail of the window, freeing entries for the foreground thread.
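A sketch of the flush, assuming a circular-buffer ROB (the layout, field names, and stop-at-foreground simplification are assumptions, not the paper's implementation):

  #define ROB_SIZE 128

  typedef struct { int thread_id; } rob_entry_t;

  typedef struct {
      rob_entry_t slots[ROB_SIZE];
      int head, tail, count;  /* circular buffer; tail = next free slot */
  } rob_t;

  /* On a conflict, squash background instructions from the tail until the
     foreground thread has the entries it needs; the background thread's
     PC is assumed to be rewound elsewhere so it can refetch later. */
  static void flush_background(rob_t *rob, int bg_id, int needed) {
      while (needed > 0 && rob->count > 0) {
          int last = (rob->tail + ROB_SIZE - 1) % ROB_SIZE;
          if (rob->slots[last].thread_id != bg_id)
              break;  /* simplification: stop at the first foreground entry */
          rob->tail = last;
          rob->count--;
          needed--;
      }
  }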

19 Foreground Thread Flushing
On a load miss at the ROB head, the foreground thread's instructions remain stagnant in the ROB; a flush of the stagnated entries is triggered.

20 Foreground Thread Flushing
Flush F entries from the tail of the ROB and block the thread's fetch for T cycles.

21 Foreground Thread Flushing
After T cycles, the thread is allowed to fetch again. F and T depend on R, the residual cache-miss latency.
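A sketch of the policy; the slides do not give the functions mapping R to F and T, so the linear choices below are placeholders only:

  typedef struct {
      int fetch_blocked_until;  /* cycle at which fetch may resume */
  } fg_state_t;

  /* On a load miss at the ROB head, flush F entries from the tail and
     block fetch for T cycles, sized from the residual miss latency R so
     the flushed work is rebuilt roughly when the miss resolves. */
  static void on_head_load_miss(fg_state_t *fg, int now, int R,
                                int *F, int *T) {
      *F = R;      /* placeholder: entries to flush from the tail */
      *T = R / 2;  /* placeholder: cycles to keep fetch blocked   */
      fg->fetch_blocked_until = now + *T;
      /* the actual ROB flush of *F tail entries happens in the caller */
  }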

22 SimpleScalar-based SMT
- Number of contexts: 4
- Issue width: 8
- IFQ size: 32
- Int/FP issue queues
- ROB size: 128
- Branch predictor: gshare + bimodal
- L1 caches: split, 32K, 4-way
- L2 unified cache: 512K, 4-way
- L1/L2/memory latency: 1/10/122 cycles

23 Benchmark Suites
Benchmarks used to evaluate the transparency mechanisms and transparent software prefetching:
- VPR, BZIP, GZIP, GAP: SPECint 2000
- EQUAKE, ART, AMMP: SPECfp 2000
- IRREG: PDE solver

24 Transparency Mechanisms
- EP: equal priority
- SP: slot prioritization
- BP: background thread window partitioning (32 entries)
- BF: background thread flushing
- PC: private caches
- PP: private predictor

25 Transparency Mechanisms
Foreground-thread slowdown under each mechanism:
- Equal priority (EP): 30%
- Slot prioritization (SP): 16%
- Background thread window partitioning (BP): 9%
- Background thread flushing (BF): 3%

26 Performance Mechanisms
- EP: equal priority
- 2B: ICOUNT 2.8 (baseline)
- 2F: ICOUNT 2.8 with flushing
- 2P: foreground thread window partitioning (112 foreground + 32 background entries)

27 Performance Mechanisms
Normalized IPC:
- Equal priority: 31% degradation of the foreground thread
- ICOUNT 2.8: slower than EP
- ICOUNT 2.8 with foreground thread flushing: 23% slower than EP
- Foreground thread window partitioning: 13% slower than EP

28 Transparent Software Prefetching
Conventional software prefetching in-lines the prefetch code with the computation:
  for (i = 0; i < N-PD; i += 8) { prefetch(&b[i+PD]); b[i] = z + b[i]; }
This faces a benefit vs. cost tradeoff: profiling is required to judge the profitability of prefetching.
Transparent software prefetching offloads the prefetch code to a transparent thread:
  Computation thread:          for (i = 0; i < N-PD; i += 8) { b[i] = z + b[i]; }
  Transparent prefetch thread: for (i = 0; i < N-PD; i += 8) { prefetch(&b[i+PD]); }
Zero overhead – no profiling required.
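The slides assume SMT hardware contexts, but the offloading idea can be approximated on a commodity machine. In this hedged sketch, a pthread stands in for the spare context and GCC/Clang's __builtin_prefetch for the prefetch instruction; N, PD, b, and z are illustrative:

  #include <pthread.h>

  #define N  (1 << 20)
  #define PD 64              /* prefetch distance, in elements */

  static double b[N];
  static double z = 3.0;

  /* Prefetch thread: issues only prefetches, ahead of the computation,
     and does no work that could slow the main thread. */
  static void *prefetch_thread(void *arg) {
      (void)arg;
      for (long i = 0; i < N - PD; i += 8)
          __builtin_prefetch(&b[i + PD]);
      return 0;
  }

  int main(void) {
      pthread_t pf;
      pthread_create(&pf, 0, prefetch_thread, 0);  /* offloaded prefetch code */
      for (long i = 0; i < N - PD; i += 8)         /* computation thread */
          b[i] = z + b[i];
      pthread_join(pf, 0);
      return 0;
  }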

29 Transparent Software Prefetching
Normalized execution time for VPR under four schemes:
- NP: no prefetching
- PF: naive conventional software prefetching
- PS: profiled conventional software prefetching
- TSP: transparent software prefetching

30 Transparent Software Prefetching
- Naive software prefetching: 19.6% overhead, 0.8% performance gain
- Selective (profiled) software prefetching: 14.13% overhead, 2.47% performance gain
- Transparent software prefetching: 1.38% overhead, 9.52% performance gain
[Chart: NP/PF/PS/TSP bars for VPR, BZIP, GAP, EQUAKE, ART, AMMP, IRREG]

31 Conclusions
- Transparency mechanisms: 3% overhead on the foreground thread; less than 1% without cache and predictor contention
- Throughput mechanisms: background thread throughput within 23% of equal priority
- Transparent software prefetching: 9.52% gain with 1.38% overhead; eliminates the need for profiling
- Spare bandwidth is available and can be used transparently for interesting applications

32 Related Work
- Tullsen's work on flushing mechanisms [Tullsen Micro-2001]
- Raasch's work on prioritization [Raasch MTEAC Workshop 1999]
- Snavely's work on job scheduling [Snavely ICMM-2001]
- Chappell's work on subordinate multithreading [Chappell ISCA-1999] and Dubois's work on assisted execution [Dubois Tech-Report Oct'98]

33 Foreground Thread Window Partitioning
- Advantage: a minimal number of window entries is guaranteed
- Disadvantage: transparency is minimized

34 Benchmark Suites
Name, type, and ROB occupancy:
- VPR: SPECint 2000, medium
- BZIP: SPECint 2000, high
- GZIP: SPECint 2000, low
- GAP: SPECint 2000
- EQUAKE: SPECfp 2000
- ART: SPECfp 2000
- AMMP: SPECfp 2000
- IRREG: PDE solver
Used to evaluate the transparency mechanisms and transparent software prefetching.

35 Transparency Mechanisms
[Chart: per-benchmark results comparing EP, SP, and BF]

36 Transparency Mechanisms
[Chart: per-benchmark results comparing EP, SP, and BF]

37 Transparency Mechanisms

38 Transparent Software Prefetching
[Chart: per-benchmark results comparing NP, PF, PS, TSP, and NF]

