1 SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida

2 Scalable multi-threading
Directory-based hardware DSM
 Directory-based coherence: complex MCs
 So complex that MCs can be programmable with embedded protocol processors
Integrated memory controllers are commonplace in high-end microprocessors
 Servers are naturally NUMA/DSM, not SMP
 Snooping is awkward and BW-limited
This talk: build directory-based scalable DSM with nominal changes to a standard MC

3 Two major goals
Directory-based coherence without a directory controller
 Still scalable
 Can use less complex standard memory controllers
Flexibility in using custom protocol code or any software sequence to do “interesting things” on cache misses
 Compression/encryption
 Fault tolerance

4 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
Related work
Conclusions

5 Introducing SMTp
SMTp: SMT with a protocol thread context
The protocol thread executes the control part of the coherence protocol in parallel with the SDRAM data access
Provides the flexibility to run custom software sequences on cache misses [motivation #1]
Still uses the standard MC (no directory state machine) [motivation #2]
Build large-scale directory-based DSM out of commodity nodes with an integrated MC and SMTp

6 Outline
Introducing SMTp
 Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
Related work
Conclusions

7 Basic extensions for SMTp
[Block diagram: the SMT pipeline (fetch, decode/rename, issue queues, LSQ, FP queue, register file, ALU/AGU/FPU, L1 caches) with the L2 cache and integrated memory controller; SMTp adds the PPCV, LA, and LDCTXT_ID registers plus instruction, data, and L2 bypass buffers (16x64B, 16x32B, 16x128B), with separate paths for application misses, protocol misses, uncached loads/stores, and L1 misses]

8 Memory controller for SMTp
[Block diagram: the integrated memory controller with local miss interface, handler dispatch, and network interface (NI in/out queues, 8x128B) connected to the SDRAM and router; address/header registers and PPCV, LA, LDCTXT_ID feed the protocol thread, with separate paths for application misses, protocol misses, uncached loads/stores, application data, and protocol data]

9 Enabling a protocol thread
Statically bound to a thread context
 Needs an extra thread context (PC, RAS, register map)
 No context switch
Not visible to the kernel
Protocol code is provided by the system (conventional DSM style)
The user cannot download arbitrary code into protocol memory
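A rough C sketch of the extra per-context state this implies (a sketch only; the field names and sizes such as RAS_DEPTH are assumptions, not taken from the paper):

    #include <stdint.h>

    #define NUM_ARCH_REGS 32   /* MIPS-style architectural registers  */
    #define RAS_DEPTH     16   /* assumed return-address-stack depth  */

    /* Hardware state added for the statically bound protocol context. */
    struct protocol_context {
        uint64_t pc;                      /* protocol program counter          */
        uint64_t ras[RAS_DEPTH];          /* private return address stack      */
        uint8_t  reg_map[NUM_ARCH_REGS];  /* architectural-to-physical reg map */
        /* No kernel-visible state: the context is never scheduled or switched. */
    };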

10 Anatomy of a protocol handler
MIPS-style RISC ISA
Short sequence of instructions
 Calculate directory address // simple hash function
 Load directory entry // normal cached load
 Compute on header and directory // integer arithmetic
 Send cache line/control message // uncached stores
 switch r17 // uncached load (header)
 ldctxt r18 // uncached load (address)
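The same handler anatomy, sketched in C rather than MIPS assembly (illustrative only; dir_entry_for, send_line, NODE_MASK, and the header layout are hypothetical names, not the actual SMTp protocol code):

    #include <stdint.h>

    #define NODE_MASK 0x1FULL            /* assumed field holding the requester id */

    typedef struct { uint64_t bits; } dir_entry_t;

    extern dir_entry_t *dir_entry_for(uint64_t addr);  /* simple hash function   */
    extern void send_line(int node, uint64_t addr);    /* uncached stores to NI  */

    /* Sketch of a read-miss handler; the real handlers are short MIPS-style
       instruction sequences with this overall shape. */
    void handle_read_miss(uint64_t header, uint64_t addr) {
        dir_entry_t *e = dir_entry_for(addr);        /* calculate directory address */
        uint64_t dir   = e->bits;                    /* normal cached load          */
        int requester  = (int)(header & NODE_MASK);  /* compute on header/directory */
        e->bits = dir | (1ULL << requester);         /* record the new sharer       */
        send_line(requester, addr);                  /* send cache line / message   */
    }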

11-16 Fetching from protocol thread
[Animation across slides 11-16 on one block diagram: a miss arrives through the network interface or local miss interface, the handler dispatch unit indexes a jump table to produce the protocol PC and set PPCV, the pending switch is unblocked, ldctxt executes, and the handler instructions are fetched from SDRAM at the home node]

17 Fetching from protocol thread
Protocol code/data reside in an unmapped portion of local SDRAM
No ITLB access
Shares the instruction cache with application thread(s)
The fetcher turns off PPCV after the last handler instruction is fetched
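A minimal sketch of the fetch-steering decision described above (the function and argument names are assumptions; only the PPCV behavior and the no-ITLB property come from the slides):

    #include <stdint.h>

    /* Pick the next fetch PC.  When PPCV is set, fetch from the protocol
       program counter, which is a physical address in unmapped local SDRAM,
       so no ITLB lookup is needed; PPCV is cleared once the last handler
       instruction has been fetched. */
    uint64_t next_fetch_pc(uint64_t *ppc, int *ppcv,
                           int last_handler_inst, uint64_t app_pc) {
        if (*ppcv) {
            uint64_t pc = *ppc;
            if (last_handler_inst)
                *ppcv = 0;      /* turn off PPCV after the last handler fetch */
            return pc;
        }
        return app_pc;          /* otherwise fetch from an application thread */
    }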

18 Handling protocol load/store
No DTLB access
Shares the L1 data and L2 caches
An L2 cache miss from the protocol thread behaves differently
 Bypasses the Local Miss Interface
 Talks to local SDRAM directly
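A minimal sketch of this routing decision (the enum and function names are assumptions):

    /* Route an L2 miss: protocol-thread misses bypass the Local Miss
       Interface and go straight to local SDRAM, since protocol code and
       data live in unmapped local memory; application misses go through
       the Local Miss Interface and may dispatch a protocol handler. */
    enum miss_path { VIA_LOCAL_MISS_INTERFACE, DIRECT_TO_LOCAL_SDRAM };

    enum miss_path route_l2_miss(int from_protocol_thread) {
        return from_protocol_thread ? DIRECT_TO_LOCAL_SDRAM
                                    : VIA_LOCAL_MISS_INTERFACE;
    }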

19 Outline
Introducing SMTp
Basic extensions for SMTp
 Deadlock avoidance
Evaluation methodology
Simulation results
Related work
Conclusions

20 Deadlock with shared resources
Progress of an application L2 miss depends on progress of the protocol thread
Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache indices
[Diagram: a LOAD at the ROB retire pointer suffers an L2 miss and waits on the local miss handler, but the protocol instructions are BLOCKED because the integer queue (IQ) is full]

21 Solving resource deadlock
General solution: one reserved instance
 Out of 8 decode queue slots, application threads get 7, while all 8 are open to the protocol thread (see the sketch below)
Easier solution: Pentium 4-style static resource partitioning
Cache index conflicts
 Solution: L1 and L2 bypass buffers (fully associative, LRU)
 Allocate a bypass buffer entry instead
 Parallel lookup: hit latency unchanged
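A minimal C sketch of the one-reserved-instance check for the decode-queue example (illustrative function; only the 7-of-8 policy comes from the slide):

    #define DECODE_QUEUE_SLOTS 8

    /* Application threads may occupy at most 7 of the 8 decode queue slots;
       the protocol thread may use all 8, so it can always make progress and
       eventually free the resources the application miss is waiting on. */
    int may_allocate_decode_slot(int slots_in_use, int is_protocol_thread) {
        int limit = is_protocol_thread ? DECODE_QUEUE_SLOTS
                                       : DECODE_QUEUE_SLOTS - 1;
        return slots_in_use < limit;
    }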

22 SMTp: deadlock solution
[Block diagram from slide 7, with the deadlock-avoidance additions highlighted: the instruction, data, and L2 bypass buffers (16x64B, 16x32B, 16x128B) and the protocol-thread registers PPCV, LA, and LDCTXT_ID]

23 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
 Evaluation methodology
Simulation results
Related work
Conclusions

24 Evaluation methodology
Applications
 SPLASH-2: FFT, LU, Radix, Ocean, Water
 FFTW
Simulated machine model (details in the paper)
 2 GHz, 9 pipe stages
 1, 2, or 4 app. threads + one protocol context
 ROB: 128 (per thread)
 Integer/floating-point registers: 160/192/256
 L1 I-cache: 32 KB/64B/2-way/LRU/1 cycle
 L1 D-cache: 32 KB/32B/2-way/LRU/1 cycle
 Unified L2: 2 MB/128B/8-way/LRU/9 cycles

25 Simulated machine models

Model       MC        PP       MC, PP frequency   Protocol D$
Base        Non-int.  2-issue  400 MHz            512 KB DM
IntPerfect  Int.      2-issue  Proc. core         Perfect
Int512KB    Int.      2-issue  ½ core             512 KB DM
Int64KB     Int.      2-issue  ½ core             64 KB DM
SMTp        Int.      None     ½ core             None

(MC: memory controller; PP: protocol processor; Int./Non-int.: integrated/non-integrated; DM: direct-mapped)

26 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
 Simulation results
Related work
Conclusions

27 Single node (1app,1prot) results

28 Single node (2app,1prot) results

29 Single node results: summary
Memory controller integration helps
 Ocean and FFTW get the maximum benefit
 LU and Water are largely insensitive
SMTp is always faster than Base
SMTp performs on par with Int512KB
 In a few cases Int512KB outperforms SMTp, by at most 1.6%
Int64KB suffers from directory cache misses
 FFTW and Radix-Sort are the most sensitive

30 32-node (1app,1prot) results

31 32-node (2app,1prot) results

32 Multi-node results: summary
With increasing system size, the integrated models converge in performance
IntPerfect gets a slight edge due to its double memory controller speed
SMTp continues to deliver excellent performance
The gap between Int512KB and SMTp is at most 6%, and zero on average

33 Resource occupancy: summary
The protocol thread is active for a very small fraction of time (low protocol occupancy)
When active, it can have high peak resource occupancy
When idle, all resources are freed except
 31 mapped registers
 2 LSQ slots holding switch and ldctxt
Overall, the protocol thread has very low pipeline overhead

34 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
 Related work
Conclusions

35 Related work
Simultaneous multi-threading
 Assisted execution [HPCA’01][MICRO’01][ISCA’02]
 Fault tolerance [ASPLOS’00][ISCA’02]
 User-level message passing [MTEAC’01]
Programmable protocol engines
 Customized co-processor (FLASH, S3.mp, STiNG, Piranha)
 Commodity off-the-shelf processor (Typhoon)
 On the main processor through a low-overhead interrupt (Chalmers) [ISCA’95]

36 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
Related work
 Conclusions

37 Conclusions
First design to exploit SMT to run a directory-based coherence protocol on spare thread contexts
Delivers performance within 6% (0% on average) of integrated coherence controllers with large (512 KB) stand-alone directory data caches
Extremely low pipeline overhead
SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes

38 Future directions
Need not be restricted to building DSMs out of commodity nodes only
Use SMTp to carry out
 On-the-fly compression/encryption of L2 cache lines
 Software-controlled address remapping to improve locality of cache access
 Fault tolerance by selectively extending coherence protocols
Alternate CMP design
Issues with multiple protocol threads

39 SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida

40 Protocol occupancy: 16 nodes, (1a,1p) threads per node

41 Protocol thread characteristics: 16 nodes, (1a,1p) threads per node

