1 SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida

2 Scalable multi-threading
Directory-based hardware DSM
 Directory-based coherence: complex MCs
 So complex that MCs can be programmable with embedded protocol processors
Integrated memory controllers are commonplace in high-end microprocessors
 Servers are naturally NUMA/DSM, not SMP
 Snooping is awkward and BW-limited
This talk: build directory-based scalable DSM with nominal changes to a standard MC

3 Two major goals
Directory-based coherence without a directory controller
 Still scalable
 Can use less complex standard memory controllers
Flexibility in using custom protocol code or any software sequence to do “interesting things” on cache misses
 Compression/encryption
 Fault tolerance

4 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
Related work
Conclusions

5 Introducing SMTp
SMTp: SMT with a protocol thread context
The protocol thread executes the control part of the coherence protocol in parallel with the SDRAM data access
Provides the flexibility to run custom software sequences on cache misses [motivation #1]
Still uses the standard MC (no directory state machine) [motivation #2]
Build large-scale directory-based DSM out of commodity nodes with an integrated MC and SMTp

6 Outline
Introducing SMTp
 Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
Related work
Conclusions

7 Basic extensions for SMTp
[Block diagram: the SMT pipeline (fetch, decode/rename, issue queues, LSQ, FP queue, register file, ALU/AGU/FPU, L1 caches) with the L2 cache and integrated memory controller; SMTp adds the PPCV, LA, and LDCTXT_ID registers plus instruction, data, and L2 bypass buffers (16x64B, 16x32B, 16x128B), with separate paths for application misses, protocol misses, uncached loads/stores, and L1 misses]

8 Memory controller for SMTp
[Block diagram: the integrated memory controller with local miss interface, handler dispatch, and network interface (NI in/out queues, 8x128B) connected to the SDRAM and router; address/header registers and PPCV, LA, LDCTXT_ID feed the protocol thread, with separate paths for application misses, protocol misses, uncached loads/stores, application data, and protocol data]

9 Enabling a protocol thread
Statically bound to a thread context
 Needs an extra thread context (PC, RAS, register map)
 No context switch
Not visible to the kernel
Protocol code is provided by the system (conventional DSM style)
The user cannot download arbitrary code into protocol memory
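A rough C sketch of the extra per-context state this implies (a sketch only; the field names and sizes such as RAS_DEPTH are assumptions, not taken from the paper):

    #include <stdint.h>

    #define NUM_ARCH_REGS 32   /* MIPS-style architectural registers  */
    #define RAS_DEPTH     16   /* assumed return-address-stack depth  */

    /* Hardware state added for the statically bound protocol context. */
    struct protocol_context {
        uint64_t pc;                      /* protocol program counter          */
        uint64_t ras[RAS_DEPTH];          /* private return address stack      */
        uint8_t  reg_map[NUM_ARCH_REGS];  /* architectural-to-physical reg map */
        /* No kernel-visible state: the context is never scheduled or switched. */
    };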

10 Anatomy of a protocol handler
MIPS-style RISC ISA
Short sequence of instructions
 Calculate directory address // simple hash function
 Load directory entry // normal cached load
 Compute on header and directory // integer arithmetic
 Send cache line/control message // uncached stores
 switch r17 // uncached load (header)
 ldctxt r18 // uncached load (address)
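The same handler anatomy, sketched in C rather than MIPS assembly (illustrative only; dir_entry_for, send_line, NODE_MASK, and the header layout are hypothetical names, not the actual SMTp protocol code):

    #include <stdint.h>

    #define NODE_MASK 0x1FULL            /* assumed field holding the requester id */

    typedef struct { uint64_t bits; } dir_entry_t;

    extern dir_entry_t *dir_entry_for(uint64_t addr);  /* simple hash function   */
    extern void send_line(int node, uint64_t addr);    /* uncached stores to NI  */

    /* Sketch of a read-miss handler; the real handlers are short MIPS-style
       instruction sequences with this overall shape. */
    void handle_read_miss(uint64_t header, uint64_t addr) {
        dir_entry_t *e = dir_entry_for(addr);        /* calculate directory address */
        uint64_t dir   = e->bits;                    /* normal cached load          */
        int requester  = (int)(header & NODE_MASK);  /* compute on header/directory */
        e->bits = dir | (1ULL << requester);         /* record the new sharer       */
        send_line(requester, addr);                  /* send cache line / message   */
    }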

11-16 Fetching from protocol thread
[Animation across slides 11-16 on one block diagram: a miss arrives through the network interface or local miss interface, the handler dispatch unit indexes a jump table to produce the protocol PC and set PPCV, the pending switch is unblocked, ldctxt executes, and the handler instructions are fetched from SDRAM at the home node]

17 Fetching from protocol thread
Protocol code/data reside in an unmapped portion of local SDRAM
No ITLB access
Shares the instruction cache with application thread(s)
The fetcher turns off PPCV after the last handler instruction is fetched
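A minimal sketch of the fetch-steering decision described above (the function and argument names are assumptions; only the PPCV behavior and the no-ITLB property come from the slides):

    #include <stdint.h>

    /* Pick the next fetch PC.  When PPCV is set, fetch from the protocol
       program counter, which is a physical address in unmapped local SDRAM,
       so no ITLB lookup is needed; PPCV is cleared once the last handler
       instruction has been fetched. */
    uint64_t next_fetch_pc(uint64_t *ppc, int *ppcv,
                           int last_handler_inst, uint64_t app_pc) {
        if (*ppcv) {
            uint64_t pc = *ppc;
            if (last_handler_inst)
                *ppcv = 0;      /* turn off PPCV after the last handler fetch */
            return pc;
        }
        return app_pc;          /* otherwise fetch from an application thread */
    }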

18 Handling protocol load/store
No DTLB access
Shares the L1 data and L2 caches
An L2 cache miss from the protocol thread behaves differently
 Bypasses the Local Miss Interface
 Talks to local SDRAM directly
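A minimal sketch of this routing decision (the enum and function names are assumptions):

    /* Route an L2 miss: protocol-thread misses bypass the Local Miss
       Interface and go straight to local SDRAM, since protocol code and
       data live in unmapped local memory; application misses go through
       the Local Miss Interface and may dispatch a protocol handler. */
    enum miss_path { VIA_LOCAL_MISS_INTERFACE, DIRECT_TO_LOCAL_SDRAM };

    enum miss_path route_l2_miss(int from_protocol_thread) {
        return from_protocol_thread ? DIRECT_TO_LOCAL_SDRAM
                                    : VIA_LOCAL_MISS_INTERFACE;
    }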

19 Outline
Introducing SMTp
Basic extensions for SMTp
 Deadlock avoidance
Evaluation methodology
Simulation results
Related work
Conclusions

20 Deadlock with shared resources
Progress of an application L2 miss depends on progress of the protocol thread
Resources involved: front-end queue slots, branch stack space, integer registers, integer queue slots, LSQ slots, speculative store buffers, MSHRs, and cache indices
[Diagram: a LOAD at the ROB retire pointer suffers an L2 miss and waits on the local miss handler, but the protocol instructions are BLOCKED because the integer queue (IQ) is full]

21 Solving resource deadlock
General solution: one reserved instance
 Out of 8 decode queue slots, application threads get 7, while all 8 are open to the protocol thread (see the sketch below)
Easier solution: Pentium 4-style static resource partitioning
Cache index conflicts
 Solution: L1 and L2 bypass buffers (fully associative, LRU)
 Allocate a bypass buffer entry instead
 Parallel lookup: hit latency unchanged
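A minimal C sketch of the one-reserved-instance check for the decode-queue example (illustrative function; only the 7-of-8 policy comes from the slide):

    #define DECODE_QUEUE_SLOTS 8

    /* Application threads may occupy at most 7 of the 8 decode queue slots;
       the protocol thread may use all 8, so it can always make progress and
       eventually free the resources the application miss is waiting on. */
    int may_allocate_decode_slot(int slots_in_use, int is_protocol_thread) {
        int limit = is_protocol_thread ? DECODE_QUEUE_SLOTS
                                       : DECODE_QUEUE_SLOTS - 1;
        return slots_in_use < limit;
    }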

22 SMTp: deadlock solution
[Block diagram from slide 7, with the deadlock-avoidance additions highlighted: the instruction, data, and L2 bypass buffers (16x64B, 16x32B, 16x128B) and the protocol-thread registers PPCV, LA, and LDCTXT_ID]

23 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
 Evaluation methodology
Simulation results
Related work
Conclusions

24 Evaluation methodology
Applications
 SPLASH-2: FFT, LU, Radix, Ocean, Water
 FFTW
Simulated machine model (details in the paper)
 2 GHz, 9 pipe stages
 1, 2, or 4 app. threads + one protocol context
 ROB: 128 (per thread)
 Integer/floating-point registers: 160/192/256
 L1 I-cache: 32 KB/64B/2-way/LRU/1 cycle
 L1 D-cache: 32 KB/32B/2-way/LRU/1 cycle
 Unified L2: 2 MB/128B/8-way/LRU/9 cycles

25 Simulated machine models

Model       MC        PP       MC, PP frequency   Protocol D$
Base        Non-int.  2-issue  400 MHz            512 KB DM
IntPerfect  Int.      2-issue  Proc. core         Perfect
Int512KB    Int.      2-issue  ½ core             512 KB DM
Int64KB     Int.      2-issue  ½ core             64 KB DM
SMTp        Int.      None     ½ core             None

(MC: memory controller; PP: protocol processor; Int./Non-int.: integrated/non-integrated; DM: direct-mapped)

26 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
 Simulation results
Related work
Conclusions

27 Single node (1app,1prot) results

28 Single node (2app,1prot) results

29 Single node results: summary
Memory controller integration helps
 Ocean and FFTW get the maximum benefit
 LU and Water are largely insensitive
SMTp is always faster than Base
SMTp performs on par with Int512KB
 In a few cases Int512KB outperforms SMTp, by at most 1.6%
Int64KB suffers from directory cache misses
 FFTW and Radix-Sort are the most sensitive

30 32-node (1app,1prot) results

31 32-node (2app,1prot) results

32 Multi-node results: summary
With increasing system size, the integrated models converge in performance
IntPerfect gets a slight edge due to its double memory controller speed
SMTp continues to deliver excellent performance
The gap between Int512KB and SMTp is at most 6%, and zero on average

33 Resource occupancy: summary
The protocol thread is active for a very small fraction of time (low protocol occupancy)
When active, it can have high peak resource occupancy
When idle, all resources are freed except
 31 mapped registers
 2 LSQ slots holding switch and ldctxt
Overall, the protocol thread has very low pipeline overhead

34 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
 Related work
Conclusions

35 Related work
Simultaneous multi-threading
 Assisted execution [HPCA’01][MICRO’01][ISCA’02]
 Fault tolerance [ASPLOS’00][ISCA’02]
 User-level message passing [MTEAC’01]
Programmable protocol engines
 Customized co-processor (FLASH, S3.mp, STiNG, Piranha)
 Commodity off-the-shelf processor (Typhoon)
 On the main processor through a low-overhead interrupt (Chalmers) [ISCA’95]

36 Outline
Introducing SMTp
Basic extensions for SMTp
Deadlock avoidance
Evaluation methodology
Simulation results
Related work
 Conclusions

37 Conclusions
First design to exploit SMT to run a directory-based coherence protocol on spare thread contexts
Delivers performance within 6% (0% on average) of integrated coherence controllers with large (512 KB) stand-alone directory data caches
Extremely low pipeline overhead
SMTp provides an opportunity to build scalable directory-based DSMs with minor changes to commodity nodes

38 Future directions
Need not be restricted to building DSMs out of commodity nodes only
Use SMTp to carry out
 On-the-fly compression/encryption of L2 cache lines
 Software-controlled address remapping to improve locality of cache access
 Fault tolerance by selectively extending coherence protocols
Alternate CMP design
Issues with multiple protocol threads

39 SMTp: An Architecture for Next-generation Scalable Multi-threading
Mainak Chaudhuri, Computer Systems Laboratory, Cornell University
Mark Heinrich, School of Computer Science, University of Central Florida

40 Protocol occupancy: 16 nodes, (1a,1p) threads per node

41 Protocol thread characteristics: 16 nodes, (1a,1p) threads per node

