1 June 30th, 2006, ICS’06 -- Håkan Zeffer
TMA: A Trap-Based Memory Architecture
Håkan Zeffer, Zoran Radovic, Martin Karlsson, Erik Hagersten
Uppsala University, Sweden

2 Simultaneous Multithreading (SMT)
- Diminishing performance from ILP
- Increased chip parallelism from hardware threading (TLP)
- IBM Power5, Intel Pentium4, Sun T1 (Niagara)
- “No processor should come without multiple threads” [Dr. Tremblay]
[Figure: SMT pipeline with fetch unit; decode, rename etc.; integer, floating-point, memory and branch pipes; L1I and L1D caches]

3 Chip Multiprocessors (CMPs)
- Chip Multiprocessors (CMPs)
- Piranha, IBM Power4, IBM Power5, Sun UltraSPARC IV+, Sun T1, Intel Duo, AMD Dual-Core Opteron
[Figure: four processors (P), each with I and D caches, an interconnect and an L2]

4 Multi-CMP Systems
- Larger systems sometimes built from multiple CMPs
- Piranha, IBM Power4 and IBM Power5
[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP contains four processors (P) with I and D caches, an interconnect and an L2]

5 Multi-CMP Coherence
- Intra-CMP protocol for coherence within CMP
- Inter-CMP protocol for coherence between CMPs
- Interactions between protocols increase complexity
[Figure: CMP 1 to CMP 4 joined by an interconnect, labeled "Inter-CMP Coherence" and "Intra-CMP Coherence"]

6 Shared-Memory Trends
- Today’s chips = yesterday’s mid-range servers
- Sun T1 has 32 hardware threads on a single die
- Is it worth implementing multi-CMP systems?
- Increased development cost
- Increased verification cost
- How big is the market?

7 Trap-Based Memory Architectures
- TMA: Trap-based Memory Architecture
- Basic idea
- Optimize for commercial single-chip performance
- Let simple HW and SW support enable scalability
- Coherence violation detection in hardware
- Trap on inter-chip coherence violations
- Solve inter-chip coherence misses in software

8 Outline
- Introduction
- TMA and TMA Lite
- Evaluation methodology
- Results
- Related work
- Future work
- Conclusions

9 TMA Lite
- TMA Lite is a “minimal” TMA implementation
- Runtime system
  - Deadlock avoidance
  - Coherence protocol
- Per application “scalability”
- Binary transparency
- No memory system modifications
- Simple processor core modifications
- An inter-node load coherence check
- An inter-node store coherence check

10 A TMA Lite System
- TMA Lite nodes
- Single-chip system
  - Load and store coherence check support
- HW maintains intra-chip coherence
- TMA Lite cluster network
- “InfiniBand like”
- High-bandwidth
- Low-latency
- Remote memory access (put, get and atomic)
- TMA Lite software
- Coherence and consistency between nodes

11 The Load Check
- Magic value convention
- Each cache line in state invalid contains a predefined value
- Hardware
- Comparator at the load path detects this value
- Trap generated when the value is found
- False misses
- When the magic value is used within an application
- Easy to detect and solve within the coherence protocol
- Rare
[Figure: the magic value register is compared (=?) with the load data; the result is ANDed with "load check enabled?" to produce "load trap?". Annotation: controlled by system software]
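The load check on this slide can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the authors' implementation: the magic value `0xDEADBEEF`, the names `Node`, `hw_load`, `load` and `fetch_remote`, and the per-address `sw_valid` set used to recognize false misses are all hypothetical, and the model works on single words rather than whole cache lines.

```python
MAGIC = 0xDEADBEEF  # hypothetical magic value; the slides do not specify one


class LoadTrap(Exception):
    """Raised where the hardware would take a load trap."""


class Node:
    def __init__(self, magic=MAGIC, load_check_enabled=True):
        self.magic = magic                      # models the magic value register
        self.load_check_enabled = load_check_enabled
        self.memory = {}                        # address -> word
        self.sw_valid = set()                   # assumed software-side record of valid addresses

    def invalidate(self, addr):
        # Invalid data is represented by writing the magic value.
        self.memory[addr] = self.magic
        self.sw_valid.discard(addr)

    def hw_load(self, addr):
        # Comparator on the load path: trap if enabled and data == magic value.
        data = self.memory[addr]
        if self.load_check_enabled and data == self.magic:
            raise LoadTrap(addr)
        return data


def load(node, addr, fetch_remote):
    """Load with a software trap handler; returns (value, kind)."""
    try:
        return node.hw_load(addr), "hit"
    except LoadTrap:
        if addr in node.sw_valid:
            # False miss: the application really stored the magic value.
            return node.magic, "false miss"
        data = fetch_remote(addr)               # assumed remote "get" operation
        node.memory[addr] = data
        node.sw_valid.add(addr)
        return data, "true miss"
```

In this model a true miss costs one remote fetch, while a false miss is resolved locally by consulting the software state, which is one plausible reading of "easy to detect and solve within the coherence protocol".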

12 The Store Check
- Write permission cache (WPC)
- Can be seen as a very small cache
- Operates on virtual addresses
- Accessed in parallel with the data TLB
- Write permission for lines in the WPC guaranteed by protocol
- The write permission cache has to be filled
- A fill occurs at all WPC misses
- Even if the node already has write permission
- Overhead often severe
[Figure: store pipeline with data TLB, WPC and data L1. Stages: address generation; TLB access and WPC access; start L1 access; tag compare; TLB trap? and WPC trap?; end L1 access]
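The store check can be sketched in the same spirit. The model below is only an illustration under assumptions: the LRU replacement policy, the 64-byte line size, and the names `WritePermissionCache`, `store` and `acquire_write_permission` are hypothetical; the 16-entry size is taken from the evaluation slides.

```python
from collections import OrderedDict


class WritePermissionCache:
    """Tiny cache of line numbers the node may write without trapping."""

    def __init__(self, entries=16):
        self.entries = entries
        self.lines = OrderedDict()   # line number -> True, kept in LRU order (assumed policy)

    def hit(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        return False

    def fill(self, line):
        self.lines[line] = True
        self.lines.move_to_end(line)
        while len(self.lines) > self.entries:
            self.lines.popitem(last=False)   # evict least recently used entry

    def invalidate(self, line):
        # The protocol drops an entry when write permission is lost.
        self.lines.pop(line, None)


def store(wpc, vaddr, acquire_write_permission, line_size=64):
    """Model one store: 'hit' if the WPC covers the line, else 'trap' and fill."""
    line = vaddr // line_size
    if wpc.hit(line):
        return "hit"
    # WPC miss: trap to software. The fill happens on every miss, even if the
    # node already holds write permission for the line.
    acquire_write_permission(line)
    wpc.fill(line)
    return "trap"
```

With only 16 entries, a store stream that touches more than 16 distinct lines keeps evicting and refilling entries, which is consistent with the slide's remark that the fill overhead is often severe.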

13 Simulator and Benchmarks
- Simics: full-system simulator
- Vasa: timing- and memory-model extension
- Cycle accurate
- Power5 like SMT processor model
- Latency and bandwidth of caches, memory and network
- SPLASH-2 benchmarks

14 System Parameters
- Scaled down Power5 chip
- 1 or 2 processor cores per chip
- 2 SMT threads per processor core
- Write through L1
- Write back L2 and L3
  - L2 on-die, L3 tags on-die
- The HW distributed shared memory system
- Directory: fully mapped bit vector, dedicated SRAM
- Coherence protocol: HW, highly optimized, non-blocking
- The TMA Lite system
- Directory: fully mapped bit vector, in ordinary DRAM memory
- Coherence protocol: SW
  - Binary patch to Solaris modifies the trap vector
  - Coherence protocol runs on the hardware thread that caused the miss
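Both systems use a fully mapped bit-vector directory, which keeps one presence bit per node for every coherence unit. The sketch below shows the general data structure only; the class and method names are hypothetical, and nothing here reflects the actual protocol states or directory layout used in the evaluated systems.

```python
class Directory:
    """Fully mapped bit vector: one presence bit per node for each line."""

    def __init__(self, num_nodes=4):
        self.num_nodes = num_nodes
        self.sharers = {}            # line number -> integer bitmask of sharing nodes

    def add_sharer(self, line, node):
        self.sharers[line] = self.sharers.get(line, 0) | (1 << node)

    def sharer_list(self, line):
        mask = self.sharers.get(line, 0)
        return [n for n in range(self.num_nodes) if (mask >> n) & 1]

    def make_exclusive(self, line, node):
        """Give `node` sole ownership; return the nodes that must be invalidated."""
        to_invalidate = [n for n in self.sharer_list(line) if n != node]
        self.sharers[line] = 1 << node
        return to_invalidate
```

For example, if nodes 0 and 2 share a line and node 2 requests write permission, `make_exclusive` reports that node 0 must be invalidated and leaves node 2 as the only sharer.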

15 Execution Time Breakdown
Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16 entry WPC.

16 Coherence Protocol Breakdown

17 SW Flexibility: Coherence Unit Size
Execution time is normalized to the HW DSM. 4 nodes, load comparator + 16 entry WPC.

18 Related Work
- SW only
- Page-based systems
  - IVY, Munin, Cashmere, GeNIMA, TreadMarks + many more
  - Virtual memory used for coherence detection
- Fine-grained systems
  - Shasta, Blizzard, Sirocco, DSZOOM
  - Coherence checks instrumented into applications
- HW support + software protocol
- FLASH, Typhoon, S3.mp
  - Coherence processor executes the coherence protocol
- SMTp
  - SMT thread executes the coherence protocol

19 Future Work
- More mature TMA implementations
- Coherence detection on physical addresses
- System (instead of application) scalability
- (Proceedings figure text error: Internet pdf is OK!)
- One proposal is already available as a tech. report
- Available at:
- New coherence detection scheme
  - No “false” load or store coherence misses
- A new way to decouple inter- and intra-chip coherence
- In DRAM memory remote access caching
- Commercial applications
- Many more experiments
- Very promising results

20 Conclusions
- Shared memory trends
- SMT and CMP
- Mid-range servers on a single chip
- Trap-based Memory Architecture
- Design for commercial single chip performance
- Simple and small HW structures for scalable shared memory
- TMA Lite
- “Minimal” TMA implementation
- Competitive with HW DSM when flexibility is used
- Promising for HPC when runtime system is under control
- Given the right HW/SW tradeoff, simple and efficient scalable shared memory is possible
- More mature TMA arch. in next paper (the tech. report)

21 Questions?

22 The Coherence Protocol

