1 UltraSparc IV Tolga TOLGAY

2 OUTLINE Introduction, History, What is new?, Chip Multithreading, Pipeline, Cache, Branch Prediction, Conclusion

3 INTRODUCTION
- Sparc = Scalable Processor Architecture
- An open processor architecture from Sun
- UltraSparc v9: RISC architecture, 64-bit addresses and data, superscalar

4 HISTORY
- Begin developing Sparc – 1984
- First Sparc processor – 1986
- SuperSparc – 1992
- UltraSparc I – 1995
- UltraSparc II – 1997
- UltraSparc III – 2001
- UltraSparc IV – 2004
- UltraSparc IV+ – 2005
- UltraSparc T1 – 2005

5 WHAT IS NEW? What UltraSparc IV offers that is new:
- CMT (Chip Multithreading)
- New registers added for the CMT enhancement; the MCU registers and Sun Fireplane Interconnect registers are shared
- Enhancements to the Floating Point Unit
- 16 MB L2 cache with 128-byte line size, shared by the two cores
- The L2 cache uses an LRU replacement strategy
- New write-cache index-hashing feature

6 Chip Multithreading (CMT)
- Two UltraSparc III cores on one die
- The two mirrored cores share: the system bus, the DRAM controller, the off-die L2 cache, and the Fireplane registers
- Also called Chip Multiprocessing

7 Chip Multithreading

8 The aim is to increase performance without increasing the clock speed. Mirroring the cores creates a hot spot around the floating-point units. How the hot spot is avoided: heat towers in the copper interconnect.

9 Chip Multithreading

10 Core More core improvements:
- Improved instruction fetch and store bandwidth
- Improved data prefetching
- The FPU can handle more unexpected and underflow cases, reducing exceptions
- The on-die write cache is enhanced with a hashed index to better handle multiple writes

11 Pipeline
- Because UltraSparc IV contains two UltraSparc III cores, it uses the same pipeline
- 4-way superscalar architecture
- 14-stage pipeline

12 Pipeline Stages

13 Pipeline Stages
Stage   Definition
A       Address Generation
P       Preliminary Fetch
F       Fetch Instructions from I-Cache
B       Branch Target Computation
I       Instruction Group Formation
J       Grouping
R       Register Access
E       Execute
C       Cache
M       Miss Detect
W       Write
X       Extend
T       Trap
D       Done
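
To make the stage order concrete, here is a minimal Python sketch that walks one instruction through the fourteen stages in the table above, one stage per cycle. It ignores stalls, grouping, and superscalar issue, so it illustrates only the stage sequence, not the real machine's timing; the example instruction is hypothetical.

    # Illustrative only: one instruction, one stage per cycle, no stalls.
    STAGES = [
        ("A", "Address Generation"),  ("P", "Preliminary Fetch"),
        ("F", "Fetch from I-Cache"),  ("B", "Branch Target Computation"),
        ("I", "Instruction Group Formation"), ("J", "Grouping"),
        ("R", "Register Access"),     ("E", "Execute"),
        ("C", "Cache"),               ("M", "Miss Detect"),
        ("W", "Write"),               ("X", "Extend"),
        ("T", "Trap"),                ("D", "Done"),
    ]

    def trace(instruction):
        for cycle, (letter, name) in enumerate(STAGES, start=1):
            print(f"cycle {cycle:2d}: {instruction} in stage {letter} ({name})")

    trace("add %o1, %o2, %o3")   # hypothetical SPARC add instruction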

14 Pipeline Stages

15 Stage A : Address Generation
- Generates and selects the fetch address
- The address can be selected from several sources
Stage P : Preliminary Fetch
- Starts fetching from the I-Cache
- Accesses the branch predictor
Stage F : Fetch
- Second half of the I-Cache access
- At the end of the stage, up to 4 instructions may be latched
Stage B : Branch Target Computation
- Analyzes the instructions
- Calculates the branch target address

16 Pipeline Stages
Stage I : Instruction Group Formation
- Instructions are grouped into the instruction queue
Stage J : Instruction Group Staging
- A group of instructions is dequeued and sent to the R-stage
Stage R : Dispatch and Register Access
- Dependencies are calculated and resolved

17 Pipeline Stages
Stage E : Integer Instruction Execution
- First stage of the execution pipelines
- Integer instructions -> A0 and A1 pipelines
- Branch instructions -> branch pipeline
- Other instructions -> MS pipeline
Stage C : Cache
- Integer pipelines write their results back
- SIU results are produced
- First stage for floating-point instructions

18 Pipeline Stages
Stage M : Miss
- Data cache misses are determined
- Second stage for floating-point instructions
Stage W : Write
- MS pipeline results are written
- Third stage for floating-point instructions
- D-cache miss requests are sent to the L2 cache
Stage X : Extend
- Final stage for floating-point instructions
- Results from floating-point instructions are ready for bypass

19 Pipeline Stages
Stage T : Trap
- Traps are signalled
- After a trap, instructions invalidate their results
Stage D : Done
- Integer results are written into the architectural register file
- Floating-point results are written into the floating-point register file
- Results become visible to any traps generated by younger instructions

20 Pipeline Rules Grouping rules:
- A group is a collection of instructions that do not limit each other from being executed in parallel
- Groups are formed before the R-stage
- Grouping is needed because the execution order must be maintained, each pipeline runs only a subset of instructions, and some instructions require helpers
- Execution order: in-order execution (see the grouping sketch below)
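
The sketch below is a rough, hypothetical Python model of those rules: at most four instructions per group, each execution pipe (A0, A1, BR, MS) used once per group, and a group ends when an instruction reads a register written earlier in the same group. The instruction classes, register names, and break conditions are assumptions for illustration; the real grouping logic applies many more rules.

    # Hypothetical sketch of dispatch grouping: at most 4 instructions per group,
    # each execution pipe used once per group, and a group ends when an
    # instruction reads a register written earlier in the same group
    # (in-order dispatch, no intra-group forwarding assumed).
    PIPES = {"int": ["A0", "A1"], "branch": ["BR"], "other": ["MS"]}

    def form_groups(instructions):
        groups, current, used_pipes, written = [], [], set(), set()
        for kind, dests, srcs in instructions:
            free = [p for p in PIPES[kind] if p not in used_pipes]
            depends = any(s in written for s in srcs)
            if current and (len(current) == 4 or not free or depends):
                groups.append(current)
                current, used_pipes, written = [], set(), set()
                free = list(PIPES[kind])
            current.append((kind, dests, srcs))
            used_pipes.add(free[0])
            written.update(dests)
        if current:
            groups.append(current)
        return groups

    # Each entry: (instruction class, destination registers, source registers).
    program = [("int", {"r1"}, {"r2"}), ("int", {"r3"}, {"r4"}),
               ("int", {"r5"}, {"r1"}),           # reads r1 -> starts a new group
               ("branch", set(), {"r5"}), ("other", {"r6"}, set())]
    for group in form_groups(program):
        print([kind for kind, _, _ in group])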

21 Cache Organization
- Doubled cache size because of the dual core
- Data cache: 64 KB x 2
- Instruction cache: 32 KB x 2
- L2 cache: 16 MB, off-chip, shared
- No L3 cache

22 Cache Organization

23 Data Cache
- 64 KB Level 1 cache per core
Instruction Cache
- 32 KB Level 1 cache per core
- 4-way associative (address split sketched below)
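
As a rough illustration of what set associativity means for the address bits, the sketch below splits an address into offset, index, and tag for these L1 sizes; the 32-byte line size and treating both caches as 4-way are assumptions made here for illustration, not figures stated on this slide.

    # Illustrative address breakdown for a set-associative cache.
    # The 32-byte line size is an assumption, not taken from the presentation.
    def cache_geometry(size_bytes, ways, line_bytes=32):
        sets = size_bytes // (ways * line_bytes)
        offset_bits = line_bytes.bit_length() - 1   # select the byte in a line
        index_bits = sets.bit_length() - 1          # select the set
        return sets, offset_bits, index_bits

    for name, size, ways in [("D-cache", 64 * 1024, 4), ("I-cache", 32 * 1024, 4)]:
        sets, off, idx = cache_geometry(size, ways)
        print(f"{name}: {sets} sets, {off} offset bits, {idx} index bits, "
              f"tag = remaining upper address bits")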

24 Cache Organization Prefetch Cache
- One of the L1 caches
- 2 KB SRAM: 32 x 64 bytes
- Uses an LRU replacement algorithm (sketched below)
- The aim is to fetch data before it is needed
- Reduces main-memory access latency
- Two ports read 8 bytes each and one port writes 16 bytes per cycle
- Hardware prefetch
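
A minimal sketch of the LRU replacement idea, modelling the prefetch cache as 32 lines of 64 bytes; treating it as fully associative is an assumption made for brevity, since the slide does not state its associativity.

    from collections import OrderedDict

    # Simplified model of a small LRU-managed cache: 32 lines of 64 bytes,
    # fully associative by assumption.
    class LRUCache:
        def __init__(self, lines=32, line_bytes=64):
            self.lines, self.line_bytes = lines, line_bytes
            self.data = OrderedDict()          # line address -> payload

        def access(self, addr):
            line = addr // self.line_bytes
            if line in self.data:              # hit: mark most recently used
                self.data.move_to_end(line)
                return True
            if len(self.data) >= self.lines:   # miss: evict least recently used
                self.data.popitem(last=False)
            self.data[line] = None             # payload not modeled
            return False

    pc = LRUCache()
    print([pc.access(a) for a in (0, 64, 0, 128)])   # [False, False, True, False]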

25 Cache Organization Write Cache
- Reduces the bandwidth consumed by store traffic
- 2 KB cache
- Handles multiprocessor and on-chip cache consistency
- Improves error recovery
- Optionally uses a hashed index (illustrated below)
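
The hashed-index idea can be illustrated as follows: instead of taking the set index straight from the address, XOR it with a slice of the upper address bits so that regularly strided stores spread over more sets. The line size, set count, and the XOR hash itself are assumptions for illustration; the presentation does not give the actual hash function.

    # Purely illustrative index hashing for a small write cache.
    LINE_BITS = 6      # assumed 64-byte write-cache lines
    INDEX_BITS = 5     # assumed 32 sets (2 KB / 64 B)

    def plain_index(addr):
        return (addr >> LINE_BITS) & ((1 << INDEX_BITS) - 1)

    def hashed_index(addr):
        upper = (addr >> (LINE_BITS + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
        return plain_index(addr) ^ upper

    # Stores with a 2 KB stride all collide on set 0 with the plain index,
    # but land in different sets once the index is hashed.
    addrs = [i * 2048 for i in range(4)]
    print([plain_index(a) for a in addrs])    # [0, 0, 0, 0]
    print([hashed_index(a) for a in addrs])   # [0, 1, 2, 3]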

26 Cache Organization L2 Cache
- 16 MB SRAM shared by the two cores
- Separate L2 cache tags
- Two-way set associative with an LRU replacement policy (sketched below)
- 128-byte line size
- UltraSparc IV+ has an on-die Level 2 cache with an off-die Level 3 cache
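
A small sketch of LRU replacement in a 2-way set-associative cache like this L2: a single bit per set is enough to track which of the two ways was least recently used. The per-set model below is illustrative; the tags are toy values, and only the slide's geometry figures are used in the comment.

    # Sketch of LRU in one 2-way set. Geometry from the slide:
    # 16 MB / (2 ways * 128 B lines) = 65536 sets, so one LRU bit per set suffices.
    class TwoWaySet:
        def __init__(self):
            self.tags = [None, None]
            self.lru = 0                      # index of the least recently used way

        def access(self, tag):
            if tag in self.tags:              # hit: the other way becomes LRU
                self.lru = 1 - self.tags.index(tag)
                return True
            victim = self.lru                 # miss: replace the LRU way
            self.tags[victim] = tag
            self.lru = 1 - victim
            return False

    s = TwoWaySet()
    print([s.access(t) for t in ("A", "B", "A", "C", "B")])
    # [False, False, True, False, False] -- "C" evicts "B", the LRU way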

27 Branch Prediction Branch Predictor:
- Small, single-cycle-access SRAM
- Its output feeds the P-stage
- The branch determination is made in the B-stage
- On a misprediction, fetch returns to the A-stage (a generic predictor sketch follows)
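
For illustration, the sketch below models a predictor of the kind the slide describes: a small table read in a single cycle and trained by branch outcomes. It uses a generic 2-bit saturating-counter (bimodal) scheme with an assumed table size; the actual UltraSparc predictor organization is not given here.

    # Generic 2-bit saturating-counter branch predictor in a small table,
    # indexed by low PC bits; shown only to illustrate the predict-in-P,
    # verify-in-B, refetch-from-A flow described on the slide.
    class BimodalPredictor:
        def __init__(self, entries=2048):
            self.entries = entries
            self.counters = [1] * entries      # 0..3, start weakly not-taken

        def _index(self, pc):
            return (pc >> 2) % self.entries    # assume 4-byte instructions

        def predict(self, pc):                 # P-stage: read the table
            return self.counters[self._index(pc)] >= 2

        def update(self, pc, taken):           # B-stage outcome trains the table
            i = self._index(pc)
            self.counters[i] = min(3, self.counters[i] + 1) if taken \
                else max(0, self.counters[i] - 1)

    bp = BimodalPredictor()
    pc = 0x4000
    for outcome in (True, True, False, True):
        guess = bp.predict(pc)
        print(f"predicted {'taken' if guess else 'not taken'}, "
              f"actual {'taken' if outcome else 'not taken'}")
        bp.update(pc, outcome)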

28 Conclusion
- UltraSparc IV is a milestone: it is the first dual-core chip in the UltraSparc family
- Sun continues to develop UltraSparc: UltraSparc IV+, UltraSparc T1

29 References
- UltraSparc IV User's Manual, Sun Microsystems
- UltraSparc IV Whitepaper, Sun Microsystems
- "UltraSparc IV Mirrors Predecessor", Kevin Krewell
- "Implementation and Productization of a 4th Generation 1.8GHz Dual-Core SPARC V9 Microprocessor", Anand Dixit, Jason Hart, et al.
- UltraSparc III User's Manual, Sun Microsystems

30 References Web sites:
- http://web.cs.unlv.edu/cs219/group3/index.html
- http://bwrc.eecs.berkeley.edu/CIC/archive/cpu_history.html#SPARC
- http://www.arcade-eu.org/overview/2005/sparcIV.html
- http://www.top500.org/orsc/2006/sparcIV.htm
- http://www.sparc.org/history.html

31 Questions...

