Alpha 21164 Microprocessor: The World’s Highest Performance Microprocessor
Jerry Huang (Zhihui Huang), EECS 598, February 11, 1997

Presentation transcript:

Slide #1: Alpha 21164 Microprocessor
The World’s Highest Performance Microprocessor
Zhihui Huang (Jerry), University of Michigan
EECS 598, February 11, 1997

Slide #2: Historical Perspective
• CISC and Digital VAX (~1980)
• Serious exploration of RISC at Digital (1982)
• Fragmented efforts on RISC (1983~1984)
  – SAFE, HR-32, CASCADE projects
• First draft of the PRISM architecture (1985)
• Cancellation of PRISM (1988)
• First RISC workstation based on MIPS R2000 (1989)
• Rename PRISM to Alpha (1990)
• First generation Alpha (1992)
• Second generation Alpha (1994)

Slide #3: Alpha Microprocessor Roadmap
[Figure: roadmap of Alpha processor generations by clock frequency in MHz, including the 21064A, with a "Here we are" marker; the numeric values did not survive in the transcript.]

Slide #4: The Architecture
• 64-bit load and store RISC architecture
• Byte addressable
  – 43-bit virtual address, 40-bit physical address
• Integer types: byte, word, longword, quadword
• Floating-point data types
  – Longword integer format in floating-point unit
  – Quadword integer format in floating-point unit
  – IEEE and VAX floating-point formats
• CALL_PAL instruction
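The address widths on this slide translate directly into address-space sizes. A minimal sketch of that arithmetic follows; the function and variable names are illustrative and not taken from the slides.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 1 << address_bits


TIB = 1 << 40  # one tebibyte, in bytes

virtual_bytes = address_space_bytes(43)   # 43-bit virtual address
physical_bytes = address_space_bytes(40)  # 40-bit physical address

print(virtual_bytes // TIB, "TiB of virtual address space")
print(physical_bytes // TIB, "TiB of physical address space")
```

Because the machine is byte addressable, each of these addresses names a single byte, so the implemented virtual space is eight times the implemented physical space.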

Slide #5: Characteristics
• … µm CMOS technology
  – 4 layers of metallization
  – 9.66 million transistors
  – 14.4 mm x 14.5 mm die size (209 mm²)
• Package and power
  – 499-pin PGA, 291 signal pins
  – 3.3 V external, 2.2 V internal
• Clock frequency 300 MHz ~ 500 MHz
  – SPECint … ~ 15.4, respectively
  – SPECfp … ~ 21.1, respectively

Slide #6: On-chip Cache Organization
• An on-chip, 8 KB primary instruction cache
  – direct mapped, 32-byte block (4 instructions)
  – virtual, 7-bit ASN (MAX_ASN = 127), 1-bit PALcode
• An on-chip, 8 KB primary data cache
  – dual-read-ported, single-write-ported
  – virtually indexed, physically tagged
  – write-through, read-allocate, direct mapped, 32-byte block
• Large on-chip L2 cache
  – 96 KB, 3-way set associative, physical
  – write-back, write-allocate, byte-accessible
  – 32-byte (256-bit) or 64-byte (512-bit) block
  – mixed data and instruction cache
  – pipelined (16 bytes per CPU cycle)
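The sizes, block lengths, and associativities listed on this slide determine how an address splits into tag, index, and block offset. The sketch below works that out; the class and field names are illustrative, and it assumes the conventional power-of-two layout (offset in the low bits, index above it), which the slides do not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    size_bytes: int
    block_bytes: int
    ways: int  # 1 means direct mapped

    @property
    def sets(self) -> int:
        return self.size_bytes // (self.block_bytes * self.ways)

    @property
    def offset_bits(self) -> int:
        return (self.block_bytes - 1).bit_length()

    @property
    def index_bits(self) -> int:
        return (self.sets - 1).bit_length()

    def split(self, addr: int):
        """Return (tag, set index, byte offset within the block) for an address."""
        offset = addr & (self.block_bytes - 1)
        index = (addr >> self.offset_bits) & (self.sets - 1)
        tag = addr >> (self.offset_bits + self.index_bits)
        return tag, index, offset


l1 = CacheGeometry(size_bytes=8 * 1024, block_bytes=32, ways=1)
l2_64 = CacheGeometry(size_bytes=96 * 1024, block_bytes=64, ways=3)
l2_32 = CacheGeometry(size_bytes=96 * 1024, block_bytes=32, ways=3)

print(l1.sets, l1.offset_bits, l1.index_bits)  # 256 sets, 5 offset bits, 8 index bits
print(l2_64.sets, l2_32.sets)                  # 512 and 1024 sets
print(l1.split(0x12345))                       # (9, 26, 5)
```

For the 8 KB direct-mapped primary caches, index plus offset come to 13 bits, which equals the offset within the smallest (8 KB) page. That is consistent with the data cache being virtually indexed yet physically tagged: the index bits are the same in the virtual and the physical address.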

Slide #7: TLB Organization
• Instruction Translation Buffer
  – 48-entry, fully associative
  – not-last-used replacement algorithm
  – 8 KB to 4 MB page
  – 2 superpages, only in privileged mode
• Data Translation Buffer
  – 64-entry, fully associative
  – dual-read-ported
  – not-last-used replacement algorithm
  – superpage
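Both translation buffers use a not-last-used replacement algorithm: the one entry guaranteed to survive a replacement is the entry touched most recently. The sketch below illustrates the idea for a small fully associative buffer. The class name, the preference for empty slots, and the random choice among the remaining candidates are assumptions made for illustration; the slides do not say how the hardware picks among the entries that are not last used.

```python
import random


class NotLastUsedTLB:
    """Fully associative translation buffer with not-last-used replacement."""

    def __init__(self, entries: int, seed: int = 0):
        if entries < 2:
            raise ValueError("not-last-used needs at least two entries")
        self.slots = [None] * entries  # each slot holds (vpn, pfn) or None
        self.last_used = None          # index of the most recently used slot
        self._rng = random.Random(seed)

    def lookup(self, vpn: int):
        """Return the page frame number on a hit, or None on a miss."""
        for i, entry in enumerate(self.slots):
            if entry is not None and entry[0] == vpn:
                self.last_used = i
                return entry[1]
        return None

    def insert(self, vpn: int, pfn: int) -> int:
        """Install a translation and return the slot that was overwritten."""
        empty = [i for i, entry in enumerate(self.slots) if entry is None]
        if empty:
            victim = empty[0]
        else:
            # Any slot except the last-used one may be replaced (assumed: at random).
            candidates = [i for i in range(len(self.slots)) if i != self.last_used]
            victim = self._rng.choice(candidates)
        self.slots[victim] = (vpn, pfn)
        self.last_used = victim
        return victim


tlb = NotLastUsedTLB(entries=4)
for vpn in range(4):
    tlb.insert(vpn, vpn + 100)

tlb.lookup(2)              # touch page 2, making its slot the last-used one
victim = tlb.insert(9, 109)
print(victim != 2, tlb.lookup(2))
```

The policy needs only one pointer of state per buffer, which makes it much cheaper than true LRU over 48 or 64 entries.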

Slide #8: External Interface

Slide #9: Alpha Block Diagram

Slide #10: Microarchitecture Function Units
• Instruction fetch and decode unit (Ibox)
  – Instruction prefetcher and instruction decoder
  – Branch prediction
  – Instruction translation buffer (ITB)
  – Interrupt support
• Integer execution unit (Ebox)
• Floating-point execution unit (Fbox)
• Memory address translation unit (Mbox)
  – Data Translation Buffer (DTB)
  – Miss Address File (MAF)
  – Write Buffer
• Cache control and bus interface unit (Cbox)

Slide #11: Instruction Issue Pipeline Organization
[Figure: stages S0 (Instruction Cache), S1 (Instruction Buffer), S2 (Instruction Slot), S3 (Instruction Issue)]

Slide #12: Execution Pipeline Organization
[Figure: pipeline stages S3 through S8]

Slide #13: Memory Access Pipeline
[Figure: pipeline diagram with stages labeled S0 through S12]

Slide #14: Instruction Latency

Slide #15: Instruction Fetch/Issue Unit
• Branch and jump prediction
  – 2K-entry Branch History Table (BHT)
    · 2-bit saturating counter
    · built into the Icache
    · not initialized on Icache fill
  – Does not limit the number of branch predictions
  – 12-entry subroutine return stack
    · stores Icache index
    · PALmode and user mode prediction
  – Mispredict trap
    · 4 ~ 5 cycles penalty on branch mispredict
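The 2-bit saturating counter named above can be sketched in a few lines. This is the textbook scheme rather than a description of the actual 21164 circuit: the state encoding (0 to 3, predict taken at 2 or above) and the initial state are assumptions, and the slide itself notes that the hardware does not initialize the history on an Icache fill.

```python
class TwoBitCounter:
    """Saturating counter with states 0..3; predicts taken when the state is 2 or 3."""

    def __init__(self, state: int = 1):
        self.state = state  # assumed starting point: weakly not-taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter=None) -> int:
    """Run a sequence of branch outcomes through one counter and count wrong guesses."""
    if counter is None:
        counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop-closing branch: taken nine times, then falls through, executed three times over.
loop_branch = ([True] * 9 + [False]) * 3
misses = count_mispredictions(loop_branch)
print(misses)      # 4
print(misses * 5)  # 20 cycles lost at the upper 5-cycle penalty
```

After warm-up the counter mispredicts only the loop exit. A single not-taken outcome moves it from strongly taken to weakly taken, so the next loop entry is still predicted correctly.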

Slide #16: Instruction Prefetch

Slide #17: Instruction Decode/Issue
• Decode up to 4 instructions in parallel
• Check for structural hazards and data hazards
• Issue only the instructions without hazards
• Issue instructions IN ORDER
• Handle only NATURALLY ALIGNED groups of 4 instructions
• Does not advance until all 4 instructions are done
• The no-op is an important instruction
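Since the issue logic only handles naturally aligned groups of four 4-byte instructions, the address of a branch target decides how many issue slots are usable in the first cycle after the branch. The sketch below makes that concrete using branch targets that appear in the bubble-sort listing later in the deck; the helper names are illustrative.

```python
INSTRUCTION_BYTES = 4
GROUP_INSTRUCTIONS = 4
GROUP_BYTES = INSTRUCTION_BYTES * GROUP_INSTRUCTIONS  # 16-byte aligned groups


def group_base(pc: int) -> int:
    """Address of the naturally aligned group that contains pc."""
    return pc & ~(GROUP_BYTES - 1)


def slot_in_group(pc: int) -> int:
    """Position (0..3) of the instruction at pc within its group."""
    return (pc & (GROUP_BYTES - 1)) // INSTRUCTION_BYTES


def usable_slots(entry_pc: int) -> int:
    """Instructions left in the group when execution enters it at entry_pc."""
    return GROUP_INSTRUCTIONS - slot_in_group(entry_pc)


for target in (0x34, 0x4C, 0x64, 0x78):
    print(hex(target), hex(group_base(target)), slot_in_group(target), usable_slots(target))
```

A branch to 0x4c, for example, lands on the last slot of the group at 0x40, so at most one instruction of that group can issue. This is presumably one reason the slide calls the no-op important: padding with no-ops lets a compiler control where instructions fall within a group.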

Slide #18: No-op Instructions
• Integer no-op
  – NOP (BIS R31,R31,R31)
• Floating-point no-op
  – FNOP (CPYS F31,F31,F31)
• Universal no-op
  – LDQ_U R31,...
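The integer and floating-point no-ops are ordinary operate instructions whose operands are all register 31, which always reads as zero and discards writes. The sketch below packs them into 32-bit instruction words. The field layout and the opcode and function values (BIS = 0x11/0x20, CPYS = 0x17/0x020) are taken from the Alpha architecture definition rather than from these slides, so treat them as assumptions to check against the handbook; the helper name is illustrative.

```python
def encode_operate(opcode: int, ra: int, rb: int, function: int, rc: int) -> int:
    """Pack a register-form operate instruction.

    Layout: opcode in bits 31..26, Ra in 25..21, Rb in 20..16,
    function code starting at bit 5, Rc in bits 4..0.
    """
    return (opcode << 26) | (ra << 21) | (rb << 16) | (function << 5) | rc


R31 = 31  # integer zero register
F31 = 31  # floating-point zero register

NOP = encode_operate(0x11, R31, R31, 0x20, R31)    # BIS R31,R31,R31
FNOP = encode_operate(0x17, F31, F31, 0x020, F31)  # CPYS F31,F31,F31

print(f"NOP  = 0x{NOP:08X}")
print(f"FNOP = 0x{FNOP:08X}")
```

The two encodings differ only in the opcode field. That is enough to steer one no-op to the integer pipelines and the other to the floating-point pipelines, which matters when padding an aligned issue group.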

Slide #19: Code Analysis
Bubble Sort
Compiler option: cc -newc -O4 -c -o bubble.o bubble.c

Slide #20: Assembly Code in Groups (1)
• 1st Group (0x0)
  – ldah gp, 1(t12)
  – lda gp, (gp)
  – lda sp, -48(sp)
  – cpys $f31, $f31, $f31
• 2nd Group (0x10)
  – ldq a2, (gp)
  – bis zero, ra, t11
  – ldq t12, (gp)
  – bis zero, sp, a0
• 1st Pipeline States
  – t1 t2 t3 t4 t5 t6 t7
  – s3 s4 s5 s6
• 2nd Pipeline States
  – t1 t2 t3 t4 t5 t6 t7
  – s3 s4 s5 s6

Slide #21: Assembly Code in Groups (2)
• 3rd Group (0x20)
  – stq zero, 0(sp)
  – bis zero, 0x28, a1
  – bis zero, zero, t0
  – jsr ra, (t12), _Ots
• 4th Group (0x30)
  – bis zero, 0x1, t1
  – subq t1, 0xa, t3
  – bge t3, 0x78
  – bis zero, t1, t2
• 3rd Pipeline States
  – t5 t6 t7 t8 t9
  – s3 s4 s5 s6
• 4th Pipeline States
  – t30 t31 t32 t33 t34 t35
  – s3 s4 s5 s6

Slide #22: Assembly Code in Groups (3)
• 5th Group (0x40)
  – s4addq t1, sp, t5
  – lda t6, 36(sp)
  – s4addq t0, sp, t4
  – lds $f0, 0(t4)
• 6th Group (0x50)
  – lds $f1, 0(t5)
  – cmptlt $f0, $f1, $f10
  – fbeq $f10, 0x64
  – sts $f1, 0(t4)
• 5th Pipeline States
  – t33 t34 t35 t36 t37 t38
  – s3 s4 s5 s6
  – s3 s4 s5 s6 CV
• 6th Pipeline States
  – s3 s4 s5 s6
  – s3 s4 s5 s6 s7 s8
  – s3 s4

Slide #23: Assembly Code in Groups (4)
• 7th Group (0x60)
  – sts $f0, 0(t5)
  – lda t5, 4(t5)
  – cmpule t5, t6, t9
  – cpys $f31, $f31, $f31
• 8th Group (0x70)
  – addl t2, 0x1, t2
  – bne t9, 0x4c
  – addl t1, 0x1, t1
  – subq t1, 0xa, t10
• 7th Pipeline States
  – s3 s4 s5 s6
• 8th Pipeline States
  – s3 s4 s5 s6
  – s0 s1 s2 s3 s4 s5 s6
  – s0 s1 s2 s3 s4 s5

Slide #24: Assembly Code in Groups (5)
• 9th Group (0x80)
  – addl t0, 0x1, t0
  – blt t10, 0x34
  – bis zero, t11, ra
  – bis zero, zero, v0
  – lda sp, 48(sp)
  – ret zero, (ra), 1
• 9th Pipeline States
  – s3 s4 s5 s6
  – s0 s1 s2 s3 s4
  – s0 s1 s2 s3

Slide #25: I-box Good and Bad
• Good
  – instruction prefetch
  – low latency and high clock rate
• Bad
  – high branch mispredict penalty
  – in-order issue
  – naturally aligned issue
  – no stalls after stage 4; the instruction is replayed whenever a stall would be needed

Slide #26: E-box Good and Bad
• Good
  – low execution latency and high clock rate
  – supports various floating-point formats
• Bad
  – LOAD/STORE multiplexed into the integer unit
  – one more stage for the floating-point pipeline
• What else?

Slide #27: Memory Unit Overview
• Two-level data cache and a 64-entry DTB
• Memory unit (Mbox)
  – Load instructions and Miss Address File (MAF)
    · LDB, LDW, LDL, LDQ, LDL_L, LDQ_L, LDS, LDT
  – Store instructions and Write Buffer (WB)
    · STB, STW, STL, STQ, STL_C, STQ_C, STS, STT
  – Memory Barrier (MB)
  – Write Memory Barrier (WMB)
• Data hazards and replay traps
  – Load after store, store after store
  – MAF full and WB full

Slide #28: Miss Address File
• Holds load misses in 6 entries
  – physical address
  – destination register
  – instruction type
    · integer/floating-point
    · 4-byte/8-byte/IEEE-S-Type/VAX-G-Type, etc.
• Holds instruction fetch addresses in 4 entries
  – physical address
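The load side of the Miss Address File can be pictured as a small table that accepts misses until its six entries are occupied, at which point the Mbox has to fall back on a replay trap (the "MAF full" case from the memory unit overview). The sketch below models only that capacity behavior. It deliberately ignores merging of misses to the same block, and the class names, the `fill` method, and the string used for the instruction type are all illustrative assumptions.

```python
from dataclasses import dataclass, field

BLOCK_BYTES = 32   # primary data cache block size
LOAD_ENTRIES = 6   # load-miss entries in the MAF


@dataclass
class LoadMiss:
    phys_addr: int
    dest_reg: str
    kind: str  # e.g. "integer/4-byte" or "floating-point/IEEE-S"


@dataclass
class MissAddressFile:
    capacity: int = LOAD_ENTRIES
    entries: list = field(default_factory=list)

    def record(self, miss: LoadMiss) -> bool:
        """Accept a load miss; False means the file is full and the load must replay."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(miss)
        return True

    def fill(self, block_addr: int) -> list:
        """A block arrived: release and return every miss that was waiting on it."""
        block = block_addr // BLOCK_BYTES
        done = [m for m in self.entries if m.phys_addr // BLOCK_BYTES == block]
        self.entries = [m for m in self.entries if m.phys_addr // BLOCK_BYTES != block]
        return done


maf = MissAddressFile()
accepted = [maf.record(LoadMiss(0x1000 + 64 * i, f"r{i}", "integer/8-byte")) for i in range(7)]
print(accepted)  # six accepted, the seventh refused

released = maf.fill(0x1000)
print([m.dest_reg for m in released])  # ['r0']
```

Keeping the destination register and instruction type with each entry is what allows a fill to complete the load later without re-executing the instruction.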

Slide #29: Miss Address File Details
• One-to-one mapping ???
  – LDL R2, 0(R1) and LDL R3, 0(R1)
• Same size ???
  – LDL R2, 0(R1) and LDQ R3, 8(R1)
• Even with even, odd with odd (LDL instruction only) ???
  – LDL R2, 0(R1) and LDL R3, 12(R1)
• Integer with integer, FP with FP ???
  – LDL R2, 0(R1) and LDS FR2, 8(R1)

Slide #30: Data Hazard
• Load after store
  – (1 cycle later) Replay trap (7-cycle penalty)
  – (2 cycles later) Issue stalled
  – (Compiler scheduled 3 cycles later) OK
• Store after load
  – Bits are set in each conflicting MAF entry to prevent its fill from being placed in the Dcache when it arrives, and to prevent subsequent loads from merging.
  – Conflict bits are set with the store instruction in the write buffer to prevent the store from being issued until all conflicting load instructions have been issued to the Cbox.
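The load-after-store cases on this slide amount to a lookup on how many cycles separate the conflicting load from the store. A minimal sketch follows; the function name and return strings are illustrative, it assumes the two instructions really do conflict, and it makes no claim about how long the issue stall lasts since the slide does not say.

```python
REPLAY_TRAP_PENALTY_CYCLES = 7  # penalty quoted on the slide


def load_after_store(cycles_after_store: int) -> str:
    """Classify a conflicting load issued the given number of cycles after a store."""
    if cycles_after_store < 1:
        raise ValueError("the load must come at least one cycle after the store")
    if cycles_after_store == 1:
        return "replay trap"
    if cycles_after_store == 2:
        return "issue stalled"
    return "ok"


for distance in (1, 2, 3, 4):
    print(distance, load_after_store(distance))
```

A compiler that knows this table can place independent instructions between the store and the load, so the pair is at least three cycles apart and pays neither the stall nor the 7-cycle replay.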

Slide #31: M-box Good and Bad
• Good
  – non-blocking
  – 2-level cache and large cache
  – merging for both loads and stores, which reduces traffic
  – in-order issue to the C-box and out-of-order completion
• Bad
  – replays every time the buffers are full, at a high penalty
• What else?

Slide #32: Performance Characterization
[Figure: Percentage of time in PALcode]

Slide #33: Performance Characterization
[Figure: Distribution of issue cycles for the Alpha 21164]

Slide #34: Performance Characterization
[Figure: Branch mispredictions]

Slide #35: Performance Characterization
[Figure: Cache misses per thousand instructions on the Alpha 21164]

Slide #36: Reference
• Hardware Reference Manual
  – Digital Semiconductor Alpha Microprocessor (order number: EC-QP99A-TE)
• Alpha AXP Architecture Handbook
  – Digital Semiconductor (order number: EC-QD2KA-TE)
• Alpha Implementations and Architecture
  – D. Bhandarkar, Digital Press, QA 76.8.A176 B471
• Related materials for these slides