Slide #1 (February 11, 1997, EECS 598, Alpha Microprocessor, Jerry Huang)
Alpha 21164 Microprocessor: The World’s Highest Performance Microprocessor
Zhihui Huang (Jerry), University of Michigan

Slide #2: Historical Perspective
• CISC and Digital VAX (~1980)
• Serious exploration of RISC at Digital (1982)
• Fragmented efforts on RISC (1983 to 1984)
  – SAFE, HR-32, CASCADE projects
• First draft of the PRISM architecture (1985)
• Cancellation of PRISM (1988)
• First RISC workstation based on MIPS R2000 (1989)
• PRISM renamed to Alpha (1990)
• First-generation Alpha 21064 (1992)
• Second-generation Alpha 21164 (1994)

Slide #3: Alpha Microprocessor Roadmap
[Figure: roadmap chart, horizontal axis 1992 to 1997, vertical axis ticks 5 to 20 (unit not given). Plotted parts: 21064 at 150 MHz and 200 MHz; 21064A at 275 MHz; 21164 at 300, 333, 366, 400, 433, and 500 MHz. A "Here we are" marker shows the current position.]

Slide #4: The 21164 Architecture
• 64-bit load and store RISC architecture
• Byte addressable
  – 43-bit virtual address, 40-bit physical address
• Integer types: Byte, Word, Longword, Quadword
• Floating-point data types
  – Longword integer format in floating-point unit
  – Quadword integer format in floating-point unit
  – IEEE and VAX floating-point formats
• CALL_PAL instruction
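
As a quick check of what those address widths imply, the sketch below computes the size of each address space. The helper name `address_space_bytes` is illustrative and does not come from the slides.

```python
def address_space_bytes(address_bits):
    """Bytes addressable with the given number of address bits (byte addressing)."""
    return 1 << address_bits

TIB = 1 << 40  # one tebibyte

virtual_tib = address_space_bytes(43) // TIB   # 43-bit virtual address
physical_tib = address_space_bytes(40) // TIB  # 40-bit physical address
print(f"virtual: {virtual_tib} TiB, physical: {physical_tib} TiB")
```

So the implementation exposes 8 TiB of virtual and 1 TiB of physical address space, even though registers and pointers are 64 bits wide.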

Slide #5: 21164 Characteristics
• […] µm CMOS technology
  – 4 layers of metallization
  – 9.66 million transistors
  – 14.4 mm x 14.5 mm die size (209 mm²)
• Package and power
  – 499-pin PGA, 291 signal pins
  – 3.3 V external, 2.2 V internal
  – 37 W at 433 MHz
• Clock frequency: 300 MHz to 500 MHz
  – SPECint95: 11.3 (at 300 MHz) to 15.4 (at 500 MHz)
  – SPECfp95: 14.5 (at 300 MHz) to 21.1 (at 500 MHz)

Slide #6: On-chip Cache Organization
• An on-chip, 8 KB primary instruction cache
  – direct mapped, 32-byte block (8 instructions)
  – virtual, 7-bit ASN (MAX_ASN = 127), 1-bit PALcode
• An on-chip, 8 KB primary data cache
  – dual-read-ported, single-write-ported
  – virtually indexed, physically tagged
  – write-through, read-allocate, direct mapped, 32-byte block
• Large on-chip L2 cache
  – 96 KB, 3-way set associative, physical
  – write-back, write-allocate, byte-accessible
  – 32-byte (256-bit) or 64-byte (512-bit) block
  – mixed data and instruction cache
  – pipelined (16 bytes per CPU cycle)
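
The cache parameters above fix how an address splits into offset and index bits. The following is a minimal sketch of that arithmetic; the function `cache_geometry` is a hypothetical helper, not something from the 21164 documentation.

```python
def cache_geometry(size_bytes, block_bytes, ways=1):
    """Return (sets, offset_bits, index_bits) for a cache.

    Assumes size, block size, and set count are powers of two
    once the associativity is divided out.
    """
    sets = size_bytes // (block_bytes * ways)
    offset_bits = block_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

KB = 1024
l1 = cache_geometry(8 * KB, 32)               # primary caches, direct mapped
l2_64 = cache_geometry(96 * KB, 64, ways=3)   # L2 with 64-byte blocks
l2_32 = cache_geometry(96 * KB, 32, ways=3)   # L2 with 32-byte blocks

for name, geom in (("L1", l1), ("L2/64B", l2_64), ("L2/32B", l2_32)):
    print(name, "sets=%d offset_bits=%d index_bits=%d" % geom)
```

For the 8 KB primary caches this gives 256 blocks, 5 offset bits, and 8 index bits, 13 bits in total. The 3-way organization is what lets a 96 KB cache keep a power-of-two set count (512 or 1024 sets).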

Slide #7: TLB Organization
• Instruction Translation Buffer
  – 48-entry, fully associative
  – not-last-used replacement algorithm
  – 8 KB to 4 MB pages
  – 2 superpages, only in privileged mode
• Data Translation Buffer
  – 64-entry, fully associative
  – dual-read-ported
  – not-last-used replacement algorithm
  – superpage

Slide #8: External Interface

Slide #9: Alpha 21164 Block Diagram

Slide #10: Microarchitecture Function Units
• Instruction fetch and decode unit (Ibox)
  – Instruction prefetcher and instruction decoder
  – Branch prediction
  – Instruction translation buffer (ITB)
  – Interrupt support
• Integer execution unit (Ebox)
• Floating-point execution unit (Fbox)
• Memory address translation unit (Mbox)
  – Data Translation Buffer (DTB)
  – Miss Address File (MAF)
  – Write Buffer
• Cache control and bus interface unit (Cbox)

Slide #11: Instruction Issue Pipeline Organization
[Figure: four-stage issue pipeline. S0: Instruction Cache; S1: Instruction Buffer; S2: Instruction Slot; S3: Instruction Issue.]

Slide #12: Execution Pipeline Organization
[Figure: execution pipeline, stages S3 to S8.]

Slide #13: Memory Access Pipeline
[Figure: memory access pipeline diagram, stages S0 to S12.]

Slide #14: Instruction Latency

Slide #15: Instruction Fetch/Issue Unit
• Branch and Jump Prediction
  – 2K-entry Branch History Table (BHT)
    · 2-bit saturating counter
    · built into the Icache
    · not initialized on Icache fill
  – Does not limit the number of branch predictions
  – 12-entry subroutine return stack
    · stores the Icache index
    · PALmode and user mode prediction
  – Mispredict trap
    · 4 to 5 cycle penalty on branch mispredict
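
The 2-bit saturating counter named above can be illustrated with a generic textbook model. This is a sketch only: the state encoding, the initial state, and the class name `TwoBitCounter` are assumptions, and the actual 21164 state transitions may differ.

```python
class TwoBitCounter:
    """Generic 2-bit saturating counter.

    Assumed encoding: states 0 and 1 predict not taken,
    states 2 and 3 predict taken.
    """

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredicts(counter, outcomes):
    """Run the counter over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop-closing branch: taken nine times, then falls through.
loop_pass = [True] * 9 + [False]
counter = TwoBitCounter()
first_pass = count_mispredicts(counter, loop_pass)
second_pass = count_mispredicts(counter, loop_pass)
print(first_pass, second_pass)
```

In this model the counter warms up during the first pass and afterwards mispredicts only the loop exit. At the 4 to 5 cycle penalty quoted on the slide, that is the cost a ten-iteration inner loop would pay per pass.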

Slide #16: Instruction Prefetch

Slide #17: Instruction Decode/Issue
• Decodes up to 4 instructions in parallel
• Checks for structural and data hazards
• Issues only the instructions without hazards
• Issues instructions IN ORDER
• Handles only NATURALLY ALIGNED groups of 4 instructions
• Does not advance until all 4 instructions of the group have issued
• The no-op is therefore an important instruction
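
Because only naturally aligned groups of 4 instructions are handled, the slot a branch target lands in determines how much of its group is useful. A small sketch follows, assuming 4-byte instructions and 16-byte groups; the helper names are illustrative, and the target addresses are the branch targets that appear in the bubble sort listing later in the deck.

```python
INSTR_BYTES = 4    # every Alpha instruction is 32 bits
GROUP_BYTES = 16   # naturally aligned group of 4 instructions


def issue_slot(addr):
    """Return (group_base, slot) for an instruction address."""
    base = addr & ~(GROUP_BYTES - 1)
    slot = (addr - base) // INSTR_BYTES
    return base, slot


def useful_slots(target):
    """Instructions left in the group when execution enters at `target`."""
    _, slot = issue_slot(target)
    return GROUP_BYTES // INSTR_BYTES - slot


for target in (0x34, 0x4C, 0x64, 0x78):
    base, slot = issue_slot(target)
    print(f"target {target:#x}: group {base:#x}, slot {slot}, "
          f"{useful_slots(target)} useful instruction(s)")
```

A target in the last slot, such as 0x4c, leaves a single useful instruction in its group, which is why compilers pad with no-ops to align hot branch targets.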

Slide #18: No-op Instructions
• Integer no-op
  – NOP (BIS R31,R31,R31)
• Floating-point no-op
  – FNOP (CPYS F31,F31,F31)
• Universal no-op
  – LDQ_U R31,...

Slide #19: Code Analysis
Bubble Sort
Compiler option: cc -newc -O4 -c -o bubble.o bubble.c

Slide #20: Assembly Code in Groups (1)
• 1st Group (0x0)
  – ldah gp, 1(t12)
  – lda gp, -32528(gp)
  – lda sp, -48(sp)
  – cpys $f31, $f31, $f31
• 2nd Group (0x10)
  – ldq a2, -32752(gp)
  – bis zero, ra, t11
  – ldq t12, -32744(gp)
  – bis zero, sp, a0
• 1st Pipeline States
  – t1 t2 t3 t4 t5 t6 t7
  – s3 s4 s5 s6
  – -- -- -- --
• 2nd Pipeline States
  – t1 t2 t3 t4 t5 t6 t7
  – s3 s4 s5 s6

Slide #21: Assembly Code in Groups (2)
• 3rd Group (0x20)
  – stq zero, 0(sp)
  – bis zero, 0x28, a1
  – bis zero, zero, t0
  – jsr ra, (t12), _Ots
• 4th Group (0x30)
  – bis zero, 0x1, t1
  – subq t1, 0xa, t3
  – bge t3, 0x78
  – bis zero, t1, t2
• 3rd Pipeline States
  – t5 t6 t7 t8 t9
  – s3 s4 s5 s6
• 4th Pipeline States
  – t30 t31 t32 t33 t34 t35
  – s3 s4 s5 s6

Slide #22: Assembly Code in Groups (3)
• 5th Group (0x40)
  – s4addq t1, sp, t5
  – lda t6, 36(sp)
  – s4addq t0, sp, t4
  – lds $f0, 0(t4)
• 6th Group (0x50)
  – lds $f1, 0(t5)
  – cmptlt $f0, $f1, $f10
  – fbeq $f10, 0x64
  – sts $f1, 0(t4)
• 5th Pipeline States
  – t33 t34 t35 t36 t37 t38
  – s3 s4 s5 s6
  – s3 s4 s5 s6 CV
• 6th Pipeline States
  – 35 36 37 38 39 40 41 42 43 44
  – s3 s4 s5 s6
  – s3 s4 s5 s6 s7 s8
  – s3 s4

Slide #23: Assembly Code in Groups (4)
• 7th Group (0x60)
  – sts $f0, 0(t5)
  – lda t5, 4(t5)
  – cmpule t5, t6, t9
  – cpys $f31, $f31, $f31
• 8th Group (0x70)
  – addl t2, 0x1, t2
  – bne t9, 0x4c
  – addl t1, 0x1, t1
  – subq t1, 0xa, t10
• 7th Pipeline States
  – 44 45 46 47 48
  – s3 s4 s5 s6
  – -- -- -- --
• 8th Pipeline States
  – 46 47 48 49 50 51 52
  – s3 s4 s5 s6
  – s0 s1 s2 s3 s4 s5 s6
  – s0 s1 s2 s3 s4 s5

Slide #24: Assembly Code in Groups (5)
• 9th Group (0x80)
  – addl t0, 0x1, t0
  – blt t10, 0x34
  – bis zero, t11, ra
  – bis zero, zero, v0
  – lda sp, 48(sp)
  – ret zero, (ra), 1
• 9th Pipeline States
  – 52 53 54 55 56 57 58
  – s3 s4 s5 s6
  – s0 s1 s2 s3 s4
  – s0 s1 s2 s3

Slide #25: I-box Good and Bad
• Good
  – instruction prefetch
  – low latency and high clock rate
• Bad
  – high branch mispredict penalty
  – in-order issue
  – naturally aligned issue
  – no stalls after stage 4; a replay is needed whenever a stall would be required

Slide #26: E-box Good and Bad
• Good
  – low execution latency and high clock rate
  – supports several floating-point formats
• Bad
  – LOAD/STORE multiplexed into the integer unit
  – one more stage in the floating-point pipeline
• What else?

Slide #27: Memory Unit Overview
• Two-level data cache and a 64-entry DTB
• Memory Unit (Mbox)
  – Load instructions and Miss Address File (MAF)
    · LDB, LDW, LDL, LDQ, LDL_L, LDQ_L, LDS, LDT
  – Store instructions and Write Buffer (WB)
    · STB, STW, STL, STQ, STL_C, STQ_C, STS, STT
  – Memory Barrier (MB)
  – Write Memory Barrier (WMB)
• Data Hazards and Replay Traps
  – Load After Store, Store After Store
  – MAF full and WB full

Slide #28: Miss Address File
• Holds load misses in 6 entries
  – physical address
  – destination register
  – instruction type
    · integer/floating-point
    · 4-byte/8-byte/IEEE-S-Type/VAX-G-Type, etc.
• Holds instruction fetch addresses in 4 entries
  – physical address

Slide #29: Miss Address File Details
• One-on-one mapping???
  – LDL R2,0(R1) and LDL R3,0(R1)
• Same size???
  – LDL R2,0(R1) and LDQ R3,8(R1)
• Even with even, odd with odd (LDL instruction only)???
  – LDL R2,0(R1) and LDL R3,12(R1)
• Integer with integer, FP with FP???
  – LDL R2,0(R1) and LDS FR2,8(R1)

Slide #30: Data Hazard
• Load after Store
  – (1 cycle later) Replay trap (7-cycle penalty)
  – (2 cycles later) Issue stalled
  – (Compiler scheduled 3 cycles later) OK
• Store after Load
  – Bits are set in each conflicting MAF entry to prevent its fill from being placed in the Dcache when it arrives, and to prevent subsequent loads from merging.
  – Conflict bits are set with the store instruction in the write buffer to prevent the store from being issued until all conflicting load instructions have been issued to the Cbox.
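
The load-after-store cases above amount to a small decision table. The sketch below encodes only what the slide states; the function name and return strings are illustrative, and the length of the issue stall is not given in the source, so it is not modeled.

```python
REPLAY_PENALTY_CYCLES = 7  # from the slide: replay trap costs 7 cycles


def load_after_store_outcome(cycles_after_store):
    """Outcome for a conflicting load issued N cycles after a store."""
    if cycles_after_store < 1:
        raise ValueError("the load must issue at least 1 cycle after the store")
    if cycles_after_store == 1:
        return "replay trap"
    if cycles_after_store == 2:
        return "issue stall"
    return "ok"  # 3 or more cycles: no hazard


for distance in (1, 2, 3):
    print(distance, load_after_store_outcome(distance))
```

A compiler scheduling for this pipeline would therefore try to place at least two independent instruction cycles between a store and a dependent load.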

Slide #31: M-box Good and Bad
• Good
  – non-blocking
  – 2-level cache and large cache
  – merging for both loads and stores, which reduces traffic
  – in-order issue to the C-box and out-of-order completion
• Bad
  – replay whenever the buffers are full, with a high penalty
• What else?

Slide #32: Performance Characterization
Percentage of time in PALcode

Slide #33: Performance Characterization
Distribution of issue cycles for the Alpha 21164

Slide #34: Performance Characterization
Branch mispredictions

Slide #35: Performance Characterization
Cache misses per thousand instructions on the Alpha 21164

Slide #36: Reference
• Hardware Reference Manual
  – Digital Semiconductor 21164 Alpha Microprocessor (order number: EC-QP99A-TE)
• Alpha AXP Architecture Handbook
  – Digital Semiconductor (order number: EC-QD2KA-TE)
• Alpha Implementations and Architecture
  – D. Bhandarkar, Digital Press, QA 76.8.A176 B471
• Related materials for these slides
  – http://umaxp1.physics.lsa.umich.edu/~zhihuang

