Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4). Presentation 5: MAD MAC 525, 22nd February, 2006. Top Level Integration.


1 Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4). Presentation 5: MAD MAC 525, 22nd February, 2006. Top Level Integration.
  W2 Project Objective: Design a crucial part of a GPU, the Multiply Accumulate Unit (MAC), which will revolutionize graphics.
  Design Manager: Zack Menegakis

2 MAD MAC 525
  Status: Project chosen; Specifications defined; Architecture design; Behavioral Verilog; Testbenches; Verilog: gate-level design; Floor plan; Schematic
  To be done: Layout (started); Extraction, LVS, post-layout simulation

3 Recap - MAD MAC 525
  Multiply Add (MAD) / Multiply Accumulate Unit (MAC)
  Executes the function AB+C on 16-bit floating-point inputs
  Multiply and add in parallel to greatly speed up operation
  Rounding is performed only once, so accuracy is greater than with individual multiply and add functions
  One circuit to rule them all!
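The single-rounding claim can be illustrated with a small Python sketch. It assumes the 16-bit format behaves like IEEE 754 binary16 (the slides do not name the format) and uses the "e" format of Python's struct module to round to it. The names to_half, separate_mul_add and fused_mad are illustrative only and are not part of the design.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest binary16 value (assumed format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def separate_mul_add(a: float, b: float, c: float) -> float:
    """Multiply and round, then add and round: two roundings."""
    return to_half(to_half(a * b) + c)

def fused_mad(a: float, b: float, c: float) -> float:
    """Form a*b + c in wider precision and round once at the end."""
    return to_half(a * b + c)

# All three inputs are exactly representable in binary16.
a = b = 1.0 + 2.0 ** -10
c = -(1.0 + 2.0 ** -9)

print(separate_mul_add(a, b, c))  # 0.0: the low bit of a*b was rounded away
print(fused_mad(a, b, c))         # the small term 2**-20 survives
```

With these inputs the exact product is 1 + 2^-9 + 2^-20. Rounding it before the add discards the 2^-20 term, so the separate version returns zero, while the single-rounding version keeps it.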

4 Block Diagram
  Blocks: RegArray A, RegArray B, RegArray C, Multiplier, Exp Calc, Align, Adder/Subtractor, Control Logic & Sign Dtrmin, Normalize, Round, Reg Y, Leading 0 Anticipator
  Bus widths labeled in the figure: 10, 5, 5, 5, 14, 35, 22, 5, 4, 36, 14, 10, 1, 5, 5
  Input, Output: 16
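The 10-, 5- and 1-bit bus widths in the block diagram are consistent with a 16-bit word split into mantissa, exponent and sign. A minimal Python sketch of that split follows. It assumes the IEEE 754 binary16 field layout (sign in bit 15, exponent in bits 14..10, mantissa in bits 9..0), which the slides do not state; split_half and half_bits are hypothetical helper names.

```python
import struct

def half_bits(x: float) -> int:
    """Return the 16-bit pattern of x rounded to binary16 (assumed format)."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def split_half(word: int):
    """Split a 16-bit word into (sign, exponent, mantissa) fields."""
    sign = (word >> 15) & 0x1        # 1 bit
    exponent = (word >> 10) & 0x1F   # 5 bits, biased by 15 in binary16
    mantissa = word & 0x3FF          # 10 bits, hidden leading 1 not stored
    return sign, exponent, mantissa

print(split_half(half_bits(1.0)))   # (0, 15, 0)
print(split_half(half_bits(-2.5)))  # (1, 16, 256)
```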

5 Floorplan
  Blocks: Multiplier, Align C, Reg A, Reg B, Exp Calc, Reg C, Pipeline Reg, Adder, Ld Zero, Pipeline Reg, Normalize, Round, Reg Y

6 Design Decisions
  Adder: variable-length carry-select adder
  Registers: pulsed latches
  Pass logic in shifters

7 Adder Schematic: Carry Select
  Variable-length carry-select adder
  Very regular: a good compromise between speed and ease of layout
  2.5 ns delay through 37 bits
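A behavioral Python sketch of a variable-length carry-select adder is given here for illustration. Each block computes its sum for carry-in 0 and carry-in 1, and the carry arriving from the previous block selects one of them. The block sizes below are hypothetical; the slides give only the 37-bit total, not the actual partition.

```python
# Hypothetical block widths, least significant block first; they sum to 37.
BLOCK_SIZES = (2, 3, 4, 5, 6, 7, 10)

def carry_select_add(a: int, b: int, block_sizes=BLOCK_SIZES, carry_in: int = 0):
    """Add a and b block by block; return (sum, carry_out)."""
    result = 0
    shift = 0
    carry = carry_in
    for width in block_sizes:
        mask = (1 << width) - 1
        a_blk = (a >> shift) & mask
        b_blk = (b >> shift) & mask
        # In hardware both candidate sums are computed in parallel.
        sum0 = a_blk + b_blk        # assuming carry-in = 0
        sum1 = a_blk + b_blk + 1    # assuming carry-in = 1
        chosen = sum1 if carry else sum0   # the select mux
        result |= (chosen & mask) << shift
        carry = chosen >> width            # carry-out of this block
        shift += width
    return result, carry

print(carry_select_add((1 << 37) - 1, 1))  # (0, 1)
```

Making the later blocks wider gives the incoming carry time to arrive while the longer blocks are still computing their two candidate sums, which is the usual motivation for variable block lengths.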

8 Adder Schematic: 1-bit Carry Select

9 Pulsed Latch
  Advantage: practically eliminates setup time
  120 ps clock-to-~Q delay (146 ps loaded)
  16 transistors
  Simplified version of those used in the Pentium 4
  Sizing does not seem to affect speed under load
  Clock pulse generator

10 More Pass Logic
  Compared different kinds of pass logic for shifters
  Transmission gates with buffers are the fastest
  Mux Type                       Propagation Delay (worst case)
  N-pass (Align)                 78.32 ps
  Transmission gate (Normalize)  50.5 ps
  NAND                           81.22 ps
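To show where the mux delay matters, here is a behavioral Python sketch of a shifter built from rows of 2:1 muxes. It assumes a logarithmic (barrel) structure, in which stage k shifts by 2^k when the matching bit of the shift amount is set; the slides do not say which shifter structure was used, and the 36-bit width is illustrative.

```python
def mux2(select: int, in0: int, in1: int) -> int:
    """2:1 mux: the element the slide builds from pass gates or NANDs."""
    return in1 if select else in0

def log_shift_right(value: int, amount: int, width: int = 36) -> int:
    """Shift value right by amount using one mux row per bit of amount."""
    bits = [(value >> i) & 1 for i in range(width)]
    stage = 0
    while (1 << stage) < width:
        step = 1 << stage
        sel = (amount >> stage) & 1
        # Each output bit picks either itself or the bit `step` places up.
        bits = [
            mux2(sel, bits[i], bits[i + step] if i + step < width else 0)
            for i in range(width)
        ]
        stage += 1
    return sum(bit << i for i, bit in enumerate(bits))

print(log_shift_right(0b101100, 2))  # 11
```

In this structure a 36-bit shifter has six mux rows in series, so the per-mux delay is multiplied by the number of stages on the critical path.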

11
  Block       Transistor Count  Area (um^2)  Prop. Delay  Power in mW (350 MHz)
  Multiplier  3500              25000        4.43 ns      8.86
  Exponents   700               5000         942 ps       1.608
  Align       530               3800         480 ps       1.031
  Adder       3700              26500        3.24 ns      4.58
  Leading 0   350               2500         2.05 ns      0.232
  Normalize   900               6500         430 ps       2.291
  Round       300               2000         1.81 ns      0.198
  Registers   1800              9000         120 ps       -
  Total       11780             80300        -            -
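The totals in the table can be cross-checked with a few lines of Python. The figures are copied from the table. The summed power is a derived number that the slide does not report, and it excludes the registers, for which no power figure is given.

```python
# (transistors, area in um^2, power in mW at 350 MHz or None if not reported)
BLOCKS = {
    "Multiplier": (3500, 25000, 8.86),
    "Exponents":  (700, 5000, 1.608),
    "Align":      (530, 3800, 1.031),
    "Adder":      (3700, 26500, 4.58),
    "Leading 0":  (350, 2500, 0.232),
    "Normalize":  (900, 6500, 2.291),
    "Round":      (300, 2000, 0.198),
    "Registers":  (1800, 9000, None),
}

total_transistors = sum(t for t, _, _ in BLOCKS.values())
total_area_um2 = sum(a for _, a, _ in BLOCKS.values())
# Derived value, not on the slide; blocks without a power figure are skipped.
total_power_mw = sum(p for _, _, p in BLOCKS.values() if p is not None)

print(total_transistors, total_area_um2)  # 11780 80300
print(round(total_power_mw, 3))
```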

12 Design Goals: On Target
  At least 300 MHz (600 MFLOPS)
  Will be achievable through optimization and pipelining
  Pipeline stages not fully determined; 6 stages expected
  Multiplier will be pipelined to cut its delay in half
  All other individual blocks can clock at ~500 MHz
  A faster adder is being developed. It is not as easily pipelined as the multiplier, so the speed of this block will be the limiting factor for the entire circuit
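A rough Python estimate of the clock limits behind these goals, using the worst-case block delays reported in the block table. It is a sketch under simplifying assumptions that the slides do not state: each block sits in its own pipeline stage, pipelining the multiplier halves its delay exactly, register overhead is ignored, and one multiply plus one add per cycle counts as 2 floating-point operations.

```python
# Worst-case block delays in seconds, copied from the block table.
DELAYS = {
    "Multiplier": 4.43e-9,
    "Exponents": 942e-12,
    "Align": 480e-12,
    "Adder": 3.24e-9,
    "Leading 0": 2.05e-9,
    "Normalize": 430e-12,
    "Round": 1.81e-9,
}

def max_clock_mhz(delay_s: float) -> float:
    """Highest clock rate (MHz) at which a stage with this delay fits."""
    return 1.0 / delay_s / 1e6

def mflops(clock_mhz: float) -> float:
    """One multiply and one add per cycle (assumed) gives 2 FLOPs per cycle."""
    return 2 * clock_mhz

# Assume the multiplier is split into two equal pipeline stages.
staged = dict(DELAYS)
staged["Multiplier"] = DELAYS["Multiplier"] / 2

limiting_block = max(staged, key=staged.get)
print(limiting_block)                                    # Adder
print(round(max_clock_mhz(staged[limiting_block]), 2))   # 308.64
print(mflops(300))                                       # 600
```

Under these assumptions the adder's 3.24 ns delay caps the clock just above the 300 MHz target, which agrees with the slide naming the adder as the limiting block.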

13 Top Level Schematic

14 Simulations: Normalize

15 Simulations: Align

16 Simulations: Multiplier

17 Problems
  Verilog simulation of the circuit generated don't-cares after switching to the new, improved pass logic; analog simulations work just fine.
  Pass logic can be evil if done wrong.
  The multiplier initially ran at only 50 MHz due to transmission-gate XORs. Buffers solved the problem.

18 Questions?

