Survey of Digital Signal Processors Michael Warner ECD: VLSI Communication Systems.

1 Survey of Digital Signal Processors Michael Warner ECD: VLSI Communication Systems

2 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems

3 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems

4 Moore’s Law Drives Processor Development
[Figure: transistors per die by year, 1960 to 2010, for Intel microprocessors (486, Pentium, Pentium II, Pentium III, Pentium 4, Itanium) against the Moore’s Law trend. Source: Intel internal]
- Doubling the number of transistors every [...] at the same price point drives significant product opportunities
- ...especially if you have little regard for power
- But what if energy-delay had to be reduced by an order of magnitude every generation?

5 Gene’s Law Drives DSP Development
[Figure: Gene’s Law, DSP power (mW/MIPS) by year]
- Gene’s Law will have its challenges to hold the line!
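
The slide does not state the rate behind Gene’s Law. It is commonly quoted as DSP power per MIPS halving roughly every 18 months, so the sketch below assumes that period through the `halving_years` default. The function name and the default are illustrative assumptions, not figures from the deck.

```python
def projected_power(p0_mw_per_mips: float, years: float,
                    halving_years: float = 1.5) -> float:
    """Power per MIPS after `years`, assuming it halves every `halving_years`.

    The 1.5-year default is an assumed rate, not a value taken from the slides.
    """
    return p0_mw_per_mips * 0.5 ** (years / halving_years)


if __name__ == "__main__":
    # Starting from 12.5 mW/MIPS (a value on the design-constraints slide),
    # two assumed halving periods (3 years) leave one quarter of the power.
    print(projected_power(12.5, 3.0))
```

Changing `halving_years` shows how sensitive the projection is to the assumed rate, which is the point of the slide's remark about holding the line.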

6 What’s Driving Gene’s Law?
- Digital Audio: MP3, Real Audio
- Streaming Video: MPEG-4, H.263
- Connectivity: Internet, Bluetooth
- Modem Standards: UMTS, GSM
[Figure: handset screen showing a stock quote (TXN UPX 12 3/4) with a "Buy Now? Yes / No" prompt]

7 DSP Design Constraints

DEVICE CAPABILITIES
Technology (µm)   Transistors   MIPS    RAM (bytes)   Power (mW/MIPS)   Price/MIPS
3                 50K           …       …             …                 $…
…                 …K            40      2K            12.5              $…
…                 …M            5,000   3M            0.1               $…
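
The two power figures that survive on this slide, 12.5 and 0.1 mW/MIPS, can be turned into a number of halvings. The sketch below does only that; the time span between the two devices is not recoverable from the slide text, so no halving period is derived. The function name is hypothetical.

```python
import math


def halvings(p_start: float, p_end: float) -> float:
    """Number of times the power figure halved going from p_start to p_end."""
    return math.log2(p_start / p_end)


if __name__ == "__main__":
    # 12.5 -> 0.1 mW/MIPS is a 125x reduction, a little under seven halvings.
    print(round(halvings(12.5, 0.1), 2))
```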

8 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems

9 What Makes a DSP a DSP?
- Single-Cycle MAC
- Multiple Execution Units
- High Bandwidth (Flat) Memory Sub-Systems
- Efficient Zero-Overhead Looping
- Short Pipeline
- High Bandwidth I/O
- Specialized Instruction Sets
- Sophisticated DMA
- Little to No Speculation

10 Single Cycle MAC
- MACs Typically Determine DSP Performance and Pipeline Length (EX)
- Most DSPs Have 2-8 MAC Units
- MACs Typically Operate in Both a Scalar and Vector Mode
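
As a rough illustration of why the MAC dominates DSP performance, one FIR output sample is just a chain of multiply-accumulates. The sketch below shows the same sum computed with one accumulator and with two independent accumulators, the way a pair of MAC units might split the work. Both functions are illustrative Python, not TI code, and they assume `samples` and `coeffs` have the same length.

```python
def fir_scalar(samples, coeffs):
    """One FIR output sample: one multiply-accumulate (MAC) per tap."""
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += x * h  # a single MAC
    return acc


def fir_dual_mac(samples, coeffs):
    """Same sum using two independent accumulators (two MACs per iteration)."""
    n = len(coeffs)
    acc0 = acc1 = 0
    for i in range(0, n - 1, 2):
        acc0 += samples[i] * coeffs[i]          # MAC unit 0
        acc1 += samples[i + 1] * coeffs[i + 1]  # MAC unit 1
    if n % 2:                                   # odd tap count: one leftover MAC
        acc0 += samples[n - 1] * coeffs[n - 1]
    return acc0 + acc1
```

With two accumulators the loop runs about half as many iterations, which is the effect multiple MAC units have on inner-loop cycle counts.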

11 Multiple Instruction Units
- VLIW Architectures Driving ILP
- Typical Instruction Units
  - M-Unit: MAC
  - S-Unit: Shift
  - L-Unit: ALU
  - D-Unit: Load/Store
- Industry Has Converged on an ILP of ~8
[Figure: data path with two register files (A0 - A15 and B0 - B15), each feeding its own L, S, M, and D units, with cross paths (1X, 2X) and load-data paths (DDATA_I1, DDATA_I2)]
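
To make the "ILP of ~8" figure concrete, the sketch below greedily packs a stream of operations into execute packets, given two units of each type (L, S, M, D). It assumes every operation is independent of the others, which a real VLIW compiler cannot do: dependencies, latencies, and register pressure are all ignored here. The unit counts and function name are assumptions for illustration.

```python
# Assumed unit inventory: two data paths, each with one L, S, M and D unit.
UNITS_PER_PACKET = {"L": 2, "S": 2, "M": 2, "D": 2}


def pack(ops):
    """Greedily pack independent ops (unit-type letters) into packets, in order."""
    packets, current, free = [], [], dict(UNITS_PER_PACKET)
    for unit in ops:
        if free[unit] == 0:          # every unit of this type is already used
            packets.append(current)  # close the packet and start a new one
            current, free = [], dict(UNITS_PER_PACKET)
        current.append(unit)
        free[unit] -= 1
    if current:
        packets.append(current)
    return packets


if __name__ == "__main__":
    # Eight ops fill one packet; the third M op is pushed into a second packet.
    print(pack(["M", "M", "D", "D", "L", "S", "L", "S", "M"]))
```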

12 High Bandwidth Memory Sub-Systems
- Multiple Load-Store Units Required to Feed Data Path
- Tightly Coupled Memory is Typically Dual Ported
- Harvard Architecture is Heavily Banked
[Figure: Central Arithmetic Logic Unit block diagram with EXTERNAL MEMORY, INTERNAL MEMORY, and MUX stages; labels include ALU, SHIFTER, MAC, A, B, PC, CNTL, ARs, and buses P, C, D, E]
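
A small sketch of why banking matters: with word-interleaved banks, two accesses in the same cycle can proceed in parallel only if they map to different banks. The bank count, word size, and interleaving scheme below are assumptions chosen for illustration, not parameters of any particular device. Dual-ported memory would relax the rule by allowing two accesses to the same bank.

```python
def bank_of(addr, num_banks=4, word_bytes=2):
    """Bank index for a byte address, assuming word-interleaved banks."""
    return (addr // word_bytes) % num_banks


def conflicts(addrs, num_banks=4, word_bytes=2):
    """True if any two same-cycle accesses land in the same bank."""
    banks = [bank_of(a, num_banks, word_bytes) for a in addrs]
    return len(set(banks)) < len(banks)


if __name__ == "__main__":
    print(conflicts([0, 2]))  # adjacent words fall in different banks
    print(conflicts([0, 8]))  # a stride of num_banks words hits the same bank
```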

13 Specialized Instruction Sets
- Base RISC ISA Plus CISC ISA Driven by End Application
  - MAC
  - SAD
  - LMS
  - FIRS
  - Viterbi
- Support For Both Scalar and Vector Instructions
- Support For 8, 16 and 32-Bit Instructions
- Instructions are Highly Orthogonal
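
Two of the application-driven operations listed above are easy to state in a few lines. The sketch below shows what a SAD (sum of absolute differences) and a single LMS (least mean squares) weight update compute. It illustrates the arithmetic only; it does not model how any DSP encodes or pipelines these instructions, and the function names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def lms_step(weights, x, desired, mu):
    """One LMS update: w <- w + mu * e * x, with error e = desired - w.x."""
    y = sum(w * xi for w, xi in zip(weights, x))  # filter output (a MAC chain)
    e = desired - y                               # error against the target
    new_weights = [w + mu * e * xi for w, xi in zip(weights, x)]
    return new_weights, e
```

Both reduce to tight loops of multiply, add, and absolute-value steps, which is why folding them into single instructions pays off for the target applications.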

14 Scalar (55x) vs VLIW (64x)
- Scalar DSPs Tend to be More CISC-Like
  - Hurts Compiler Performance
  - Improves Energy-Delay
  - Improves Code Density
  - Limits Top End Performance
- VLIW DSPs Tend to be More RISC-Like
  - RISC + GP Regs + Orthogonality Makes For a Good C Compiler
  - Assembler Code Is Challenging
  - RISC ISA Allows for Higher Frequencies
  - Load-Store Hurts Energy-Delay

15 TMS320C54x

16 TMS320C54x Protected Pipeline

CYCLES →
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6
Fully loaded pipeline: the column containing X1 R2 A3 D4 F5 P6

- Prefetch (P): Calculate address of instruction
- Fetch (F): Collect instruction
- Decode (D): Interpret instruction
- Access (A): Collect address of operand
- Read (R): Collect operand
- Execute (X): Perform operation

Note: Protected Pipeline Limits Micro-Architectural Flexibility and Performance
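
The staggered occupancy of the six stages can be generated mechanically. The sketch below assumes one instruction enters per cycle with no stalls and lists, for each cycle, which stage of which instruction is active. Stage letters follow the slide; the function name is hypothetical.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # Prefetch, Fetch, Decode, Access, Read, Execute


def pipeline_rows(num_instructions):
    """One row per cycle: the stage/instruction pairs active in that cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for instr in range(num_instructions):
            stage = cycle - instr  # instruction `instr` enters at cycle `instr`
            if 0 <= stage < len(STAGES):
                row.append(f"{STAGES[stage]}{instr + 1}")
        rows.append(row)
    return rows


if __name__ == "__main__":
    for cycle, row in enumerate(pipeline_rows(6), start=1):
        print(cycle, " ".join(row))
```

For six instructions the sixth cycle is the first with all six stages busy, which is the fully loaded point marked on the slide.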

17 TMS320C6xx
[Block diagram: ’C6xx CPU core. Program Fetch, Instruction Dispatch, and Instruction Decode feed Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units L2, S2, M2, D2). Also shown: Control Registers, Control Logic, Interrupts, Emulation, Test. Legend: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit]

18 TMS320C6xx Exposed Pipeline
- Fetch
  - PG: Program Address Generate
  - PS: Program Address Send
  - PW: Program Access Ready Wait
  - PR: Program Fetch Packet Receive
- Decode
  - DP: Instruction Dispatch
  - DC: Instruction Decode
- Execute
  - E1 - E5: Execute 1 through Execute 5
[Figure: Execute Packets 1 through 7, each passing through PG PS PW PR DP DC E1 E2 E3 E4 E5, overlapped in the pipeline]
Note: Exposed Pipeline Adds Risk to Programming Model
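
The risk noted on this slide can be shown with a toy model. In an exposed pipeline the hardware does not stall a read that arrives before an earlier result has landed, so the programmer or compiler must fill the delay slots. The sketch below is a minimal simulation under assumed semantics: the operation tuples, the register name, and the 4-slot load delay in the usage example are all illustrative choices, not a description of a real instruction set.

```python
def run_exposed(program, regs):
    """Run one op per cycle with no interlocks; return values seen by reads.

    Ops: ("load", dest, value, delay_slots), ("read", src), or ("nop",).
    A load issued in cycle c becomes visible in cycle c + delay_slots + 1.
    """
    regs = dict(regs)
    pending = []   # (cycle at which the result becomes visible, dest, value)
    observed = []
    for cycle, op in enumerate(program):
        # Commit every result whose delay slots have elapsed.
        for ready, dest, value in [p for p in pending if p[0] <= cycle]:
            regs[dest] = value
        pending = [p for p in pending if p[0] > cycle]
        if op[0] == "load":
            _, dest, value, delay_slots = op
            pending.append((cycle + delay_slots + 1, dest, value))
        elif op[0] == "read":
            observed.append(regs[op[1]])  # no stall: may see the old value
    return observed


if __name__ == "__main__":
    program = [("load", "A1", 42, 4), ("read", "A1"),
               ("nop",), ("nop",), ("nop",), ("read", "A1")]
    # The first read is too early and sees the stale 0; the second sees 42.
    print(run_exposed(program, {"A1": 0}))
```

A protected pipeline would stall the early read instead, trading the programming risk for the hardware constraints noted on the C54x slide.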

19 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems

20 Micro-Architectural Challenges
- Accessing (Flat) On-Chip Memory At Speed Within 2-3 Cycles
- Feeding Multiple Functional Units From a Single Register File
- Running 600 MHz+ with a 7-9 Stage Pipeline
- Linking Multiple Functional Units with Result Forwarding
- Implementing CISC Data-path to Meet Area and Performance Goals
- Achieving ARM-Like Code Density
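
The first and third challenges combine into a simple timing budget. The sketch below converts the slide's 600 MHz and 2-3 cycle figures into nanoseconds. It covers only the arithmetic and says nothing about wire delay, array size, or process, which determine whether the budget can be met. The function names are hypothetical.

```python
def cycle_time_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def access_budget_ns(freq_mhz, cycles):
    """Time available for a memory access that must finish in `cycles` cycles."""
    return cycles * cycle_time_ns(freq_mhz)


if __name__ == "__main__":
    # At 600 MHz a cycle is about 1.67 ns, so a 2-3 cycle access
    # has roughly 3.3 to 5 ns end to end.
    for cycles in (2, 3):
        print(cycles, round(access_budget_ns(600, cycles), 2))
```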

21 What Does and Doesn’t Work?
- Do
  - Banked Memory
  - Dual Access Memory
  - Full Custom Register Files
  - Split/Multiple Register Files
  - Custom/Semi-Custom Data-paths
  - Variable Length Instructions
  - CISC ISA
  - Co-Processors
  - Multi-Core
- Don’t
  - Multi-Level Caches
  - Super-Scalar
  - VLIW Packet Descriptors
  - Speculative Branching
  - Full Synthesis
  - Dynamic Logic
- Consider
  - Multi-Threading
  - uP with Co-Processors

22 Agenda
- Industry Trends
- DSP Architecture
- DSP Micro-Architecture
- DSP Systems

23 DSP Systems
- Wireless Infrastructure: TMS320C…, … MHz, Viterbi and Turbo hardware accelerators
- Wired Infrastructure: TMS320C5561, 6 DSPs at 300 MHz, 3 MB (24 Mb) integrated memory, 180M transistors
- Wireless Client: OMAP5910, DSP+GPP, low power consumption; voice, data, video
- Digital Still Camera: TMS320DM310, DSP+GPP, imaging accelerators
- Performance Audio: TMS320DA610, 225 MHz, floating point

24 VoIP Platform
- TNETV3010 Features
  - 6 C55x DSPs at 300 MHz
  - Shared Instruction Memory
  - Broadcast DMA
  - 24M Bits of On-Chip SRAM

25 DaVinci Platform

26 OMAP Platform
- OMAP2420 Features
  - ARM 330 MHz, VFP (Vector Floating Point), 32K/32K I/D cache
  - … 220 MHz
  - 2D/3D graphics accelerator
  - IVA supports still images to >4 Mpixels, 30 fps VGA video decode
  - Output to TV for gaming and video playback
  - Encryption hardware for DRM and security
[Block diagram: OMAP2420 with ARM11 + VFP, TMS320C55x DSP, 2D/3D Graphics Accelerator, Imaging & Video Accelerator (IVA), Internal SRAM, Camera I/F, LCD I/F, Video Out, Memory Controller, Security, Peripherals, L3 Interconnect, L4 Interconnect]

