
1 CMPE750 - Shaaban #1 lec # 8 Spring 2015 4-2-2015
Computing Engine Choices (repeated here from lecture 1)
Processors:
- General Purpose Processors (GPPs): Intended for general purpose computing (desktops, servers, clusters, ...). General purpose ISAs (RISC or CISC).
- Application-Specific Processors (ASPs): Processors with special purpose ISAs and architectural features tailored towards specific application domains, e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Vector Processors(?), ...
Co-Processors: A hardware (hardwired) implementation of specific algorithms with a limited programming interface (augment GPPs or ASPs).
Configurable Hardware:
- Field Programmable Gate Arrays (FPGAs)
- Configurable array of simple processing elements
Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution for a specific computational task.
The ISA forms an abstraction layer that sets the requirements for both compiler and CPU designers.
The choice of one or more depends on a number of factors including:
- Type and complexity of computational algorithm (general purpose vs. specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency (e.g. computations per watt or computations per chip area)
- Power requirements
- Real-time constraints
- Development time and cost
- System cost
- Expected useful lifecycle of computing element or system

2 Computing Engine Choices (repeated here from lecture 1)
[Figure: spectrum of computing engine choices: General Purpose Processors (GPPs), Application-Specific Processors (ASPs, e.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Physics Processors, ...), Co-Processors, Configurable Hardware, and Application Specific Integrated Circuits (ASICs). One axis is labeled Programmability / Flexibility; the other is labeled Specialization, Development cost/time, Performance/Chip Area/Watt (Computational Efficiency).]
Processor = Programmable computing element that runs programs written using a pre-defined set of instructions. The ISA is the interface between software and hardware (processors).
For Application-Specific Processors (ASPs): Application domain requirements -> ASP ISA -> ASP architecture
Selection Factors:
- Type and complexity of computational algorithm (general purpose vs. specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency
- Power requirements
- Real-time constraints
- Development time and cost
- System cost

3 Computing Element Choices: Observation
Why Application-Specific Processors (ASPs)?
Generality and efficiency are in some sense inversely related to one another:
- The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient (e.g. computations per chip area or per watt) it will be in performing any one of those specific tasks.
- Design decisions are therefore almost always compromises; designers identify key features or requirements of applications that must be met and make compromises on other, less important features.
To counter the problem of computationally intense and specialized problems for which general purpose processors/machines cannot achieve the necessary performance or other requirements (i.e. computational efficiency):
- Special-purpose processors (or Application-Specific Processors, ASPs), attached processors, and coprocessors have been designed and built for many years for specific application domains, such as image or digital signal processing (for which many of the computational tasks are specialized and can be very well defined).
Generality = Flexibility = Programmability(?)
Efficiency = Computational efficiency (computations per watt or per chip area)

4 Digital Signal Processor (DSP) Architecture
(DSPs are often embedded)
- Classification of Main Processor Types/Applications
- Requirements of Embedded Processors
- DSP vs. General Purpose CPUs
- DSP Cores vs. Chips
- Classification of DSP Applications
- DSP Algorithm Format
- DSP Benchmarks
- Basic Architectural Features of DSPs
- DSP Software Development Considerations
- Classification of Current DSP Architectures and example DSPs:
  - Conventional DSPs (DSP generations 1-2): TI TMS320C54xx
  - Enhanced Conventional DSPs (DSP generation 3): TI TMS320C55xx
  - Multiple-Issue DSPs (DSP generation 4):
    - VLIW DSPs: TI TMS320C62xx, TMS320C64xx
    - Superscalar DSPs: LSI Logic ZSP400/500 DSP core

5 Main Processor Types/Applications
(Cost/complexity increases toward the top of this list; volume increases toward the bottom.)
General Purpose Computing & General Purpose Processors (GPPs), 64 bit:
- High performance: In general, faster is always better.
- RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS, ...
- Used for general purpose software
- End-user programmable
- Real-time performance may not be fully predictable (due to dynamic arch. features)
- Heavy weight, multi-tasking OS: Windows, UNIX
- Normally, low cost and power not a requirement (changing)
- Servers, Workstations, Desktops (PCs), Notebooks, Clusters, ...
Embedded Processing: Embedded processors and processor cores, 16-32 bit:
- Cost, power, code-size and real-time requirements and constraints
- Once real-time constraints are met, a faster processor may not be better
- e.g. Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800, ...
- Often require digital signal processing (DSP) support or other application-specific support (e.g. network, media processing)
- Single or few specialized programs, known at system design time
- Not end-user programmable
- Real-time performance must be fully predictable (avoid dynamic arch. features)
- Lightweight, often real-time OS or no OS
- Examples: Cellular phones, consumer electronics, ...
Microcontrollers, 8-16 bit:
- Extremely code size/cost/power sensitive
- Single program
- Small word size: 8 bit common
- Usually no OS
- Highest volume processors by far
- Examples: Control systems, automobiles, industrial control, thermostats, ...
Embedded processors and microcontrollers are examples of Application-Specific Processors (ASPs).

6 The Processor Design Space (Main Types)
[Figure: processor cost (chip area, power, complexity) versus performance for the main processor types.
- Microprocessors (GPPs): "Performance is everything & software rules"
- Embedded processors: application specific architectures for performance; real-time constraints, specialized applications, low power/cost constraints
- Microcontrollers: "Cost is everything"
Embedded processors and microcontrollers are marked as examples of ASPs.]

7 Requirements of Embedded Processors
Usually must meet strict real-time constraints:
- Real-time performance must be fully predictable: avoid dynamic processor architectural features that make real-time performance harder to predict (e.g. cache, dynamic scheduling, hardware speculation, ...).
- How fast? Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements.
Optimized for a single (or few) program(s): code is often in on-chip ROM or on/off-chip EPROM/flash memory.
Minimum code size (one of the initial motivations for Java).
Performance obtained by optimizing the datapath.
Low cost:
- Lowest possible area
- High computational efficiency: computation per unit area
- VLSI implementation technology usually behind the leading edge (good or bad?)
- High level of integration of peripherals (System-on-Chip, SoC, approach reduces system cost/power)
Fast time to market:
- Compatible architectures (e.g. ARM family) allow reusable code
- Customizable cores (System-on-Chip, SoC)
Low power if the application requires portability.

8 Embedded Processors: Area of processor cores = Cost (and power requirements)
Thus need to minimize chip area.
[Figure: chip area of embedded processor cores; labeled examples include a Nintendo processor, cellular phone processors, and an embedded version of a GPP.]

9 Embedded Processors: Another figure of merit: Computation per unit chip area (computational efficiency)
[Figure: computation per unit chip area of embedded processor cores; labeled examples include a Nintendo processor, cellular phone processors, and an embedded version of a GPP.]

10 Embedded Processors: Code size (smaller is better)
If a majority of the chip is the program stored in ROM, then minimizing code size is a critical issue.
How? Common embedded processor ISA features to minimize code size (CISC-like?):
1. Variable length instruction encoding is common: e.g. the Piranha has 3 instruction sizes: basic 2 byte, and 2 byte plus a 16 or 32 bit immediate
2. Complex/specialized instructions
3. Complex addressing modes

11 Embedded Systems vs. General Purpose Computing (and processors, GPPs)
General Purpose Computing Systems:
- Used for general purpose software: intended to run a fully general set of applications that may not be known at design time
- End-user programmable
- Faster (higher-performance) is always better
- In general, no real-time constraints
- Real-time performance may not be fully predictable (due to dynamic processor architectural features): superscalar: dynamic scheduling, hardware speculation, branch prediction, cache
- Heavy weight, multi-tasking OS: Windows, UNIX
- Minimizing code size is not an issue
- Higher power and cost constraints/requirements
- No application-specific capability required
Embedded Systems (and embedded processors):
- Run a single or few specialized applications, often known at system design time
- Not end-user programmable
- Usually must meet strict real-time constraints (e.g. real-time sampling rate). Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements.
- Real-time performance must be fully predictable: avoid dynamic processor architectural features that make real-time performance harder to predict
- Lightweight, often real-time OS or no OS
- Minimum code size is highly desirable
- Low power and cost constraints/requirements
- May require application-specific capability (e.g. DSP)

12 Evolution of GPPs and DSPs
General Purpose Processors (GPPs) trace their roots back to Eckert, Mauchly and Von Neumann (ENIAC + EDSAC: first generation processors).
Digital Signal Processors (DSPs) are microprocessors designed for efficient mathematical manipulation of digital signals utilizing digital signal processing algorithms.
- DSPs usually process infinite continuous sampled (digitized) data streams (i.e. physical signals) while meeting real-time and power constraints.
- DSPs evolved from Analog Signal Processors (ASPs) that utilize analog hardware to transform physical signals (classical electrical engineering).
- Analog signal processing gave way to DSP because:
  - DSP is insensitive to environment (e.g., same response in snow or desert, if it works at all)
  - DSP performance is identical even with variations in components; the behavior of 2 analog systems varies even if they are built with the same components with 1% variation
Different history and different application requirements led to different ISA design considerations, different terms, different metrics, different architectures, and some new inventions.
For Application-Specific Processors (ASPs): Application domain requirements -> ASP ISA -> ASP architecture

13 DSP vs. General Purpose CPUs
DSPs tend to run one (or few) program(s), not many programs.
- Hence OSes (if any) are much simpler; there is no virtual memory or protection, ...
DSPs usually run applications with hard real-time constraints (similar to other embedded processors):
- The DSP must meet the computational requirements of the application signal sampling rate. Once these real-time constraints are met, a faster DSP is overkill (higher DSP cost, power, ...) without additional benefit.
- You must account for anything that could happen in a time slot (DSP algorithm inner loop, data sampling rate).
- All possible interrupts or exceptions must be accounted for and their collective time subtracted from the time interval. Therefore, exceptions are BAD.
DSPs usually process infinite continuous data streams:
- Requires high memory bandwidth (with predictable latency, e.g. no data cache) for streaming real-time data samples, and predictable processing time on the data samples.
The design of DSP ISAs and processor architectures is driven by the requirements of DSP algorithms:
- DSP algorithms and DSP performance requirements -> DSP ISAs -> DSP architectures
- Thus DSPs are application-specific processors.

14 DSP vs. GPP
The "MIPS/MFLOPS" of DSPs is the speed of Multiply-Accumulate (MAC), i.e. the main performance measure of DSPs is MAC speed.
- MAC is common in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms.
- DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how many MACs are performed in each cycle.
The "SPEC" of DSPs is 4 algorithms:
- Infinite Impulse Response (IIR) filters
- Finite Impulse Response (FIR) filters
- FFT, and
- convolvers
In DSPs, target algorithms are important:
- Binary compatibility is not a major issue (unlike general purpose processors). Why? Since DSPs are application domain specific processors.
High-level software is not as important in DSPs as in GPPs.
- People still write in assembly language for a product, to minimize the die area for ROM in the DSP chip (code size) and to improve performance.
- Note: While this is still mostly true, programming DSPs in high level languages (HLLs) has been gaining more acceptance due to the development of more efficient HLL DSP compilers in recent years.

15 Types of DSP Processors (according to type of arithmetic/operand size supported)
32-bit floating point (5% of DSP market). Examples:
- TI TMS320C3X, TMS320C67xx (VLIW)
- AT&T DSP32C
- Analog Devices ADSP21xxx
- Hitachi SH-4
16-bit (or 24-bit) fixed point (95% of DSP market). Examples:
- TI TMS320C2X, TMS320C62xx (VLIW)
- Infineon TC1xxx (TriCore1) (VLIW)
- Motorola DSP568xx, MSC810x (VLIW)
- Analog Devices ADSP21xx
- Agere Systems DSP16xxx, Starpro2000
- LSI Logic LSI140x (ZSP400), superscalar
- Hitachi SH3-DSP
- StarCore SC110, SC140 (VLIW)

16 DSP Cores vs. Chips
DSPs are usually available as synthesizable cores or as off-the-shelf packaged chips.
Synthesizable cores (IP):
- Map into the chosen fabrication process: speed, power, and size vary
- Choice of peripherals, etc. (SoC = System On Chip)
- Requires extensive hardware development effort, resulting in more development time and cost (very high volume needed to justify development cost)
Off-the-shelf packaged chips:
- Highly optimized for speed, energy efficiency, and/or cost
- Lower development time/cost/effort
- Tools and 3rd-party support often more mature
- Faster time to market
- Limited performance and integration options
[Figure: SoC built around a DSP core with instruction memory, data memory, A/D converter, D/A converter and serial ports.]

17 DSP Architecture: Enabling Technologies
[Figure: enabling technologies across generations 1-4 of single-chip (microprocessor) DSPs, starting with the first microprocessor DSP, the TI TMS32010.]

18 Texas Instruments TMS320 Family: Multiple DSP µP Generations
[Figure: TMS320 family members across generations 1-4 of single-chip (microprocessor) DSPs; a VLIW label appears in the figure.]

19 DSP Applications
- Digital audio applications: MPEG audio, portable audio
- Digital cameras
- Cellular telephones
- Wearable medical appliances
- Storage products: disk drive servo control
- Military applications: radar, sonar
- Industrial control
- Seismic exploration
- Networking (telecom infrastructure): wireless, base station, cable modems, ADSL, VDSL, ...
Current DSP killer applications: cell phones and telecom infrastructure. HDTV? Other?

20 DSP Algorithms & Applications
[Figure not preserved in transcript.]

21 Another Look at DSP Applications
(Cost increases toward the high end; volume increases toward the low end.)
High-end:
- Military applications (e.g. radar/sonar)
- Wireless base station: TMS320C6000
- Cable modem
- Gateways
- HDTV ...
Mid-range:
- Industrial control
- Cellular phone: TMS320C540
- Fax/voice server ...
Low end:
- Storage products (hard drive controllers): TMS320C27
- Digital camera: TMS320C5000
- Portable phones
- Wireless headsets
- Consumer audio
- Automobiles, thermostats, ...

22 DSP Range of Applications & Possible Target DSPs
[Figure not preserved in transcript.]

23 Example DSP Application: Cellular Phone System
[Figure: block diagram with keypad, controller, speech encode, speech decode, physical layer processing, baseband converter (A/D, DAC), modem and RF blocks.]

24 Example DSP Application: Cellular Phone: HW/SW/IC Partitioning
[Figure: the cellular phone block diagram (keypad, controller, speech encode, speech decode, physical layer processing, baseband converter (A/D, DAC), modem, RF) partitioned among an analog IC, a DSP, an ASIC and a microcontroller.]

25 Example DSP Application: Mapping Onto System-on-Chip (SoC) (Cellular Phone)
[Figure: SoC containing a microcontroller or embedded processor (µC) with RAM, a DSP core with RAM, ASIC logic, and S/P and DMA blocks. Functions shown: phone book, protocol, keypad interface, control, speech quality enhancement, de-interleaver & decoder, voice recognition, RPE-LTP speech decoder, demodulator and synchronizer, Viterbi equalizer.]

26 Example DSP Application: Example Cellular Phone Organization
[Figure: phone organization built around a C540 (DSP) and an ARM7 (µC).]

27 Example DSP Application: Multimedia System-on-Chip (SoC)
Future chips will be a mix of processors, memory and dedicated hardware for specific algorithms and I/O.
[Figure: e.g. multimedia terminal electronics, containing µP, DSP, Coms, video unit, custom logic (ASIC, co-processor or ASP) and memory, with uplink radio, downlink radio, graphics out, video I/O, voice I/O and pen in.]

28 DSP Algorithm Format
DSP culture has a graphical format to represent formulas (i.e. DSP algorithms).
Like a flowchart for formulas and inner loops, not programs.
Some symbols seem natural: Σ is add, X is multiply.
Others are obtuse: z^-1 means take the variable from an earlier iteration (delay).
These graphs are trivial to decode.

29 DSP Algorithm Notation
Uses "flowchart" notation instead of equations.
- Multiply is drawn as a multiplier symbol or X
- Add is drawn as an adder symbol, + or Σ
- Delay/Storage is drawn as a box labeled Delay, z^-1 or D

30 Typical DSP Algorithm: Finite-Impulse Response (FIR) Filter
Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
Finite Impulse Response (FIR) filters compute:
  y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)
where
- x is the input sequence (signal samples)
- y is the output sequence
- h is the impulse response (i.e. filter coefficients)
- N is the number of taps (coefficients) in the filter
The output sequence depends only on the input sequence and the impulse response.
Each output is a vector dot product of N signal samples and N filter coefficients: Multiply Accumulate (MAC) operations.
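The FIR sum can be sketched directly as a nested loop. This is a minimal illustration rather than code from the lecture; the function name `fir_filter` and the convention that samples before the start of the input are zero are assumptions made here.

```python
def fir_filter(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].

    Assumption for this sketch: x[n] is treated as 0 for n < 0.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):          # one multiply-accumulate (MAC) per tap
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


# Feeding a unit impulse returns the filter coefficients themselves.
impulse_response = fir_filter([1, 0, 0, 0], [0.5, 0.25])
```

The inner loop is the vector dot product the slide refers to: N multiplies and N accumulates per output sample.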

31 Typical DSP Algorithms: Finite-Impulse Response (FIR) Filter
- The N most recent samples are kept in the delay line (Xi)
- A new sample moves data down the delay line
- A filter "tap" is a multiply-add
- Each tap (N taps total) nominally requires:
  - Two data fetches (requires real-time data sample streaming, predictable data bandwidth/latency; separate memory banks/buses?)
  - Multiply and accumulate (repetitive computations; requires efficient MAC support)
  - Memory write-back to update the delay line (special addressing modes, e.g. modulo)
Performance goal: at least 1 FIR tap per DSP instruction cycle
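A delay line updated with modulo addressing might look like the following sketch. The class name `FirDelayLine` and its layout (a `head` index pointing at the newest sample) are hypothetical choices for illustration; real DSPs do the modulo step in address-generation hardware rather than in software.

```python
class FirDelayLine:
    """FIR filter that keeps the N most recent samples in a circular buffer."""

    def __init__(self, h):
        self.h = list(h)
        self.buf = [0.0] * len(self.h)   # delay line, initially all zeros
        self.head = 0                    # index of the newest sample

    def step(self, sample):
        n = len(self.h)
        # Modulo addressing: overwrite the oldest slot instead of shifting data.
        self.head = (self.head - 1) % n
        self.buf[self.head] = sample
        acc = 0.0
        for k in range(n):
            # Tap k reads the sample that arrived k steps ago.
            acc += self.h[k] * self.buf[(self.head + k) % n]
        return acc
```

Because only one slot is written per new sample, the memory write-back per tap listed above is avoided; the cost moves into the address calculation.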

32 Finite-Impulse Response (FIR) Filter
[Figure: FIR filter signal-flow graph. Signal samples X arrive from the A/D and pass through a chain of delays; the delayed samples are multiplied by the filter coefficients h_0, h_1, ..., h_{N-2}, h_{N-1} and summed (MAC, accumulator register) to form the output Y, which goes to the D/A. One multiply-add stage is one FIR filter tap; the whole computation is a vector dot product.]
Performance goal: at least 1 FIR tap per DSP instruction cycle.
The DSP must meet the computational requirements of the application signal sampling rate: a faster DSP is overkill (more cost/power than really needed).

33 Sample Computational Rates for FIR Filtering (DSP performance requirements)
A 1-D FIR has n_op = 2N and a 2-D FIR has n_op = 2N^2.
OPs = Operations Per Second
[Table of FIR types (1-D, 2-D) and their computational rates not preserved in transcript; surviving entries: (4.37 GOPs), (23.3 GOPs).]
The DSP must meet the computational requirements of the application signal sampling rate: a faster DSP is overkill (higher DSP cost, power, ...).
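The operation counts translate into a required rate once a sampling rate is fixed. The sketch below assumes that n_op counts operations per output sample (one multiply and one add per tap); the function name and the example numbers are illustrative and are not taken from the lost table.

```python
def fir_ops_per_second(n, sample_rate_hz, dims=1):
    """Required operations per second for an FIR filter.

    Assumption: n_op = 2 * N**dims operations per output sample,
    i.e. 2N for a 1-D filter and 2N^2 for a 2-D filter.
    """
    ops_per_sample = 2 * n ** dims
    return ops_per_sample * sample_rate_hz


# Hypothetical example: a 64-tap 1-D filter on 48 kHz audio.
audio_rate = fir_ops_per_second(64, 48_000)
```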

34 FIR Filter on a (Simple) General Purpose Processor (GPP)
    loop: lw  x0, 0(r0)    ; fetch first operand
          lw  y0, 0(r1)    ; fetch second operand
          mul a, x0, y0    ; multiply
          add y0, a, b     ; add
          sw  y0, (r2)     ; store result
          inc r0           ; advance the three pointers
          inc r1
          inc r2
          dec ctr          ; loop control overhead
          tst ctr
          jnz loop
Problems:
- Bus / memory bandwidth bottleneck
- Control/loop code overhead
- No suitable addressing modes or instructions, e.g. a multiply and accumulate (MAC) instruction
- GPP real-time performance (needed to meet the signal sampling rate) may not be fully predictable, due to dynamic processor architectural features: superscalar: dynamic scheduling, hardware speculation, branch prediction, cache

35 Typical DSP Algorithms: Infinite-Impulse Response (IIR) Filter
Infinite Impulse Response (IIR) filters compute:
  [Equation not preserved in transcript; filter coefficients: a(k), b(k).]
The output sequence depends on the input sequence, previous outputs, and the impulse response (i.e. filter coefficients).
Both FIR and IIR filters:
- Require vector dot product (multiply-accumulate, MAC) operations
- Normally use fixed coefficients
Adaptive filters update their coefficients to minimize the distance between the filter output and the desired signal.
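A common way to write an IIR filter with coefficients a(k) and b(k) is y(n) = Σ b(k) x(n-k) + Σ a(k) y(n-k), with the feedback sum starting at k = 1. The slide's own equation is not available, so this form, its sign convention and the function name below are assumptions.

```python
def iir_filter(x, b, a):
    """IIR filter sketch: feed-forward taps b[k] on inputs, feedback taps a[k] on past outputs.

    Assumed convention: y[n] = sum_k b[k]*x[n-k] + sum_{k>=1} a[k]*y[n-k];
    a[0] is unused, and samples before n = 0 are taken as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):           # MACs over current and past inputs
            if n - k >= 0:
                acc += b[k] * x[n - k]
        for k in range(1, len(a)):        # MACs over previous outputs
            if n - k >= 0:
                acc += a[k] * y[n - k]
        y.append(acc)
    return y
```

With b = [1] and a = [0, 0.5], an impulse produces 1, 0.5, 0.25, ... and never reaches exactly zero, which is the "infinite" impulse response the name refers to.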

36 Typical DSP Algorithms: Discrete Fourier Transform (DFT)
The Discrete Fourier Transform (DFT) allows for spectral analysis in the frequency domain. It is computed as
  y(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi n k / N}
for k = 0, 1, …, N-1, where
- x is the input sequence in the time domain
- y is an output sequence in the frequency domain
The Inverse Discrete Fourier Transform is computed as
  x(n) = \frac{1}{N} \sum_{k=0}^{N-1} y(k)\, e^{j 2\pi n k / N}
Both are sums of products, i.e. MAC operations.
The Fast Fourier Transform (FFT) provides an efficient method for computing the DFT.
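The DFT definition maps onto N complex multiply-accumulates per output point. The sketch below uses the textbook form with a 1/N factor on the inverse, which is an assumption about the slide's normalization; it is the direct O(N^2) computation, not an FFT.

```python
import cmath


def dft(x):
    """Direct DFT: y[k] = sum_n x[n] * exp(-2j*pi*n*k/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def idft(y):
    """Inverse DFT with the 1/N factor applied on the inverse side (assumed convention)."""
    n_pts = len(y)
    return [
        sum(y[k] * cmath.exp(2j * cmath.pi * n * k / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]
```

Applying `idft(dft(x))` recovers `x` up to floating-point rounding, which is a quick consistency check on the two formulas.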

37 Typical DSP Algorithms: Discrete Cosine Transform (DCT)
The Discrete Cosine Transform (DCT) is frequently used in image & video compression (e.g. JPEG, MPEG-2).
The DCT and Inverse DCT (IDCT) are computed as:
  [Equations not preserved in transcript.]
where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
An N-point 1-D DCT requires N^2 MAC operations.
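One DCT/IDCT pair consistent with the stated e(k) is y(k) = e(k) Σ x(n) cos((2n+1)πk / 2N) with inverse x(n) = (2/N) Σ e(k) y(k) cos((2n+1)πk / 2N). The scale factors are an assumption, since the slide's equations are missing; the sketch shows where the N^2 MAC operations come from.

```python
import math


def _e(k):
    # Scale factor from the slide: 1/sqrt(2) for k = 0, otherwise 1.
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """N-point 1-D DCT sketch: N outputs, N MACs each, so N^2 MACs in total."""
    n_pts = len(x)
    return [
        _e(k) * sum(x[n] * math.cos((2 * n + 1) * math.pi * k / (2 * n_pts))
                    for n in range(n_pts))
        for k in range(n_pts)
    ]


def idct(y):
    """Inverse of dct() above, assuming the 2/N factor sits on the inverse."""
    n_pts = len(y)
    return [
        (2 / n_pts) * sum(_e(k) * y[k] * math.cos((2 * n + 1) * math.pi * k / (2 * n_pts))
                          for k in range(n_pts))
        for n in range(n_pts)
    ]
```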

38 DSP Benchmarks
DSPstone: University of Aachen, application benchmarks
- ADPCM TRANSCODER - CCITT G.721, REAL_UPDATE, COMPLEX_UPDATES
- DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
- FIR, FIR2DIM, HR_ONE_BIQUAD
- LMS, FFT_INPUT_SCALED
BDTImark2000: Berkeley Design Technology Inc. (BDTI)
- 12 DSP kernels in hand-optimized assembly language: FIR, IIR, vector dot product, vector add, vector maximum, FFT, ...
- Returns a single number (higher means faster) per processor
- Uses only on-chip memory (memory bandwidth is the major bottleneck in performance of embedded applications)
EEMBC (pronounced "embassy"): EDN Embedded Microprocessor Benchmark Consortium
- Consortium of 30 companies, formed by Electrical Design News (EDN)
- Benchmark evaluates compiled C code on a variety of embedded processors (microcontrollers, DSPs, etc.)
- Application domains: automotive-industrial, consumer, office automation, networking and telecommunications

39 [Figure: DSP performance from the 1st through the 4th generation; the 4th generation is > 800x faster than the first generation.]
DSPs from generations 2, 3 and 4 are in use today. Why?

40 Basic DSP ISA/Architectural Features
(Specialized DSP algorithm/application requirements -> DSP ISAs -> DSP architectures)
Data path configured for DSP algorithms (DSP ISA feature / DSP architectural feature):
- Fixed-point arithmetic (most DSPs)
  - Saturation to handle overflow (rather than modulo arithmetic)
- MAC: Multiply-accumulate unit(s)
- Hardware rounding support
Multiple memory banks and buses (DSP architectural feature):
- Harvard architecture
- Multiple data memories/buses
- Usually with no data cache, for predictable fast data sample streaming
Specialized addressing modes (DSP ISA feature):
- Bit-reversed addressing
- Circular buffers
- Dedicated address generation units are usually used
Specialized instruction set and execution control (DSP architectural features):
- Zero-overhead loops
- Support for fast MAC
- Fast interrupt handling (to meet real-time signal sampling/processing constraints)
Specialized peripherals for DSP (System on Chip, SoC, style)

41 DSP Data Path: Arithmetic (DSP ISA feature)
DSPs deal with numbers representing real world signals => want "reals"/fractions
DSPs deal with numbers for addresses => want integers
Thus the DSP ISA (and DSP) must support "fixed point" as well as integers:
- Fixed point (sign bit S, radix point immediately after the sign bit): -1 ≤ x < 1
- Integer (sign bit S, radix point after the least significant bit): -2^(N-1) ≤ x < 2^(N-1)
Usually 16-bit fixed-point.
In DSP ISAs: fixed-point arithmetic must be supported; floating-point support is optional and is much less common.
- Most common: fixed point (16-bit or 24-bit) + integer arithmetic
- Much less common: single precision floating-point support
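The 16-bit fractional format with range -1 ≤ x < 1 is often called Q15 (a name not used on the slide). The helpers below are a hypothetical sketch of converting to and from that format and of a fractional multiply.

```python
Q = 15  # number of fraction bits in a 16-bit word (Q15 is an assumed name here)


def to_q15(value):
    """Convert a real number to a 16-bit fixed-point word, clamping to the representable range."""
    scaled = int(round(value * (1 << Q)))
    return max(-(1 << Q), min((1 << Q) - 1, scaled))


def from_q15(word):
    """Interpret a 16-bit fixed-point word as a fraction in [-1, 1)."""
    return word / (1 << Q)


def q15_mul(a, b):
    """Fractional multiply: a 16x16 product has 30 fraction bits, so shift back by 15.

    The shift truncates; the corner case -1 * -1 would overflow and needs saturation.
    """
    return (a * b) >> Q
```

For example, 0.5 is stored as 16384, and multiplying it by itself yields 8192, which reads back as 0.25. The value 1.0 is not representable and clamps to 32767.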

42 DSP Data Path: Precision (DSP ISA features)
Word size affects the precision of fixed-point numbers.
DSPs have 16-bit, 20-bit, or 24-bit data words (16-bit most common).
Floating-point DSPs cost 2X - 4X as much as fixed-point DSPs and are slower than fixed point.
DSP programmers will scale values inside code:
- SW libraries
- Separate explicit exponent
"Blocked floating point": a single exponent for a group of fractions.
Floating-point support simplifies development for high-end DSP applications.
In DSP ISAs: fixed-point arithmetic must be supported (16-bit fixed point most common); floating-point (single precision) support is optional and is much less common.

43 DSP Data Path: Overflow Handling (DSP ISA feature: saturation)
Why support saturation? DSPs are descended from analog signal processors, and saturation follows from the physical nature of signals:
- Instead of modulo (wrap-around) arithmetic, an overflowing result is set to the most positive value (2^(N-1) - 1) or the most negative value (-2^(N-1)): "saturation"
Many DSP algorithms were developed in this model.
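The difference between saturating and wrap-around overflow can be seen in a short sketch. Both function names are illustrative; `wrapping_add` models ordinary two's-complement behavior for comparison.

```python
def saturating_add(a, b, bits=16):
    """Add two signed values and clamp the result to the N-bit range."""
    hi = (1 << (bits - 1)) - 1       # most positive value, 2^(N-1) - 1
    lo = -(1 << (bits - 1))          # most negative value, -2^(N-1)
    return max(lo, min(hi, a + b))


def wrapping_add(a, b, bits=16):
    """Add two signed values with two's-complement wrap-around (modulo 2^N)."""
    s = (a + b) & ((1 << bits) - 1)
    return s - (1 << bits) if s & (1 << (bits - 1)) else s
```

Adding 30000 and 10000 in 16 bits saturates to 32767, whereas wrap-around yields -25536: a loud positive signal would suddenly become a large negative one.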

44 DSP Data Path: Specialized Hardware (DSP architectural features)
Fast specialized hardware functional units perform all key arithmetic operations in 1 cycle (i.e. the common operations must be optimized, to help meet real-time constraints), including:
- Shifters
- Saturation
- Guard bits
- Rounding modes
- Multiplication/addition (MAC)
50% of instructions can involve the multiplier => single cycle latency multiplier
Need to perform multiply-accumulate (MAC) fast.
n-bit multiplier => 2n-bit product

45 DSP Data Path: Multiply Accumulate (MAC) Unit
Don't want overflow, or to have to scale the accumulator.
Option 1: accumulator wider than the product: "guard bits"
- Motorola DSP: 24b x 24b => 48b product, 56b accumulator
Option 2: shift right and round the product before the adder
[Figure: two MAC unit datapaths, each with a multiplier feeding an ALU (add) and an accumulator; one with guard bits (G) on the accumulator, one with a shifter between the multiplier and the ALU. A DSP has one or more MAC units.]
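The effect of guard bits can be checked numerically: with 8 guard bits on a 48-bit product, 256 worst-case products still fit. The sketch below is an illustration with assumed names, not a model of any particular Motorola part beyond the word sizes quoted above.

```python
def mac_accumulate(samples, coeffs, word_bits=24, guard_bits=8):
    """Sum products exactly, then report whether the sum fits the accumulator.

    The accumulator is modeled as 2*word_bits + guard_bits wide
    (e.g. 48-bit product + 8 guard bits = 56 bits).
    Returns (sum, fits_in_accumulator).
    """
    acc_bits = 2 * word_bits + guard_bits
    acc = 0
    for x, h in zip(samples, coeffs):
        acc += x * h                         # one MAC per sample/coefficient pair
    hi = (1 << (acc_bits - 1)) - 1
    lo = -(1 << (acc_bits - 1))
    return acc, lo <= acc <= hi


most_negative = -(1 << 23)                   # worst-case 24-bit operand
_, fits_without_guard = mac_accumulate([most_negative] * 2, [most_negative] * 2, guard_bits=0)
_, fits_with_guard = mac_accumulate([most_negative] * 256, [most_negative] * 256)
```

Without guard bits, two worst-case products already overflow a 48-bit accumulator; with 8 guard bits, 256 of them fit.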

46 DSP Data Path: Rounding Modes
Even with guard bits, the accumulator must be rounded when it is stored into memory.
3 DSP standard options (supported in hardware, not in software as in GPPs):
1. Truncation: chop results => biases results
2. Round to nearest: < 1/2 round down, >= 1/2 round up (more positive) => smaller bias
3. Convergent: < 1/2 round down, > 1/2 round up (more positive), = 1/2 round to make the lsb a zero (+1 if 1, +0 if 0) => no bias. IEEE 754 calls this round to nearest even.
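The three options differ only in how the discarded low bits are handled. The sketch below drops `shift` low-order bits from an integer accumulator value; the function names are illustrative.

```python
def truncate(value, shift):
    """Chop the low bits (always rounds toward more negative values)."""
    return value >> shift


def round_nearest(value, shift):
    """Round to nearest, with exactly 1/2 rounded up (more positive)."""
    return (value + (1 << (shift - 1))) >> shift


def round_convergent(value, shift):
    """Round to nearest, with exactly 1/2 rounded so the result's lsb is zero."""
    half = 1 << (shift - 1)
    quotient = value >> shift
    remainder = value & ((1 << shift) - 1)   # discarded bits, always non-negative
    if remainder > half or (remainder == half and quotient & 1):
        quotient += 1
    return quotient


# Values 0.5, 1.5, 2.5, 3.5 stored with one fraction bit; their exact sum is 8.
halves = [1, 3, 5, 7]
sums = {
    "truncate": sum(truncate(v, 1) for v in halves),
    "nearest": sum(round_nearest(v, 1) for v in halves),
    "convergent": sum(round_convergent(v, 1) for v in halves),
}
```

On these tie cases truncation sums low, round-to-nearest sums high, and convergent rounding matches the exact total, which is the sense in which it has no bias.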

47 Data Path Comparison
DSP Processor:
- Specialized hardware performs all key arithmetic operations in 1 cycle (e.g. MAC)
- Hardware support for managing numeric fidelity:
  - Shifters
  - Guard bits
  - Saturation
  - Rounding modes
General-Purpose Processor:
- Multiplies often take > 1 cycle
- Shifts often take > 1 cycle
- Other operations (e.g. saturation, rounding) typically take multiple cycles

48 TI 320C54x DSP (1995) Functional Block Diagram
[Figure: functional block diagram, annotated with: hardware support for rounding/saturation, MAC unit, multiple memory banks and buses.]

49 First Commercial DSP (1982): Texas Instruments TMS32010 (i.e. single-chip DSP, a microprocessor)
- 16-bit fixed-point arithmetic
- Introduced at 5 MHz (200 ns) instruction cycle
- "Harvard architecture": separate instruction and data memories
- Accumulator
- Specialized instruction set: Load and Accumulate
- Two-cycle (400 ns) Multiply-Accumulate (MAC) time
[Figure: processor with separate instruction memory and data memory. Datapath (i.e. MAC unit): memory, T-Register, multiplier, P-Register, ALU, accumulator.]

50 First Generation DSP µP: Texas Instruments TMS32010 - 1982
Features:
- 200 ns instruction cycle (5 MIPS)
- 144 words (16 bit) on-chip data RAM
- 1.5K words (16 bit) on-chip program ROM - TMS32010
- External program memory expansion to a total of 4K words at full speed
- 16-bit instruction/data word
- Single cycle 32-bit ALU/accumulator
- Single cycle 16 x 16-bit multiply in 200 ns
- Two cycle MAC (5 MOPS)
- Zero to 15-bit barrel shifter
- Eight input and eight output channels

51 First Generation DSP µP: TI TMS32010 Block Diagram
[Figure: block diagram showing the program memory (ROM/EPROM), the data/samples memory, the barrel shifter (1 cycle) and the MAC unit.]

52 TMS32010 FIR Filter Code
Here X4, H4, ... are direct (absolute, i.e. physical) memory addresses:
    LT  X4    ; Load T with x(n-4)
    MPY H4    ; P = H4*X4
    LTD X3    ; Load T with x(n-3); x(n-4) = x(n-3); Acc = Acc + P
    MPY H3    ; P = H3*X3
    LTD X2
    MPY H2
    ...
LTD is the "load and accumulate" instruction.
Two instructions per tap, but requires unrolling.
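The register traffic in the listing can be followed with a small software model of the T register, P register and accumulator. This is a hypothetical emulation written for illustration: the function name is invented, the accumulator is assumed to start at zero, and the final "add P to the accumulator" step stands in for whatever instruction closes the elided sequence.

```python
def tms32010_style_fir(x, h):
    """Emulate the LT / MPY / LTD pattern for one output sample.

    x[k] holds x(n-k) and h[k] the matching coefficient.
    Returns (accumulator, delay line after the shift performed by LTD).
    """
    mem = list(x)
    last = len(h) - 1
    acc = 0                      # assumed cleared before the sequence
    t = mem[last]                # LT  X_last : load T with the oldest sample
    p = h[last] * t              # MPY H_last : P = H * T
    for k in range(last - 1, -1, -1):
        t = mem[k]               # LTD X_k : load T ...
        mem[k + 1] = mem[k]      #           ... shift the sample down the delay line ...
        acc += p                 #           ... and accumulate the previous product
        p = h[k] * t             # MPY H_k
    acc += p                     # final accumulate of the last product (assumed step)
    return acc, mem
```

Each tap costs one load-type instruction and one multiply, matching the "two instructions per tap" note, and the delay line is shifted as a side effect of LTD rather than by separate moves.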

53 DSP Memory (DSP architectural features)
An FIR tap implies multiple memory accesses.
DSPs require multiple data ports.
Some DSPs have ad hoc techniques to reduce memory bandwidth demand:
- Instruction repeat buffer: do 1 instruction 256 times
- Often disables interrupts, thereby increasing interrupt response time
Some recent DSPs have instruction caches:
- Even then, they may allow the programmer to "lock in" instructions into the cache
- Option to turn the cache into fast program memory
Usually DSPs have no data caches. Why? For better real-time performance predictability.
Separate memories for data and program.
May have multiple data memories, e.g. one for signal data samples and one for filter coefficients.

54 CMPE750 - Shaaban #54 lec # 8 Spring 2015 4-2-2015 Conventional “Von Neumann’’ memory AKA unified or Princeton memory architecture

55 CMPE750 - Shaaban #55 lec # 8 Spring 2015 4-2-2015 HARVARD MEMORY ARCHITECTURE in DSP PROGRAM MEMORY X MEMORY Y MEMORY GLOBAL P DATA X DATA Y DATA Multiple memory banks and buses Data Memory Banks (SRAM) e.g one for signal data samples and one for filter coefficients ROM/EPROM/ FLASH? (i.e. split memory)

56 CMPE750 - Shaaban #56 lec # 8 Spring 2015 4-2-2015 DSP Processor Harvard architecture (split) 2-4 memory accesses/cycle No caches: on-chip SRAM General-Purpose Processor Von Neumann architecture Typically 1 access/cycle Use caches Processor Program Memory Data Memory Processor Memory Memory Architecture Comparison Makes real-time performance harder to predict For real-time performance predictability i.e. unified memory but not L1-cache (split)

57 CMPE750 - Shaaban #57 lec # 8 Spring 2015 4-2-2015 TI TMS320C3x MEMORY BLOCK DIAGRAM - Harvard Architecture Multiple memory banks and buses Instruction Cache Data Program Multiple memory banks and buses

58 CMPE750 - Shaaban #58 lec # 8 Spring 2015 4-2-2015 TI 320C62x/67x DSP TI 320C62x/67x DSP (1997) – (Fourth Generation DSP) Program Data

59 CMPE750 - Shaaban #59 lec # 8 Spring 2015 4-2-2015 DSP Addressing Modes Have standard addressing modes: immediate, displacement, register indirect Want to keep MAC datapath busy. Assumption: any extra instructions imply additional clock cycles of overhead in inner loop and larger code size => Thus complex addressing is good Autoincrement/Autodecrement register indirect –lw r1,0(r2)+ => r1 <- M[r2]; r2<-r2+1 –Option to do it before addressing, positive or negative “bit reverse” address addressing mode. “modulo” or “circular” addressing => Don’t use normal datapath integer unit to calculate complex addressing modes: –Instead use dedicated address generation units. To match data access patterns in DSP algorithms and reduce number of instructions (code size) DSP ISA Features Related DSP Architectural Feature Complex & Specialized Examples: Why?

60 CMPE750 - Shaaban #60 lec # 8 Spring 2015 4-2-2015 DSP Addressing: FFT FFTs start or end with data in butterfly (bit-reversed) order 0 (000)=> 0 (000) 1 (001)=> 4 (100) 2 (010)=> 2 (010) 3 (011)=> 6 (110) 4 (100)=> 1 (001) 5 (101)=> 5 (101) 6 (110)=> 3 (011) 7 (111)=> 7 (111) How to avoid overhead of address checking instructions for FFT? Have an optional “bit reverse” addressing mode for use with autoincrement addressing Thus most DSPs have “bit reverse” addressing for radix-2 FFT Bit Reversed Addressing DSP ISA Features
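
The index mapping in the table above can be reproduced with a short Python sketch. The function names are hypothetical. `bit_reverse` computes the mapping directly, while `next_bit_reversed` steps through the same sequence by propagating the carry from the most significant bit downward, which is one common way to describe how an address generator could produce these addresses without extra instructions. It is offered as an illustration, not as the mechanism of any particular DSP.

```python
def bit_reverse(index, bits):
    """Return index with its low `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def next_bit_reversed(addr, bits):
    """Advance to the next address in bit-reversed order (reverse-carry add)."""
    mask = 1 << (bits - 1)
    # Clear leading ones from the top bit down, like a carry moving rightward.
    while mask and (addr & mask):
        addr ^= mask
        mask >>= 1
    return addr | mask


table = [bit_reverse(i, 3) for i in range(8)]   # the slide's 8-point mapping

sequence = [0]
for _ in range(7):
    sequence.append(next_bit_reversed(sequence[-1], 3))
```

Both `table` and `sequence` follow the order 0, 4, 2, 6, 1, 5, 3, 7 listed on the slide.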

61 CMPE750 - Shaaban #61 lec # 8 Spring 2015 4-2-2015 Bit Reversed Addressing Bit Reversed Addressing Data flow in the radix-2 decimation-in-time FFT algorithm DSP ISA Features

62 CMPE750 - Shaaban #62 lec # 8 Spring 2015 4-2-2015 DSP Addressing: Circular Buffers DSPs dealing with continuous I/O Often interact with an I/O buffer (delay lines) To save memory, buffers often organized as circular buffers What can we do to avoid overhead of address checking instructions for circular buffer? Option 1: Keep start register and end register per address register for use with autoincrement addressing, reset to start when reach end of buffer Option 2: Keep a buffer length register, assuming buffer starts on aligned address, reset to start when reach end Every DSP has “modulo” or “circular” addressing Circular Buffer addressing Sampled signal
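
The two options above can be sketched as address-update functions in Python. The names are hypothetical and only positive steps no larger than the buffer are handled in Option 1. For Option 2 the sketch assumes the buffer base is aligned to the smallest power of two that is at least the buffer length, so the base can be recovered by masking the low address bits. That alignment rule is an assumption for illustration and real DSPs differ in the details.

```python
def next_addr_start_end(addr, step, start, end):
    """Option 1: per-pointer start and end registers (end is inclusive)."""
    addr += step
    if addr > end:
        # Wrap back into the buffer, keeping any overshoot past the end.
        addr = start + (addr - end - 1)
    return addr


def next_addr_length(addr, step, length):
    """Option 2: a single length register and an aligned base address."""
    # Assumed alignment: base is a multiple of the next power of two >= length.
    size = 1 << (length - 1).bit_length()
    base = addr & ~(size - 1)
    return base + ((addr - base + step) % length)


# A 5-entry delay line at addresses 8..12 (base 8 is aligned to 8).
addr = 8
visited = []
for _ in range(7):
    visited.append(addr)
    addr = next_addr_length(addr, 1, 5)
```

In hardware this update happens inside the address generation unit as part of the autoincrement, so the inner loop needs no compare or branch to wrap the pointer.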

63 CMPE750 - Shaaban #63 lec # 8 Spring 2015 4-2-2015 Circular Buffers Addressing Support Every DSP has “modulo” or “circular” addressing mode Instructions accommodate three elements: Buffer address Buffer size Increment Allows for cycling through: delay elements (signal samples) Filter coefficients in data memory Why? DSP ISA Features Or other DSP algorithm coefficients e.g. from A/D e.g. to D/A

64 CMPE750 - Shaaban #64 lec # 8 Spring 2015 4-2-2015 Address calculation for DSPs Dedicated address generation units Supports modulo and bit reversal arithmetic Often duplicated to calculate multiple addresses per cycle DSP Architectural Features DSP Do not use normal integer unit

65 CMPE750 - Shaaban #65 lec # 8 Spring 2015 4-2-2015 Addressing Comparison DSP Processor Dedicated address generation units Specialized addressing modes; e.g.: –Autoincrement –Modulo (circular) –Bit-reversed (for FFT) Good immediate data support General-Purpose Processor Often, no separate address generation units General-purpose addressing modes DSP ISA Feature GPP ISA Feature Number minimized In RISC ISAs DSP Architectural Feature

66 CMPE750 - Shaaban #66 lec # 8 Spring 2015 4-2-2015 DSP Instructions and Execution DSP Instructions and Execution May specify multiple operations in a single complex instruction: –e.g. A compound instruction may perform: multiply + add + load + modify address register Must support Multiply-Accumulate (MAC) Need parallel move support Usually have special loop support to reduce branch overhead –Loop an instruction or sequence –0 value in register usually means loop maximum number of times –Must be sure if calculate loop count that 0 does not mean 0 May have saturating shift left arithmetic May have conditional execution to reduce branches DSP ISA Features To reduce number of instructions and reduce code size In 4 th generation VLIW DSPs Reduce loop overhead

67 CMPE750 - Shaaban #67 lec # 8 Spring 2015 4-2-2015 DSP Low/Zero Overhead Loops In ADSP 2100: “DO X UNTIL condition”, where X is the address of the last instruction in the loop. Address Generation: PCS = PC + 1; if (PC == X && !condition) PC = PCS; else PC = PC + 1. Eliminates a few instructions in loops - Important in loops with small bodies Example FIR inner loop on TI TMS320C54xx: DSP ISA Features Number of filter taps Lowers loop overhead Repeat Examples
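
The program-counter rule on this slide can be traced with a small Python simulation. This is a sketch under stated assumptions: the DO instruction sits at address 0, the loop body follows it immediately, and the termination condition is modeled as an iteration counter reaching zero. The function names are hypothetical.

```python
def run_do_until(body_len, iterations):
    """Trace the addresses fetched inside a hardware DO UNTIL loop."""
    pc = 0                      # address of the DO instruction (assumed)
    pcs = pc + 1                # PCS = PC + 1: saved loop start address
    x = pcs + body_len - 1      # X: address of the last loop instruction
    pc = pcs
    remaining = iterations
    trace = []
    while True:
        trace.append(pc)
        if pc == x:
            remaining -= 1
            condition = remaining == 0   # assumed: counter-expired condition
            if not condition:
                pc = pcs                 # PC = PCS: jump back, no branch fetched
                continue
            break                        # condition met: fall through past X
        pc += 1
    return trace


def software_loop_fetches(body_len, iterations, overhead=2):
    """Fetch count when each pass also runs `overhead` loop-control instructions."""
    return iterations * (body_len + overhead)


hw = len(run_do_until(body_len=2, iterations=3))      # body instructions only
sw = software_loop_fetches(body_len=2, iterations=3)  # body plus decrement/branch
```

With a two-instruction body, the hardware loop fetches half as many instructions as the software loop in this model, which is why the saving matters most for small loop bodies.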

68 CMPE750 - Shaaban #68 lec # 8 Spring 2015 4-2-2015 Instruction Set (ISA) Comparison DSP Processor Specialized, complex instructions (e.g. MAC) Multiple operations per instruction Zero or reduced overhead loops. General-Purpose Processor General-purpose instructions Typically only one operation per instruction No zero or reduced overhead loops support Less complex ISA DSP code: mac x0,y0,a x:(r0)+,x0 y:(r4)+,y0 Code Size = 16 bits (Smaller Code Size) GPP code: mov *r0,x0 mov *r1,y0 mpy x0,y0,a add a,b mov y0,*r2 inc r0 inc r1 Code Size = 7 x 32 = 224 bits (14X) (Larger Code Size) The above is in addition to the addressing mode differences identified earlier (slide 65)

69 CMPE750 - Shaaban #69 lec # 8 Spring 2015 4-2-2015 Specialized Peripherals for DSPs Synchronous serial ports Parallel ports Timers On-chip A/D, D/A converters Co-processors. ASIC Micro-controller …. Program/data memory and busses Component /system interconnects Host ports Bit I/O ports On-chip DMA controller Clock generators On-chip peripherals often designed for “background” operation, even when DSP core is powered down. Instruction Memory Data Memory A/D Converter D/A Converter Serial Ports DSP Core System on Chip (SoC) Approach DSP Architectural Features Heavy integration of peripherals/components to reduce cost (chip count)/power SOC

70 CMPE750 - Shaaban #70 lec # 8 Spring 2015 4-2-2015 TI TMS320C203/LC203 Block Diagram DSP Core Approach - 1995 Integrated DSP Peripherals Data Program

71 CMPE750 - Shaaban #71 lec # 8 Spring 2015 4-2-2015 Summary of Architectural Features of DSPs Data path configured for DSP –Fixed-point arithmetic –Fast MAC- Multiply-accumulate Multiple memory banks and buses - –Harvard Architecture –Multiple data memories –Dedicated address generation units Specialized addressing modes – Bit-reversed addressing – Circular buffers Specialized instruction set and execution control –Zero-overhead loops –Support for MAC Specialized peripherals for DSP (SoC) THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE DESIGN. (or algorithm driven, DSP algorithms in this case) Avoiding dynamic processor architectural features that make real- time performance harder to predict (e.g dynamic scheduling, hardware speculation, branch prediction, cache). Why? To achieve predictable real-time performance Most common 95% of all DSPs DSP ISA Features DSP ISA Feature DSP Architectural Features DSP Architectural Feature DSP Architectural Feature +

72 CMPE750 - Shaaban #72 lec # 8 Spring 2015 4-2-2015 DSP Software Development Considerations Different from general-purpose software development: –Resource-hungry, complex algorithms. –Specialized and/or complex processor architectures. –Severe cost/storage limitations. –Hard real-time constraints. –Optimization is essential. –Increased testing challenges. Essential tools: –Assembler, linker. –Instruction set simulator. –HLL Code generation: C compiler. –Debugging and profiling tools. Increasingly important: –DSP Software libraries (hand optimized). –Real-time operating systems. Program in DSP Assembly ? HLL/tools becoming more mature/ gaining popularity Was most common (for performance) but changing Requirements Thus

73 CMPE750 - Shaaban #73 lec # 8 Spring 2015 4-2-2015 Classification of Current DSP Architectures Modern Conventional DSPs: –Similar to the original DSPs of the early 1980s –Single instruction/cycle. Example: TI TMS320C54x –Complex instructions/Not compiler friendly Enhanced Conventional DSPs: –Add parallel execution units: SIMD operation –Complex, compound instructions. –Example: TI TMS320C55x –Not compiler friendly Multiple-Issue DSPs: –VLIW Example: TI TMS320C62xx, TMS320C64xx Simpler (RISC-like, fixed-width) instructions than conventional DSPs, more instructions and instruction bandwidth needed More compiler friendly - Higher cost/power SIMD instructions support added to recent DSPs of this class –Superscalar, Example: LSI Logic ZSP400, ZSP500 Second Generation Third Generation Fourth Generation Usually one MAC unit Usually more than one MAC unit DSPs from all these three generations are still available today. Why? Lower Cost/ Power Higher Cost/ Power Performance Late 1980’s - Early 1990’s - Late 1990’s - > 1 MAC Unit Larger code size

74 CMPE750 - Shaaban #74 lec # 8 Spring 2015 4-2-2015 A Conventional DSP: TI TMS320C54xx 16-bit fixed-point DSP. Issues one 16-bit instruction/cycle Modified Harvard memory architecture Peripherals typical of conventional DSPs: –2-3 synch. serial ports, parallel port –Bit I/O, Timer, DMA Inexpensive (100 MHz ~$5 qty 10K). Low power (60 mW @ 1.8V, 100 MHz). Second Generation DSP ~ 1989 Has one MAC unit

75 CMPE750 - Shaaban #75 lec # 8 Spring 2015 4-2-2015 A Current Conventional DSP: TI TMS320C54xx One MAC Unit Second Generation DSP

76 CMPE750 - Shaaban #76 lec # 8 Spring 2015 4-2-2015 An Enhanced Conventional DSP: TI TMS320C55xx The TMS320C55xx is based on Texas Instruments' earlier TMS320C54xx family (2nd generation DSP), but adds significant enhancements to the architecture and instruction set, including: –Two instructions/cycle Instructions are scheduled for parallel execution by the assembly programmer or compiler (limited VLIW?). –Two MAC units. Even more complex, compound instructions: –Assembly source code compatible with C54xx –Mixed-width instructions: 8 to 48 bits. –200 MHz @ 1.5 V, ~130 mW, $17 qty 10k Poor compiler target. Third Generation DSP ~ 1994

77 CMPE750 - Shaaban #77 lec # 8 Spring 2015 4-2-2015 An Enhanced Conventional DSP: TI TMS320C55xx 2 MAC Units Third Generation DSP

78 CMPE750 - Shaaban #78 lec # 8 Spring 2015 4-2-2015 16-bit Fixed-Point 8-way VLIW DSP: TI TMS320C6201 Revision 2 (1997) C6201 CPU Megamodule Data Path 1 D1 M1 S1 L1 A Register File Data Path 2 L2 S2 M2 D2 B Register File Instruction Dispatch Program Fetch Interrupts Control Registers Control Logic Emulation Test Ext. Memory Interface 4-DMA Program Cache / Program Memory 32-bit address, 256-bit data, 512K bits RAM Host Port Interface 2 Timers 2 Multi-channel buffered serial ports (T1/E1) Data Memory 32-bit address, 8-, 16-, 32-bit data, 512K bits RAM Pwr Dwn Instruction Decode The TMS320C62xx is the first fixed-point DSP processor from Texas Instruments that is based on a VLIW-like architecture which allows it to execute up to eight 32-bit RISC-like instructions per clock cycle. TMS320C67xx Floating Point version More compiler friendly Higher cost/power SIMD instructions support added to recent DSPs of this class (TMS320C64xx) Example Fourth Generation DSP Multiple-Issue DSPs

79 CMPE750 - Shaaban #79 lec # 8 Spring 2015 4-2-2015 TI TMS320 C62xx Internal Memory Architecture Separate Internal Program and Data Spaces Program –16K 32-bit instructions (2K Fetch Packets) –256-bit Fetch Width –Configurable as either Direct Mapped Cache, Memory Mapped Program Memory Data –32K x 16 –Single Ported Accessible by Both CPU Data Buses –4 x 8K 16-bit Banks 2 Possible Simultaneous Memory Accesses (4 Banks) 4-Way Interleave, Banks and Interleave Minimize Access Conflicts 4 Banks
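
The effect of the 4-way interleaved data banks can be illustrated with a short Python sketch. It assumes simple low-order interleaving, where consecutive 16-bit halfwords fall in consecutive banks, and it treats two simultaneous accesses as conflicting whenever they map to the same single-ported bank. The slide does not spell out the exact mapping, so both points are assumptions, and the function names are hypothetical.

```python
NUM_BANKS = 4          # from the slide: 4 x 8K 16-bit banks
BYTES_PER_WORD = 2     # each bank entry is 16 bits wide


def bank_of(byte_addr):
    """Bank index under assumed low-order interleaving of 16-bit halfwords."""
    return (byte_addr // BYTES_PER_WORD) % NUM_BANKS


def conflicts(byte_addr_a, byte_addr_b):
    """True if two same-cycle accesses would hit the same single-ported bank."""
    return bank_of(byte_addr_a) == bank_of(byte_addr_b)


# Neighbouring halfwords land in different banks, so they can be read together.
adjacent_ok = not conflicts(0, 2)
# Halfwords 8 bytes apart share a bank in this model and would serialize.
stride_8_conflict = conflicts(0, 8)
```

Under this model, placing the two operand streams of a MAC loop so that they fall in different banks on each cycle lets both data buses be used every cycle.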

80 CMPE750 - Shaaban #80 lec # 8 Spring 2015 4-2-2015 TI TMS320 C62xx Datapaths Cross Paths 40-bit Write Paths (8 MSBs) 40-bit Read Paths/Store Paths DDATA_I2 (load data) D2 DS1S2 M1 DS1S2 D1 DS1S2 DDATA_O2 (store data) DADR2 (address) DADR1 (address) DDATA_I1 (load data) DDATA_O1 (store data) 2X1X L 1L 1S1 S2 DL SL DDL S2S1 D M2L2S2 D DL SL DDL S2S1 S2 D S1 Registers B0 - B15Registers A0 - A15 Fourth Generation DSP Example 8-way VLIW

81 CMPE750 - Shaaban #81 lec # 8 Spring 2015 4-2-2015 TI TMS320 C62xx Functional Units L-Unit (L1, L2) –40-bit Integer ALU, Comparisons –Bit Counting, Normalization S-Unit (S1, S2) –32-bit ALU, 40-bit Shifter –Bitfield Operations, Branching M-Unit (M1, M2) –16 x 16 -> 32 D-Unit (D1, D2) –32-bit Add/Subtract –Address Calculations (Statically Scheduled)

82 CMPE750 - Shaaban #82 lec # 8 Spring 2015 4-2-2015 Example 1 TI TMS320C62xx Instruction Packing Instruction Packing Advanced 8-way VLIW Fetch Packet –CPU fetches 8 instructions/cycle Execute Packet –CPU executes 1 to 8 instructions/cycle –Fetch packets can contain multiple execute packets Parallelism determined at compile / assembly time Examples –1) 8 parallel instructions –2) 8 serial instructions –3) Mixed Serial/Parallel Groups A // B C D E // F // G // H Reduces Codesize, Number of Program Fetches, Power Consumption ABCDEFGH A B C D E F G H Example 2 AB C D E FGH Example 3 (Statically Scheduled VLIW)
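
The split of a fetch packet into execute packets can be sketched in Python. The sketch assumes each instruction carries a one-bit flag saying whether the next instruction executes in parallel with it. The flag list `p_bits` and the function name are hypothetical names for illustration, and the three calls mirror the slide's three examples.

```python
def split_execute_packets(instructions, p_bits):
    """Group a fetch packet into execute packets.

    p_bits[i] == 1 means instruction i+1 runs in parallel with instruction i
    (assumed encoding); a 0 closes the current execute packet.
    """
    packets, current = [], []
    for instruction, parallel_with_next in zip(instructions, p_bits):
        current.append(instruction)
        if not parallel_with_next:
            packets.append(current)
            current = []
    if current:                 # defensive: packet left open at the end
        packets.append(current)
    return packets


fetch_packet = list("ABCDEFGH")

# Example 1: all eight instructions in parallel (one execute packet).
fully_parallel = split_execute_packets(fetch_packet, [1, 1, 1, 1, 1, 1, 1, 0])
# Example 2: fully serial (eight execute packets of one instruction each).
fully_serial = split_execute_packets(fetch_packet, [0] * 8)
# Example 3: mixed, A // B, then C, then D, then E // F // G // H.
mixed = split_execute_packets(fetch_packet, [1, 0, 0, 0, 1, 1, 1, 0])
```

Because the grouping is fixed at compile or assembly time, the hardware only needs to read these flags to dispatch, with no dynamic dependence checking.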

83 CMPE750 - Shaaban #83 lec # 8 Spring 2015 4-2-2015 TI TMS320C62xx Pipeline Operation Pipeline Phases Single-Cycle Throughput Operate in Lock Step Fetch –PG Program Address Generate –PS Program Address Send –PW Program Access Ready Wait –PR Program Fetch Packet Receive Decode –DP Instruction Dispatch –DC Instruction Decode Execute –E1 - E5 Execute 1 through Execute 5 [Pipeline diagram: successive execute packets (up to Execute Packet 7 shown), each passing through PG PS PW PR DP DC E1 E2 E3 E4 E5, overlapped in time]

84 CMPE750 - Shaaban #84 lec # 8 Spring 2015 4-2-2015 C62x Pipeline Operation Delay Slots Delay Slots: number of extra cycles until result is: –written to register file –available for use by subsequent instructions –Multi-cycle NOP instruction can fill delay slots while minimizing code size impact Most Instructions (E1): No Delay Integer Multiply (E1 E2): 1 Delay Slot Loads (E1 E2 E3 E4 E5): 4 Delay Slots Branches (E1, then branch target passes through PG PS PW PR DP DC before reaching E1): 5 Delay Slots (Statically Scheduled VLIW) For better real-time performance predictability
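
The delay-slot counts above translate directly into a scheduling rule, sketched here in Python. The dictionary keys and function names are hypothetical, and the model is deliberately simple: a result produced by an instruction issued in cycle c with d delay slots is taken to be usable by an instruction issued in cycle c + d + 1.

```python
# Delay slots per instruction class, as listed on the slide.
DELAY_SLOTS = {"branch": 5, "load": 4, "multiply": 1, "other": 0}


def earliest_consumer_cycle(kind, issue_cycle):
    """First cycle in which a dependent instruction may issue."""
    return issue_cycle + DELAY_SLOTS[kind] + 1


def nop_cycles_needed(kind, independent_cycles):
    """NOP cycles left to fill after scheduling independent work in the slots."""
    return max(0, DELAY_SLOTS[kind] - independent_cycles)


load_use = earliest_consumer_cycle("load", issue_cycle=0)
padding = nop_cycles_needed("load", independent_cycles=1)
```

Since the hardware does not interlock in a statically scheduled VLIW, the compiler or assembly programmer applies this rule, filling slots with useful instructions where possible and a multi-cycle NOP otherwise.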

85 CMPE750 - Shaaban #85 lec # 8 Spring 2015 4-2-2015 C6000 Instruction Set Features Conditional Instruction Execution All Instructions can be Conditional (similar to Intel IA-64) –A1, A2, B0, B1, B2 can be used as Conditions –Based on Zero or Non-Zero Value –Compare Instructions can allow other Conditions (<, >, etc.) Reduces Branching Increases Parallelism

86 CMPE750 - Shaaban #86 lec # 8 Spring 2015 4-2-2015 C6000 Instruction Set Addressing Features Load-Store Architecture Two Addressing Units (D1, D2) Orthogonal –Any Register can be used for Addressing or Indexing Signed/Unsigned Byte, Half-Word, Word, Double- Word Addressable –Indexes are Scaled by Type Register or 5-Bit Unsigned Constant Index

87 CMPE750 - Shaaban #87 lec # 8 Spring 2015 4-2-2015 C6000 Instruction Set Addressing Modes/Features Indirect Addressing Modes –Pre-Increment *++R[index] –Post-Increment *R++[index] –Pre-Decrement *--R[index] –Post-Decrement *R--[index] –Positive Offset *+R[index] –Negative Offset *-R[index] 15-bit Positive/Negative Constant Offset from Either B14 or B15 Circular Addressing –Fast and Low Cost: Power of 2 Sizes and Alignment –Up to 8 Different Pointers/Buffers, Up to 2 Different Buffer Sizes Bit-reversal Addressing Dual Endian Support

88 CMPE750 - Shaaban #88 lec # 8 Spring 2015 4-2-2015 FIR Filter On TMS320C54xx vs. TMS320C62xx Two filter taps VLIW DSP: Larger code size 2 nd Gen Conventional DSP 4 th Gen VLIW DSP In parallel Smaller code size than VLIW

89 CMPE750 - Shaaban #89 lec # 8 Spring 2015 4-2-2015 TI TMS320C64xx Announced in February 2000, the TMS320C64xx is an extension of Texas Instruments' earlier TMS320C62xx architecture. The TMS320C64xx has 64 32-bit general-purpose registers, twice as many as the TMS320C62xx. The TMS320C64xx instruction set is a superset of that used in the TMS320C62xx, and, among other enhancements, adds significant SIMD/media processing capabilities: –8-bit operations for image/video processing. Introduced at 600 MHz clock speed (1 GHz now), but: –11-stage pipeline with long latencies –Dynamic caches. $100 qty 10k. The only current DSP family with compatible fixed and floating-point versions. Media Processing SIMD Not in C62

90 CMPE750 - Shaaban #90 lec # 8 Spring 2015 4-2-2015 C64xx (also C62xx and C67xx) VLIW have higher memory use due to simpler (RISC-like, fixed-width) instructions than conventional DSPs, more instructions and instruction bandwidth needed, Also VLIW but with variable-length instruction encoding (less memory use than C64xx) (16-32 bits) (VLIW)

91 CMPE750 - Shaaban #91 lec # 8 Spring 2015 4-2-2015 (XScale) Computational

92 CMPE750 - Shaaban #92 lec # 8 Spring 2015 4-2-2015 Superscalar DSP: LSI Logic ZSP400 A 4-way superscalar dynamically scheduled 16-bit fixed- point DSP core. 16-bit RISC-like instructions Separate on-chip caches for instructions and data Two MAC units, two ALU/shifter units –Limited SIMD support. –MACS can be combined for 32-bit operations. Possible Disadvantage: –Dynamic behavior complicates DSP software development: Ensuring real-time behavior Optimizing code. Multiple-Issue 4 th Generation DSPs Example Good or bad for a DSP?

93 CMPE750 - Shaaban #93 lec # 8 Spring 2015 4-2-2015 2004

94 CMPE750 - Shaaban #94 lec # 8 Spring 2015 4-2-2015 2010

95 CMPE750 - Shaaban #95 lec # 8 Spring 2015 4-2-2015 TI not actively improving their flagship FP DSP (fixed-point more important!) GPP (4 th generation TI DSP) 2004

