Blackfin ADSP-21535 Versus Sharc ADSP-21061


Blackfin ADSP-21535 Versus Sharc ADSP-21061
By: David W. Rasmussen
April 15, 2002

To be covered today:
- Quick overview of the architectures of both the Blackfin and Sharc DSPs
- Main features of both processors
- Main differences between the processors
- Code sample for an FIR filter on the Blackfin
- Benchmark comparison of three major DSP algorithms

Sharc ADSP-21061 [1]
[Block diagram: DAG 1 (8 x 4 x 32), DAG 2 (8 x 4 x 24), program sequencer, program cache memory (32 x 48), PM and DM address and data buses, bus connect, register file (16 x 40), floating- and fixed-point multiplier with fixed-point accumulator, 32-bit barrel shifter, floating-point and fixed-point ALU, timer, flags, JTAG test and emulation]

Sharc’s Main Features [2]:
- 32/40-bit IEEE floating-point math
- 32-bit fixed-point MACs with 64-bit product and 80-bit accumulation
- No arithmetic pipeline, so all computations are single-cycle
- Circular buffer addressing supported in hardware
- 32 address pointers support 32 circular buffers
- 16 40-bit data registers (the 16 x 40 register file)

Sharc’s Main Features Cont.:
- Six nested levels of zero-overhead looping in hardware
- Four busses to memory (2 DM + 2 PM)
- 1 Mbit on-chip dual-ported SRAM
- Maximum processing rate of 50 MIPS
- Up to four parallel operations in one clock cycle: add/subtract (+/-), multiply (*), a DM access, and a PM access
  - Assumes the pipeline is full
  - A PM data access clashes with the instruction fetch; the instruction cache is used to avoid the clash
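The hardware circular buffer addressing listed above can be pictured with a small software model. This is only an illustrative sketch in Python, not Sharc code: the class name `CircularPointer`, its fields, and the helper `addresses` are hypothetical, chosen to mirror the idea of a pointer that wraps inside a buffer defined by a base address and a length.

```python
class CircularPointer:
    """Software model of a circular-buffer pointer (illustrative only)."""

    def __init__(self, base, length):
        self.base = base        # first address of the buffer
        self.length = length    # number of locations in the buffer
        self.index = base       # current address, starts at the base

    def post_modify(self, modify):
        """Return the current address, then step by `modify` with wrap-around."""
        current = self.index
        offset = (self.index - self.base + modify) % self.length
        self.index = self.base + offset
        return current


def addresses(pointer, modify, count):
    """Collect `count` successive addresses produced by post-modify accesses."""
    return [pointer.post_modify(modify) for _ in range(count)]


if __name__ == "__main__":
    forward = CircularPointer(base=100, length=4)
    print(addresses(forward, 1, 6))
    backward = CircularPointer(base=100, length=4)
    print(addresses(backward, -1, 3))
```

In hardware the wrap-around costs no extra instructions, which is why a delay line for a filter can be walked with a plain post-modify access.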

Blackfin ADSP-21535[3]

Blackfin’s Main Features [4]:
- Two 16-bit MACs, two 40-bit ALUs, and four 8-bit video ALUs
- Support for 8/16/32-bit integer and 16/32-bit fractional data types
- Concurrent fetch of one instruction and two unique data elements
- Two loop counters that allow for nested zero-overhead looping
- Two DAG units with circular and bit-reversed addressing
- 300 MHz core clock performing 600 MMACs (two MACs per cycle)

Blackfin’s Main Features Cont.:
The following operations can all be processed in parallel in one clock cycle:
- Execution of a single instruction operating on both MACs or ALUs
- Execution of two 32-bit data moves (either 2 reads or 1 read and 1 write)
- Execution of two pointer updates
- Execution of a hardware loop update
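To make the dual-MAC idea concrete, here is a rough Python model of two 16-bit fractional multiply-accumulates issued together. It is a sketch under stated assumptions rather than a description of the actual hardware: the function names are hypothetical, inputs are assumed to be signed 1.15 fractions, the product is shifted left by one bit to give a 1.31-style result, and each accumulator is assumed to saturate at 40 bits.

```python
ACC_MAX = (1 << 39) - 1    # largest value of a signed 40-bit accumulator
ACC_MIN = -(1 << 39)       # smallest value of a signed 40-bit accumulator


def to_q15(value):
    """Quantize a float in [-1, 1) to a signed 16-bit 1.15 integer."""
    scaled = int(round(value * 32768))
    return max(-32768, min(32767, scaled))


def mac(acc, a, b):
    """acc += a * b for 1.15 inputs, with assumed 40-bit saturation."""
    product = (a * b) << 1          # align the 2.30 product to 1.31
    return max(ACC_MIN, min(ACC_MAX, acc + product))


def dual_mac(acc0, acc1, x, h0, h1):
    """One modeled cycle: both MACs use sample x with different coefficients."""
    return mac(acc0, x, h0), mac(acc1, x, h1)


def acc_to_float(acc):
    """Interpret an accumulator value as a fraction with 31 fractional bits."""
    return acc / float(1 << 31)


if __name__ == "__main__":
    a0, a1 = dual_mac(0, 0, to_q15(0.5), to_q15(0.5), to_q15(-0.5))
    print(acc_to_float(a0), acc_to_float(a1))
```

The extra accumulator bits above the 32-bit product are what let a long sum of products grow without overflowing immediately.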

Main Differences:
- The Blackfin is a 16-bit integer processor, although it can operate on 32-bit data values. If 32-bit data values are used:
  - Either one or two ALU operations can be performed in one clock cycle
  - One MAC can be performed, but it takes more than one clock cycle
- The Sharc is a 32-bit floating-point processor

Main Differences Cont.:
- The Blackfin has 4 address registers (each with a corresponding base, length, and modify register) to use for circular buffers, versus the Sharc’s 32
- The Blackfin has 2 nested hardware loops, whereas the Sharc has 6
- The Blackfin has an 8-stage pipeline (fetch 1-2, decode, address calculation, execute 1-3, writeback), whereas the Sharc has a 3-stage pipeline
- The Blackfin is clocked six times faster (300 MHz versus 50 MHz)
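One way to see why a 32-bit multiply needs more than one pass through 16-bit multipliers is to build the product from 16 x 16 partial products. The Python sketch below shows the arithmetic only. It uses unsigned operands for simplicity, the function name is hypothetical, and it is not a claim about how the Blackfin sequences these steps internally.

```python
MASK16 = 0xFFFF


def mul32_from_16(a, b):
    """Multiply two unsigned 32-bit integers using only 16x16-bit multiplies."""
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16

    # Four partial products, each one a single 16x16 multiply.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Shift each partial product to its weight and sum them.
    return (hi_hi << 32) + ((lo_hi + hi_lo) << 16) + lo_lo


if __name__ == "__main__":
    print(mul32_from_16(123456789, 987654321) == 123456789 * 987654321)
```

Four partial products cannot all be formed by two multipliers in a single step, which is consistent with the slide's note that a 32-bit MAC costs extra cycles.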

Blackfin FIR Code Sample [5]:

    LSETUP(E_FIR_START,E_FIR_END) LC0=P1>>1;    //Loop 1 to Ni/2
    E_FIR_START: R1=PACK(R1.H,R0.H) || [I0++]=R0 || R2.L=W[I2++];
                                    //Store X1 into the lower half of R1.
                                    //Update the delay line.
                                    //Fetch h0 into lower half of R2
    LSETUP(E_MAC_ST,E_MAC_END)LC1=P2>>1;        //Loop 1 to Nc/2 - 1
    A1=R2.L*R1.L, A0=R2.H*R1.H || R2.H=W[I2++] || [I3++]=R3;
                                    //A1=h0*X1, A0=hn-1*X-n+1.
                                    //Fetch h1 into upper half of R2.
                                    //Store the output.
    E_MAC_ST: A1+=R0.L*R2.H,A0+=R0.L*R2.L || R2.L=W[I2++] || R0=[I1--];
                                    //A1+=X0*h1, A0+=X0*h0
                                    //Fetch filter coeff. h2 into the lower
                                    //half of R2. Fetch X-1 and X-2 into the
                                    //upper and lower half of R0 (for the
                                    //first time in this loop)
    E_MAC_END: A1+=R0.H*R2.L,A0+=R0.H*R2.H || R2.H=W[I2++] ;
                                    //A1+=X-1*h2, A0+=X-1*h1
                                    //Fetch h3 into the upper half of R2.
                                    //(for the first time in this loop)
    E_FIR_END: R3.H=(A1+=R0.L*R2.H),R3.L=(A0+=R0.L*R2.L) || R0=[P0++] || R1=[I0];
                                    //A1+=X-n+2*hn-1, A0+=X-n+2*hn+2
                                    //Fetch the next pair of inputs (X2 and X3) into lower
                                    //and upper half of R0. Fetch X-n+2 and X-n+3 into R1
    ...
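For readers who want a reference to check the assembly against, the following Python sketch computes the same kind of FIR filter in plain arithmetic. It is an assumption-laden model, not a translation of the listing: it ignores 16-bit packing, the delay-line layout, and fixed-point scaling, and the function names are hypothetical. The second function mirrors the general idea, as suggested by the A0 and A1 comments above, of producing two outputs per outer-loop pass in two accumulators.

```python
def fir_reference(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_two_outputs_per_pass(x, h):
    """Same filter, computing y[n] and y[n+1] together in two accumulators."""
    assert len(x) % 2 == 0, "this sketch assumes an even number of inputs"
    y = [0] * len(x)
    for n in range(0, len(x), 2):
        a0 = 0    # accumulates y[n]
        a1 = 0    # accumulates y[n+1]
        for k in range(len(h)):
            if n - k >= 0:
                a0 += h[k] * x[n - k]
            if n + 1 - k >= 0:
                a1 += h[k] * x[n + 1 - k]
        y[n], y[n + 1] = a0, a1
    return y


if __name__ == "__main__":
    samples = [1, 2, 3, 4]
    taps = [1, 0, -1]
    print(fir_reference(samples, taps))
    print(fir_two_outputs_per_pass(samples, taps))
```

Both functions return the same sequence, so either can serve as a golden model when testing an optimized implementation.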

Benchmarks:

For the Sharc [6]:

    Algorithm Type             Time         Cycles
    1024-pt complex FFT        0.37 ms      18,221
    FIR filter (per tap)       20 ns        1
    IIR filter (per biquad)    80 ns        4

For the Blackfin [7]:

    Algorithm Type             Time         Cycles
    256-pt complex FFT         0.0106 ms    3,176
    FIR filter (per tap)       13.33 ns     4
    IIR filter (per biquad)    20 ns        6
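The time and cycle columns are linked by the core clock, so the figures can be cross-checked with a few lines of Python. This sketch assumes the 50 MHz and 300 MHz clock rates quoted under Main Differences; the function and constant names are hypothetical.

```python
SHARC_MHZ = 50.0       # Sharc clock quoted under Main Differences
BLACKFIN_MHZ = 300.0   # Blackfin clock quoted under Main Differences


def execution_time_ns(cycles, clock_mhz):
    """Time in nanoseconds taken by `cycles` core cycles at `clock_mhz` MHz."""
    return cycles * 1000.0 / clock_mhz


if __name__ == "__main__":
    rows = [
        ("Sharc 1024-pt FFT", 18221, SHARC_MHZ),
        ("Sharc FIR tap", 1, SHARC_MHZ),
        ("Sharc IIR biquad", 4, SHARC_MHZ),
        ("Blackfin 256-pt FFT", 3176, BLACKFIN_MHZ),
        ("Blackfin FIR tap", 4, BLACKFIN_MHZ),
        ("Blackfin IIR biquad", 6, BLACKFIN_MHZ),
    ]
    for name, cycles, mhz in rows:
        print(name, round(execution_time_ns(cycles, mhz), 2), "ns")
```

Under these clock assumptions the computed times agree with the table to within rounding.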

Analysis:
- The Blackfin is faster in absolute time for all three algorithms
- The exact performance gain on the FFT is uncertain because the transform lengths differ (1024 versus 256 points), but it is somewhere between 2 and 9 times faster
- Both the FIR and IIR took more cycles to complete on the Blackfin, as more cycles are required for 32-bit operations
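The speedups can be worked out directly from the benchmark times. The Python sketch below does this for the FIR and IIR figures and, for the FFT, extrapolates the 256-point Blackfin time to 1024 points under three scaling models. The slides do not say how the 2 to 9 range was obtained, so the scaling models (linear, N log N, and quadratic) and all names here are assumptions made only for illustration.

```python
import math


def speedup(sharc_time, blackfin_time):
    """How many times faster the Blackfin is, given times in the same unit."""
    return sharc_time / blackfin_time


def scale_fft_time(time_256, model):
    """Extrapolate a 256-point FFT time to 1024 points under an assumed model."""
    n_small, n_large = 256, 1024
    factors = {
        "linear": n_large / n_small,
        "nlogn": (n_large * math.log2(n_large)) / (n_small * math.log2(n_small)),
        "quadratic": (n_large / n_small) ** 2,
    }
    return time_256 * factors[model]


if __name__ == "__main__":
    print("FIR speedup:", round(speedup(20.0, 13.33), 2))
    print("IIR speedup:", round(speedup(80.0, 20.0), 2))
    for model in ("linear", "nlogn", "quadratic"):
        estimate_ms = scale_fft_time(0.0106, model)
        print("FFT speedup,", model, "scaling:", round(speedup(0.37, estimate_ms), 2))
```

With these assumptions the FIR gain is about 1.5 times and the IIR gain is 4 times, while the FFT estimates fall inside the 2 to 9 range quoted above, with N log N scaling giving roughly 7 times.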

References
[1] ENCM515 Lecture Slides for January 11, 2002, [http://www.enel.ucalgary.ca/People/Smith/2002webs/encm515_02/02presentations/02january/02overviewSHARCarchitecture.ppt], Dr. Mike Smith
[2] Sharc Architecture Overview, [http://www.analog.com/technology/dsp/Sharc/architecture.html], Analog Devices
[3] DSP Manuals, [http://www.analog.com/library/dspManuals/pdf/21535/overview.pdf], Analog Devices
[4] Blackfin Architecture Overview, [http://www.analog.com/technology/dsp/Blackfin/architecture/basics.html], Analog Devices
[5] FIR Blackfin Code Example, [ftp://ftp.analog.com/pub/dsp/blackfin/examples/fir_032101.zip], Analog Devices
[6] Sharc DSP Data Sheet, [http://www.analog.com/productSelection/pdf/ADSP-20161_L_b.pdf], Analog Devices
[7] Blackfin DSP Benchmark Comparison, [http://www.analog.com/technology/dsp/Blackfin/benchmarks/examples.html], Analog Devices

Special Thanks To: Mike Roest, for the use of his individual assignment, entitled “Examination of the Analog Devices Blackfin and SHARC 21061” and submitted March 12, 2002, as preliminary research material for this report.