Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Presentation transcript:

Kathy Grimes

Signals can be electrical, mechanical, or acoustic. Most real-world signals are analog: they vary continuously over time. Analog processing has many limitations: repeatability, component tolerances, and difficulty storing information or implementing certain operations. This leads us to DSP…

DSP represents signals by sequences of numbers. Pros: repeatable; accuracy can be controlled; time-varying operations are easier to implement. Cons: sampling causes loss of information; round-off errors; A/D and D/A mixed-signal hardware is required.

An analog-to-digital converter turns a continuous-time signal into a discrete-time signal; Figure 11.1 shows the sampling of a signal. Common signals: the step discontinuity (Figure 11.2) and the impulse (Figure 11.3). FIGURE 11.1 Discrete Time Signals. FIGURE 11.2 Step Function. FIGURE 11.3 Impulse Function.
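As a small illustration of the two common signals named above, here is a minimal C sketch that fills a buffer with a discrete-time unit step or unit impulse. The function names `unit_step` and `unit_impulse` and the onset index `n0` are hypothetical choices for this example, not taken from the slides.

```c
#include <stddef.h>

/* Unit step: 0 before index n0, 1 from n0 onward. */
void unit_step(float *out, size_t len, size_t n0)
{
    for (size_t n = 0; n < len; ++n)
        out[n] = (n >= n0) ? 1.0f : 0.0f;
}

/* Unit impulse: 1 at index n0, 0 everywhere else. */
void unit_impulse(float *out, size_t len, size_t n0)
{
    for (size_t n = 0; n < len; ++n)
        out[n] = (n == n0) ? 1.0f : 0.0f;
}
```

Feeding the impulse into a filter yields its impulse response, which is a convenient way to check filter code by hand.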

DSP algorithms are built from three basic functions: delay, add, and multiply. The raw performance of a DSP algorithm is usually measured by the number of operations needed to execute it. FIGURE 11.4 Add Function. FIGURE 11.5 Multiply Function. FIGURE 11.6 Delay Function.

These two systems, feedforward and feedback, can be used in combination to develop any discrete difference equation. FIGURE 11.7 Feedforward System. FIGURE 11.8 Feedback System.
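To make the combination concrete, the following sketch implements one possible first-order difference equation, y[n] = b0·x[n] + b1·x[n-1] + a1·y[n-1], using only delay, add, and multiply. The equation, the struct `first_order_t`, and the function `first_order_step` are assumptions chosen for illustration; the slides do not give a specific equation.

```c
#include <stddef.h>

/* State for y[n] = b0*x[n] + b1*x[n-1] + a1*y[n-1]. */
typedef struct {
    float b0, b1;   /* feedforward coefficients */
    float a1;       /* feedback coefficient */
    float x_prev;   /* delay element holding x[n-1] */
    float y_prev;   /* delay element holding y[n-1] */
} first_order_t;

/* Process one input sample and return the corresponding output sample. */
float first_order_step(first_order_t *f, float x)
{
    /* Feedforward path (b0, b1) plus feedback path (a1). */
    float y = f->b0 * x + f->b1 * f->x_prev + f->a1 * f->y_prev;

    /* Update the two delay elements for the next call. */
    f->x_prev = x;
    f->y_prev = y;
    return y;
}
```

With b0 = 1, b1 = 0, and a1 = 0.5, an impulse input produces 1, 0.5, 0.25, and so on: the feedback path keeps the output alive after the input has returned to zero.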

Fixed-point DSPs perform integer operations and have a fixed range: the word length (for example, 16 bits) sets the maximum range. Floating-point DSPs perform integer and floating-point operations and have a dynamic operating range. Analog real-world signals have infinite precision, and floating point mimics the "infinite" range better. Floating point is also easier to implement, because it reduces the need for manual scaling and overflow handling, although results are still rounded. Why not always use floating point? Cost, availability, and performance. Precision: using the same number of bits, floating point is good for smaller values but poorer for larger values.
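The overflow and rounding concerns of fixed-point arithmetic can be sketched with the common Q15 format, in which a 16-bit signed integer v stands for v / 32768. This is an illustrative sketch only: the type `q15_t` and the helpers `q15_saturate`, `q15_add`, and `q15_mul` are hypothetical names, and the code assumes the compiler implements right shift of negative integers as an arithmetic shift, which is typical but implementation-defined in C.

```c
#include <stdint.h>

/* Q15: a 16-bit signed integer v represents the value v / 32768. */
typedef int16_t q15_t;

/* Clamp a 32-bit intermediate to the 16-bit range instead of wrapping. */
static q15_t q15_saturate(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (q15_t)v;
}

/* Saturating add: the sum is formed in 32 bits, then clamped. */
q15_t q15_add(q15_t a, q15_t b)
{
    return q15_saturate((int32_t)a + (int32_t)b);
}

/* Multiply with rounding: Q15 * Q15 gives Q30, shifted back to Q15. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p += 1 << 14;                          /* add half an LSB to round */
    return q15_saturate(p >> 15);          /* assumes arithmetic shift */
}
```

For example, 0.5 × 0.5 is `q15_mul(16384, 16384)`, which gives 8192 (0.25), while adding 30000 and 10000 saturates at 32767 rather than wrapping to a negative value. A floating-point type would absorb both cases without this explicit bookkeeping.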

SIMD microarchitecture and instructions: one instruction operates on four data values at once, instead of one instruction per value. This increases performance for low-level DSP functions such as multiply-accumulate (MAC). FIGURE SIMD Instruction.

Relevant processor characteristics: clock speed and cache size. DSP architectures usually partition the memory space manually in order to reduce the number of accesses to external memory, because latency is costly in terms of time and resources. Intel architectures have large amounts of cache and can hide much of the gap between fast and slow memory; however, all data initially starts in "far" memory. Output data should be generated sequentially; accessing memory in a scattered pattern (especially while using threads) should be avoided.

Three approaches: intrinsics, vectorization, and Intel Performance Primitives.

Intrinsics are C code that calls special built-in compiler capabilities that map closely to the underlying SSE instruction set. Added data types: __m64, __m128, __m128d, __m128i. Intrinsic operation types: arithmetic (fixed- and floating-point), shift, logical, compare, set, shuffle, and concatenation. Example: a packed add takes four FP values packed into a and four packed into b and performs the four additions in one instruction.
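A packed single-precision add of this kind can be sketched with the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` from `<xmmintrin.h>`. The wrapper function `add_arrays` is a hypothetical name for this example, and the `__SSE__` guard with a scalar fallback is an assumption added so the sketch still compiles on targets without SSE.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

/* out[i] = a[i] + b[i] for every i in [0, n). */
void add_arrays(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);  /* load 4 floats, no alignment needed */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* 4 additions at once */
    }
#endif
    for (; i < n; ++i)   /* scalar tail, or the whole loop without SSE */
        out[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used here for safety; when the buffers are known to be 16-byte aligned, the aligned variants `_mm_load_ps` and `_mm_store_ps` can be used instead.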

Use the compiler to apply vectorization techniques to loops within the data-processing iteration: it looks for opportunities to convert loops from a single-element implementation to a vector-based one, so that multiple operands can be operated on at the same time. Compilers such as GCC align the generated code with the SIMD instruction set. Use #pragma directives to guide the compiler and avoid overheads such as assumed data dependences. Listing 11.4 Explicitly Don't Vectorize Loop. Listing 11.7 Memory Alignment Property and Discarding Assumed Data Dependences.
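The listings referenced above are not reproduced in this transcript, so the following is only a sketch of the general idea, not the original code. It assumes the Intel compiler's `#pragma ivdep` and GCC's `#pragma GCC ivdep`, plus the C99 `restrict` qualifier, as ways to tell the compiler that the loop carries no data dependences; the function name `scale_add` is hypothetical.

```c
#include <stddef.h>

/* dst[i] += gain * src[i]. The caller promises that dst and src do not
   overlap, which is what 'restrict' expresses to the compiler. */
void scale_add(float *restrict dst, const float *restrict src,
               float gain, size_t n)
{
    /* Ask the compiler to discard assumed loop-carried dependences.
       Which pragma applies depends on the compiler in use. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < n; ++i)
        dst[i] += gain * src[i];
}
```

Whether the loop is actually vectorized depends on the compiler and optimization flags; most compilers can emit a vectorization report to confirm it.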

Comparisons on performance: this performance would be vastly different if the memory were not already aligned.

Intel libraries provide highly optimized implementations for many different applications (including audio codecs, image processing, data compression, etc.). The libraries take full advantage of the CPU and SIMD, and most are written for performance. The libraries are threaded and can obtain performance gains by parallelizing the algorithm. Areas covered include: signal processing (convolution and correlation, finite impulse response (FIR) filters, FIR coefficient generation functions, infinite impulse response (IIR) filters, transforms); image processing; small matrices and realistic rendering; cryptography.

FIR filter equation: y[n] = a·x[n] + b·x[n-1] + c·x[n-2]. Listing 11.8 FIR Filter C Code Example. Listing 11.9 FIR Using Intel Performance Primitives.
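The listings themselves are not part of this transcript. As a stand-in, here is a minimal plain-C sketch of the three-tap equation above; it is not the original Listing 11.8. The function name `fir3` and the choice of zero initial history (x[-1] = x[-2] = 0) are assumptions.

```c
#include <stddef.h>

/* y[n] = a*x[n] + b*x[n-1] + c*x[n-2], assuming x[-1] = x[-2] = 0. */
void fir3(const float *x, float *y, size_t len, float a, float b, float c)
{
    float x1 = 0.0f;  /* holds x[n-1] */
    float x2 = 0.0f;  /* holds x[n-2] */

    for (size_t n = 0; n < len; ++n) {
        float x0 = x[n];                 /* read the input before writing y[n] */
        y[n] = a * x0 + b * x1 + c * x2; /* three multiplies, two adds */
        x2 = x1;                         /* shift the delay line */
        x1 = x0;
    }
}
```

Applied to an impulse, the output is simply the coefficients a, b, c followed by zeros, which makes the routine easy to verify.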

Loop unrolling helps get rid of data dependences. By rearranging how the data elements are used, we can reduce the number of times we need to read data.
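One way to picture this is a three-tap FIR of the form y[n] = a·x[n] + b·x[n-1] + c·x[n-2] unrolled by two, so that each input sample is read from memory once and reused for two outputs. This is a sketch under that assumption, not the code from the slides; the function name `fir3_unrolled` and the zero initial history are hypothetical choices.

```c
#include <stddef.h>

/* Three-tap FIR, unrolled by two; x[-1] and x[-2] are taken as 0. */
void fir3_unrolled(const float *x, float *y, size_t len,
                   float a, float b, float c)
{
    float x1 = 0.0f, x2 = 0.0f;   /* x[n-1] and x[n-2] */
    size_t n = 0;

    for (; n + 2 <= len; n += 2) {
        float s0 = x[n];          /* each sample is loaded exactly once */
        float s1 = x[n + 1];

        /* The two outputs do not depend on each other, so they can be
           computed in parallel by the hardware. */
        y[n]     = a * s0 + b * x1 + c * x2;
        y[n + 1] = a * s1 + b * s0 + c * x1;

        x2 = s0;                  /* history for the next pair */
        x1 = s1;
    }

    if (n < len)                  /* odd length: one leftover sample */
        y[n] = a * x[n] + b * x1 + c * x2;
}
```

Larger unroll factors follow the same pattern; the trade-off is more registers in use and a longer tail-handling section.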

The application is computation intensive and needs a significant amount of embedded computational performance. The same basic algorithmic pattern applies even though physical configurations, parameters, and functionality differ: beam forming, envelope extraction, and polar-to-Cartesian coordinate translation.

FIGURE Block Diagram of a Typical Ultrasound Imaging Application.

FIGURE Block Diagram of the Envelope Detector.

FIGURE Polar-to-Cartesian Conversion of a Hypothetically Scanned Rectangular Object. Listing Code Sample for Envelope Detector.

Why such a large difference?

Digital signal processing on general-purpose processors extends their processing capabilities. It simplifies the overall application when platforms require control, communications, and general-purpose processing along with DSP. There are many ways to improve an Intel system: special C code, vectorization, and specific libraries. Performance is greatly enhanced when DSP is implemented properly.