Number Representation for


Number Representation for DSP processor

Numerical issues and data formats. Fixed point. Fractional number. Floating point. Comparison of formats and dynamic ranges.

Numerical Issues and Data Formats
C6000 Numerical Representation
Fixed point arithmetic:
  16-bit (integer or fractional).
  Signed or unsigned.
Floating point arithmetic:
  32-bit single precision.
  64-bit double precision.

How are numbers represented and processed in DSP processors for implementing DSP algorithms?

How are numbers represented? A collection of N binary digits (bits) has 2^N possible states. This can be seen from elementary counting theory, which tells us that there are two possibilities for the first bit, two possibilities for the next bit, and so on until the last bit, resulting in 2 x 2 x 2 ... = 2^N possibilities or states. In the most general sense, we can allow these states to represent anything conceivable. There is no meaning inherent in a binary word, although most people are tempted to think of binary words as positive integers. The meaning of an N-bit binary word depends entirely on its interpretation.

Fixed Point number and arithmetic
Binary representation of 4-bit signed numbers

Decimal Value   Sign Magnitude   One's Complement   Two's Complement
 7              0111             0111               0111
 1              0001             0001               0001
+0              0000             0000               0000
-0              1000             1111               -
-1              1001             1110               1111
-7              1111             1000               1001
-8              -                -                  1000

3 BIT NUMBER IN SIGNED 2’S COMPLEMENT REPRESENTATION

Dynamic range of integer and fraction 4-bit numbers
Unsigned integer: smallest value 0000 (0); largest value 1111 (15).
Signed integer: most positive value 0111 (+7); most negative value 1000 (-8).
Unsigned fraction: smallest value .0000 (0); largest value .1111 (0.9375).
Signed fraction: most positive value 0.111 (+0.875); most negative value 1.000 (-1).

Dynamic Range
Dynamic range in dB = 20 log10(Max/Min), where Min is the smallest nonzero magnitude.
In fixed point unsigned integer representation using N bits, the ratio of max to min is approximately 2^N to 1 (the exact maximum is 2^N - 1).
In fixed point signed integer representation using N bits, the ratio of max to min is approximately 2^(N-1) to 1.

In fixed point unsigned fraction representation using N bits, the range of max to min is 1 - 2^-N to 2^-N.
In fixed point signed fraction representation using N bits, the range of max to min is 1 - 2^-(N-1) to 2^-(N-1).

Number format       Dynamic Range       Dynamic Range in dB       Precision
Unsigned Integer    0 to 65535          20 log10(2^16) = 96 dB    1
Signed Integer      -32768 to 32767     20 log10(2^15) = 90 dB    1
Unsigned Fraction   0 to 0.99998474     96 dB                     2^-16
Signed Fraction     -1 to 0.99996948    90 dB                     2^-15
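The dB figures in the table follow from 20 log10(2^N) = N x 20 log10(2), which is about 6.02 dB per bit. The sketch below applies that rule of thumb; the function names and the rounded constant are illustrative choices, not part of the original slides.

```c
/* 20*log10(2), rounded: each extra bit adds about 6.02 dB of dynamic range. */
#define DB_PER_BIT 6.0206

/* Approximate dynamic range of an N-bit unsigned format (ratio 2^N to 1). */
double unsigned_range_db(int n_bits)
{
    return DB_PER_BIT * n_bits;
}

/* Approximate dynamic range of an N-bit signed format (ratio 2^(N-1) to 1). */
double signed_range_db(int n_bits)
{
    return DB_PER_BIT * (n_bits - 1);
}
```

For N = 16 these give roughly 96 dB and 90 dB, the values listed for the unsigned and signed formats.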

Fixed Point Arithmetic - Problems The following equation is the basis of many DSP algorithms (See Chapter 1): Two problems arise when using signed and unsigned integers: Multiplication overflow. Addition overflow.

Multiplication Overflow
16-bit x 16-bit = 32-bit
Example using a 4-bit representation: 3 x 8 = 24, or 0011 x 1000 = 11000 in binary. 24 cannot be represented with 4 bits.

Addition Overflow
32-bit + 32-bit = 33-bit
Example using a 4-bit representation: 8 + 8 = 16, or 1000 + 1000 = 10000 in binary. 16 cannot be represented with 4 bits.

Fixed Point Arithmetic - Solution The solutions for reducing the overflow problem are: Saturate the result. Use double precision result. Use fractional arithmetic. Use floating point arithmetic.

Solution - Saturate the result
Unsigned numbers:
If A x B <= 15 then result = A x B
If A x B > 15 then result = 15
Example: 3 x 8 = 24, saturated to 15 (1111).

Solution - Saturate the result
Signed numbers:
If -8 <= A x B <= 7 then result = A x B
If A x B > 7 then result = 7
If A x B < -8 then result = -8
Example: 3 x -8 = -24, saturated to -8 (1000).
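The two saturation rules above can be sketched in C for the 4-bit case. This is a minimal illustration under the assumption that operands already fit in 4 bits; the function names are hypothetical, and a real DSP would normally saturate in hardware.

```c
/* Saturating multiply for 4-bit unsigned operands (0..15):
   products above 15 are clamped to 15. */
unsigned sat_mul_u4(unsigned a, unsigned b)
{
    unsigned product = a * b;   /* computed at full int width, so no wraparound here */
    return product > 15u ? 15u : product;
}

/* Saturating multiply for 4-bit signed operands (-8..7):
   products are clamped to the range [-8, 7]. */
int sat_mul_s4(int a, int b)
{
    int product = a * b;
    if (product > 7) {
        return 7;
    }
    if (product < -8) {
        return -8;
    }
    return product;
}
```

With these definitions, sat_mul_u4(3, 8) yields 15 and sat_mul_s4(3, -8) yields -8, matching the two slide examples.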

Solution - Double precision result
For a 4-bit x 4-bit multiplication, hold the result in an 8-bit location.
Problems:
Uses more memory for storing data.
If the result is used in another multiplication, the data needs to be converted back into single precision format (e.g. prod = prod x sum).
Results need to be scaled down if they are to be sent to a D to A converter.

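Solution - Fractional arithmetic
If A and B are fractional (magnitudes less than 1) then: |A x B| <= min(|A|, |B|)
i.e. the magnitude of the result is no larger than that of either operand, hence the product does not overflow. The one exception in two's complement is -1 x -1 = +1, which lies just outside the representable range.
Examples:
0.6 x 0.2 = 0.12 (0.12 < 0.6 and 0.12 < 0.2)
0.9 x 0.9 = 0.81 (0.81 < 0.9)
0.1 x 0.1 = 0.01 (0.01 < 0.1)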

Fractional numbers - Sign Extension
a = 0.110 = 0.5 + 0.25 = 0.75
b = 1.110 = -1 + 0.5 + 0.25 = -0.25
a x b = 11.110100 = -0.1875 (an 8-bit product with two sign bits, obtained by sign-extending the partial products)
To keep the same resolution as the operands we need to select 4 bits of the product: one sign bit and the three fractional bits that follow it.
The way to do it is to shift left by one bit and store the upper 4 bits, or right shift by three and store the lower 4 bits. The discarded low bits are truncated, so the stored result is 1.110 = -0.25 rather than -0.1875.
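The 4-bit fractional multiply in the sign-extension example (a = 0.75, b = -0.25) can be sketched in C as follows. The function name q3_mul and the choice to hold each 4-bit fraction in an int8_t are assumptions made for illustration. The right shift by three is written as a division that rounds toward negative infinity, because the C result of >> on a negative value is implementation-defined.

```c
#include <stdint.h>

/* Multiply two 4-bit signed fractions (1 sign bit, 3 fractional bits).
   Each operand is an integer in [-8, 7] that stands for value/8. */
int8_t q3_mul(int8_t a, int8_t b)
{
    int16_t product = (int16_t)(a * b);   /* 8-bit product with 6 fractional bits */

    /* Drop the 3 extra fractional bits, truncating toward negative infinity
       as an arithmetic right shift by 3 would. */
    int16_t q = product / 8;
    if (product < 0 && product % 8 != 0) {
        q -= 1;
    }
    /* Note: -1 x -1 (a = b = -8) gives 8, which does not fit in 4 bits. */
    return (int8_t)q;
}
```

For a = 6 (0.110) and b = -2 (1.110) the product is -12 and the function returns -2, i.e. 1.110 = -0.25 after truncation.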

Although with 2's complement integers we can implement both addition and subtraction by the usual binary addition (with special care for the sign bit), integers are not convenient for implementing DSP algorithms. For example, if we multiply two 8-bit words together, we need 16 bits to store the result.

The required word length increases without bound as we multiply more numbers together. Although not impossible, it is complicated to handle this increase in word length using integer arithmetic. The problem can be easily handled by using numbers between -1 and 1, instead of integers, because the product of two numbers in [-1,1] is always in the same range.

In the 2's complement fractional representation, an N-bit binary word can represent 2^N equally spaced numbers from -1 to 1 - 2^-(N-1). For example, we interpret an 8-bit binary word b7 b6 b5 b4 b3 b2 b1 b0 as the fractional number -b7 + b6 x 2^-1 + b5 x 2^-2 + ... + b0 x 2^-7. This representation is called Q-format.

Integer Vs Fractional Number
Different notations are used to represent different binary formats. The Qm.n notation is the most widely used. In Qm.n representation, m bits represent the integer portion and n bits represent the fractional portion, e.g. Q15.0 and Q0.15. If N is the total number of bits, then N = m + n + 1 (the extra bit is the sign bit).

The number range is -2^m to 2^m - 2^-n and the resolution is 2^-n.
E.g. for the Q14.1 representation, the range is [-2^14, 2^14 - 2^-1] = [-16384.0, +16383.5].
In hex, from most negative to most positive: [0x8000, 0x8001 ... 0xFFFF, 0x0000, 0x0001 ... 0x7FFE, 0x7FFF].
The resolution is 2^-1 = 0.5.

Conversion
To convert a number from floating point to Qm.n format:
1. Multiply the floating point number by 2^n
2. Round to the nearest integer

Conversion
To convert a number from Qm.n format to floating point:
1. Convert the number directly to floating point
2. Divide by 2^n
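The two conversion recipes can be sketched in C for a 16-bit Qm.n word. The function names are hypothetical, and the sketch assumes the caller passes a value that already lies inside the range of the chosen format (no saturation is performed).

```c
#include <stdint.h>

/* Floating point to Qm.n: multiply by 2^n, then round to the nearest integer.
   Assumes the value is within the representable range of the 16-bit format. */
int16_t float_to_q(double value, int n)
{
    double scaled = value * (double)(1L << n);
    /* Round half away from zero before the truncating conversion. */
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Qm.n to floating point: convert the integer directly, then divide by 2^n. */
double q_to_float(int16_t q, int n)
{
    return (double)q / (double)(1L << n);
}
```

For example, float_to_q(0.5, 15) gives 16384 (0x4000) in Q0.15, and q_to_float(0x7FFF, 1) gives 16383.5 in Q14.1.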

In C6211, it is easiest to handle Q-15 numbers represented by each 16 bit binary word, because the multiplication of two Q-15 numbers results in a Q-30 number that can still be stored in a 32-bit wide register of C6211. The programmer needs to keep track of the implied binary point when manipulating Q-format numbers.

15-bit * 15-bit Multiplication
In the CPU, Q15 x Q15 gives a Q30 product:
    MPY A3,A4,A6
    NOP
Store to data memory as Q15:
    SHR A6,15,A6
    STH A6,*A7

Assembly language implementation
When A0 and A1 contain two 16-bit numbers in the Q-15 format, we can perform the multiplication using MPY followed by a right shift.
    MPY .M1 A0,A1,A2
    NOP
    SHR .S1 A2,15,A2 ; lower 16 bits contain the result in Q-15 format

C language implementation
Let's suppose we have two 16-bit numbers in Q-15 format, stored in variables x and y as follows:
    short x = 0x0011;        /* 0.000518799 in decimal */
    short y = (short)0xfe12; /* -0.015075684 in decimal */
    short z;                 /* variable to store x*y */
The product of x and y can be computed and stored in Q-15 format as follows:
    z = (x * y) >> 15;
The result of x*y is a 32-bit word with 2 sign bits. Right shifting it by 15 bits discards the last 15 bits, and storing the shifted result in z, which is a short (16-bit) variable, removes the extended sign bit by keeping only the lower 16 bits.
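The same Q-15 multiply can be wrapped in a small function using fixed-width types. This is a sketch rather than the C6000 library approach: the name q15_mul is hypothetical, and it assumes a compiler where >> on a negative signed value is an arithmetic shift (true of common compilers, but implementation-defined in standard C).

```c
#include <stdint.h>

/* Multiply two Q-15 numbers and return the result in Q-15 format. */
int16_t q15_mul(int16_t x, int16_t y)
{
    /* 16-bit x 16-bit gives a 32-bit Q-30 product with two sign bits. */
    int32_t product = (int32_t)x * (int32_t)y;

    /* Drop the low 15 bits; keeping the lower 16 bits of what remains
       removes the extended sign bit. Assumes an arithmetic right shift.
       The single overflow case is -1 x -1, which does not fit in Q-15. */
    return (int16_t)(product >> 15);
}
```

With the operands from the slide, q15_mul(0x0011, (int16_t)0xfe12) returns -1 (0xFFFF): the true product is about -0.0000078, which truncates down to the smallest negative Q-15 step.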

However, care must be taken when adding binary numbers. Because each Q-15 number can represent numbers in the range [-1, 1 - 2^-15], if the result of summing two Q-15 numbers is not in this range, we cannot represent the result in the Q-15 format. When this happens, we say an overflow has occurred.

Unless carefully handled, the overflow makes the result incorrect. Therefore, it is really important to prevent overflows from occurring when implementing DSP algorithms. One way of avoiding overflow is to scale all the numbers down by a constant factor, effectively making all the numbers very small, so that any summation would give results in the [-1,1) range. It is important to work out how much scaling is needed. Because scaling reduces the effective number of digits and increases quantization errors, we usually look for the minimum amount of scaling that prevents overflow.
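Overflow in a Q-15 addition can be detected by forming the sum in a wider type and comparing it with the 16-bit limits. The sketch below combines that check with the saturation approach listed earlier among the solutions; it is an illustration with hypothetical function names, not the scaling method described in this paragraph.

```c
#include <stdint.h>

/* Returns 1 if a + b falls outside the Q-15 range, 0 otherwise. */
int q15_add_overflows(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* exact sum in 32 bits */
    return sum > INT16_MAX || sum < INT16_MIN;
}

/* Adds two Q-15 numbers, clamping the result to [-1, 1 - 2^-15]. */
int16_t q15_add_sat(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) {
        return INT16_MAX;    /* 0x7FFF, i.e. 1 - 2^-15 */
    }
    if (sum < INT16_MIN) {
        return INT16_MIN;    /* 0x8000, i.e. -1 */
    }
    return (int16_t)sum;
}
```

For example, 0.75 + 0.5 (24576 + 16384) overflows and saturates to 32767, while 0.25 + 0.25 (8192 + 8192) gives 16384 unchanged.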

Most fixed-point DSP processors use two's complement fractional numbers in different Q formats. However, the assembler only recognizes integer values, so we have to keep track of the binary point when using fractional numbers in an assembly program.

Steps for converting a fractional number in Q-format into an integer that can be recognized by the assembler:
1. Normalize the fractional number to the range determined by the desired Q-format.
2. Multiply the normalized fractional number by 2^n, where n is the total number of fractional bits.
3. Round the product to the nearest integer.

‘C6000 C Data Types

Type                 Size      Representation
char, signed char    8 bits    ASCII
unsigned char        8 bits    ASCII
short                16 bits   2's complement
unsigned short       16 bits   binary
int, signed int      32 bits   2's complement
unsigned int         32 bits   binary
long, signed long    40 bits   2's complement
unsigned long        40 bits   binary
enum                 32 bits   2's complement
float                32 bits   IEEE 32-bit
double               64 bits   IEEE 64-bit
long double          64 bits   IEEE 64-bit
pointers             32 bits   binary

Exponential Notation
The following are equivalent representations of 1,234:
123,400.0 x 10^-2
12,340.0 x 10^-1
1,234.0 x 10^0
123.4 x 10^1
12.34 x 10^2
1.234 x 10^3
0.1234 x 10^4
The representations differ in that the decimal place (the "point") "floats" to the left or right, with the appropriate adjustment in the exponent.

Parts of a Floating Point Number
In -0.9876 x 10^-3 the parts are: sign of mantissa (-), location of decimal point, mantissa (0.9876), base (10), sign of exponent (-), and exponent (3).

IEEE 754 Standard
Most common standard for representing floating point numbers.
Single precision: 32 bits, consisting of
  Sign bit (1 bit)
  Exponent (8 bits)
  Mantissa (23 bits)
Double precision: 64 bits, consisting of
  Sign bit (1 bit)
  Exponent (11 bits)
  Mantissa (52 bits)

Single Precision Format
32 bits, from most significant to least significant: sign of mantissa (1 bit), exponent (8 bits), mantissa (23 bits).

Normalization
The mantissa is normalized:
Has an implied binary point on the left.
Has an implied "1" on the left of the binary point.
E.g., the mantissa 10100000000000000000000 represents 1.101 (binary) = 1.625 (decimal).

Excess Notation
To include positive and negative exponents, "excess" notation is used.
Single precision: excess 127
Double precision: excess 1023
The value of the exponent stored is larger than the actual exponent.
E.g., in excess 127 the stored exponent 10000111 represents 135 - 127 = 8.

Example
Single precision: 0 10000010 11000000000000000000000
Sign 0 = positive mantissa
Exponent 10000010: 130 - 127 = 3
Mantissa 11000000000000000000000: 1.11 (binary)
+1.11 (binary) x 2^3 = 1110.0 (binary) = 14.0 (decimal)

Hexadecimal
It is convenient and common to represent the original floating point number in hexadecimal.
The preceding example:
0 10000010 11000000000000000000000 = 0100 0001 0110 0000 0000 0000 0000 0000 (binary) = 41600000 (hex)

Converting from Floating Point
E.g., what decimal value is represented by the following 32-bit floating point number? C17B0000 (hex)

Step 1
Express in binary and find S, E, and M.
C17B0000 (hex) = 1 10000010 11110110000000000000000 (binary)
S = 1, E = 10000010, M = 11110110000000000000000
(S: 1 = negative, 0 = positive)

Step 2
Find the "real" exponent, n.
n = E - 127 = 10000010 (binary) - 127 = 130 - 127 = 3

Step 3
Put S, M, and n together to form the binary result. (Don't forget the implied "1." on the left of the mantissa.)
-1.1111011 (binary) x 2^n = -1.1111011 (binary) x 2^3 = -1111.1011 (binary)

Step 4
Express the result in decimal.
-1111.1011 (binary): 1111 = 15, and .1011 = 2^-1 + 2^-3 + 2^-4 = 0.5 + 0.125 + 0.0625 = 0.6875
Answer: -15.6875
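The four decoding steps can be sketched in C by extracting S, E, and M with shifts and masks and then rebuilding the value. The struct and function names are hypothetical, and the sketch handles only normalized numbers (stored exponent between 1 and 254), not zero, denormals, infinities, or NaN.

```c
#include <stdint.h>

/* The three fields of an IEEE 754 single precision word. */
typedef struct {
    unsigned sign;       /* S: 1 bit   */
    unsigned exponent;   /* E: 8 bits, excess 127 */
    uint32_t mantissa;   /* M: 23 bits */
} ieee754_fields;

/* Step 1: split the 32-bit pattern into S, E, and M. */
ieee754_fields split_fields(uint32_t bits)
{
    ieee754_fields f;
    f.sign = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Steps 2 to 4: value = (-1)^S x (1 + M / 2^23) x 2^(E - 127).
   Assumes a normalized number. */
double decode_normalized(uint32_t bits)
{
    ieee754_fields f = split_fields(bits);
    double value = 1.0 + (double)f.mantissa / 8388608.0;   /* implied "1." plus M / 2^23 */
    int n = (int)f.exponent - 127;                         /* the "real" exponent */

    while (n > 0) { value *= 2.0; n--; }
    while (n < 0) { value /= 2.0; n++; }
    return f.sign ? -value : value;
}
```

Calling decode_normalized(0xC17B0000u) reproduces the worked answer of -15.6875.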

Converting to Floating Point
E.g., express 36.5625 (decimal) as a 32-bit floating point number (in hexadecimal).

Step 1
Express the original value in binary.
36.5625 (decimal) = 100100.1001 (binary)

Step 2
Normalize.
100100.1001 (binary) = 1.001001001 (binary) x 2^5

Step 3
Determine S, E, and M.
+1.001001001 (binary) x 2^5, so n = 5
S = 0 (because the value is positive)
E = n + 127 = 5 + 127 = 132 = 10000100 (binary)
M = 001001001 followed by zeros to fill 23 bits

Step 4
Put S, E, and M together to form the 32-bit binary result.
0 10000100 00100100100000000000000 (binary)

Step 5
Express in hexadecimal.
0 10000100 00100100100000000000000 (binary) = 0100 0010 0001 0010 0100 0000 0000 0000 (binary) = 42124000 (hex)
Answer: 42124000 (hex)
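The encoding steps can be sketched in C by normalizing the value with repeated halving or doubling and then packing S, E, and M. The function name is hypothetical. The sketch assumes a finite value in the normalized single precision range, and it truncates any mantissa bits beyond 23 instead of rounding.

```c
#include <stdint.h>

/* Encode a value as an IEEE 754 single precision bit pattern.
   Assumes a finite value in the normalized range; extra mantissa
   bits are truncated rather than rounded. */
uint32_t encode_normalized(double value)
{
    uint32_t sign = 0;
    int n = 0;

    if (value == 0.0) {
        return 0;                      /* +0 is the all-zero pattern */
    }
    if (value < 0.0) {
        sign = 1;
        value = -value;
    }

    /* Step 2: normalize to 1.xxxx x 2^n. */
    while (value >= 2.0) { value /= 2.0; n++; }
    while (value < 1.0)  { value *= 2.0; n--; }

    /* Step 3: E = n + 127, M = fraction after the implied "1." scaled by 2^23. */
    uint32_t exponent = (uint32_t)(n + 127);
    uint32_t mantissa = (uint32_t)((value - 1.0) * 8388608.0);

    /* Step 4: pack S, E, and M into 32 bits. */
    return (sign << 31) | (exponent << 23) | mantissa;
}
```

Calling encode_normalized(36.5625) produces 0x42124000, matching the worked answer.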

Comparison of fixed point and floating point

Fixed point                                          Floating point
Limited dynamic range                                Large dynamic range
Overflow and quantization errors must be resolved    Easier to program since no scaling is required
Long product development time                        Quick time to market
Cheaper                                              More expensive
Lower power consumption                              Higher power consumption

Fixed point: consumer applications such as MP3 players, multimedia gaming, and digital cameras.
Floating point: high-end audio applications such as ambient acoustic simulation, audio encoding/decoding, and audio mixing.

Thank you