Introduction to MMX, XMM, SSE and SSE2 Technology

Introduction to MMX, XMM, SSE and SSE2 Technology. Multimedia Extension, Streaming SIMD Extension. 11/23/98, 5/6/99, 2/5/03, 5/10/04, 5/4/05

SISD - Single Instruction, Single Data. Traditional computers. In general, one instruction processes one data item. (Diagram: Control Unit, Memory, Execution Unit.)

SIMD - Single Instruction, Multiple Data. One instruction can process multiple data items. Useful when large amounts of regularly organized data are processed. Example: matrix and vector calculations. This is the basis of MMX and XMM. (Diagram: Control Unit, Memory, Execution Units.)

MISD: Multiple instructions process one data item. (Diagram: Memory, Control Unit, Execution Units.)

MIMD: Multiple instructions process multiple data items. (Diagram: Control Unit, Memory, Execution Unit.)

Your Turn. How would you classify a traditional computer under this system? How would you classify a Shemp, which has multiple processors? How would you classify a computer having an Intel Dual Core processor?

Potential Applications. MMX and SSE: graphics, MPEG video/image processing, music synthesis, speech compression/recognition, video conferencing, matrix and vector calculations. SSE2: advanced 3D graphics, speech recognition, scientific and engineering applications.

MMX: 4 new data types. New instructions. Uses the low 64 bits of the 8 existing 80 bit floating point registers.

The floating point registers. Floating point is processed by eight 80 bit registers, ST(0), ST(1), … ST(7), in the floating point unit. When doing floating point arithmetic, these registers are organized as a stack. Programming floating point is quite different than programming integer arithmetic. Floating point calculations are done using 80 bits even when the program specifies storing 32 or 64 bit data values.

Advantages of using the floating point registers in MMX. The registers already exist; only logic had to be added to the chip. The operating system already knows about the floating point registers. When a computer switches from one program to another, the state (registers) of the current program must be saved so that it can be restored when the program becomes the active program once again. The floating point registers are automatically saved as part of the state of a program. MMX worked under existing operating systems!

New data types for MMX. Each data item is 64 bits long. One data item can store: 8 one-byte integers, 4 two-byte integers, 2 four-byte integers, or 1 eight-byte integer.

SSE and SSE2. SSE stands for Streaming SIMD Extensions. SSE introduced eight 128 bit XMM registers. These registers are disjoint from the floating point/MMX registers. SSE (Pentium III) can handle 4 single precision floating point numbers at a time. SSE2 (Pentium 4) can also handle 2 double precision floating point numbers, and extends the integer operations to the XMM registers.

New data types for XMM. Each data item is 128 bits long and can be used as: 16 one-byte integers, 8 two-byte integers, 4 doubleword integers or single precision floating point numbers, or 2 quadword integers or double precision floating point numbers.
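To make the lane counts concrete, here is a minimal sketch that reinterprets the same 16 bytes in each of the ways listed above. It uses Python's standard `struct` module purely as a model; the byte values and the dictionary labels are assumptions chosen for illustration, and nothing here touches a real XMM register.

```python
import struct

# 16 arbitrary bytes standing in for the contents of one 128 bit XMM register
# (illustrative values, not taken from the slides).
raw = bytes(range(16))

# Little-endian views of the same 128 bits. The format codes are Python
# struct codes, not Intel mnemonics.
views = {
    "16 one-byte integers": struct.unpack("<16B", raw),
    "8 two-byte integers": struct.unpack("<8H", raw),
    "4 doubleword integers": struct.unpack("<4I", raw),
    "2 quadword integers": struct.unpack("<2Q", raw),
    "4 single precision floats": struct.unpack("<4f", raw),
    "2 double precision floats": struct.unpack("<2d", raw),
}

for name, lanes in views.items():
    print(f"{name}: {len(lanes)} lanes")
```

The bits never change; only the instruction chosen decides how many lanes they are treated as.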

Your turn. Your program uses 3 arrays, each holding 160,000 one-byte integers. We need to add the elements in the first two arrays to calculate the third array. Using a standard Pentium, how many “operations” are needed? (One operation includes loading 2 values into the CPU, adding, storing the result, and the associated loop processing.) How many XMM operations would be needed?

New instructions. Process the new data types 16, 8, 4, or 2 data items (64 bits or 128 bits) at a time. Types of instructions: Add/Subtract; Multiply/Multiply and add; Shift; Logical (AND, AND NOT, OR, XOR); Pack and unpack; Move; Shuffle and unpack (SSE).

Saturation. Handling overflow when adding 16, 8, 4, or 2 values at a time is a problem. Programmers can specify that when overflow occurs, the “sum” should be replaced by the maximum legal value. Example: unsigned byte addition 80h + A0h = 120h ===> overflow. Instead the machine stores FFh. Likewise when subtracting: a result below the minimum legal value is replaced by that minimum (00h for unsigned bytes).
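The saturation rule can be sketched lane by lane as follows. This is a scalar Python model, not real SIMD code; the function names borrow the MMX mnemonics PADDUSB, PSUBUSB and PADDB only as labels, and the list-of-integers representation of a packed register is an assumption made for illustration.

```python
def paddusb(a, b):
    """Model of an unsigned saturating byte add: sums above FFh clamp to FFh."""
    return [min(x + y, 0xFF) for x, y in zip(a, b)]


def psubusb(a, b):
    """Model of an unsigned saturating byte subtract: results below 00h clamp to 00h."""
    return [max(x - y, 0x00) for x, y in zip(a, b)]


def paddb(a, b):
    """Model of the ordinary wraparound byte add, for comparison."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


# The slide's example: 80h + A0h = 120h does not fit in a byte.
print([hex(v) for v in paddusb([0x80], [0xA0])])  # saturating result
print([hex(v) for v in paddb([0x80], [0xA0])])    # wraparound result
```

Saturation matters for pixel and audio data: a sum that wraps around turns a very bright pixel dark, whereas a clamped sum merely stays at full brightness.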

Comparison operations. Consider the <, >, <=, >=, =, and <> (not equal) operations. Consider comparing two 64 bit quantities, each holding 8, 4, or 2 values. Comparing multiple values at a time is a problem, because a single set of condition flags cannot report a separate result for each value. So the MMX instructions store 0 for false and -1 (all bits set) for true in each of the individual data items.
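A minimal sketch of how such per-item results can be used: the compare produces a mask of all ones or all zeros in each lane, and logical operations then pick values without any branching. This is a scalar Python model over lists of 16 bit lanes; `pcmpgtw` borrows the MMX mnemonic as a label, while `select` is a hypothetical helper name, not an instruction.

```python
LANE_MASK = 0xFFFF  # one 16 bit lane


def pcmpgtw(a, b):
    """Model of a packed greater-than compare: FFFFh (-1) where a > b, else 0000h."""
    return [LANE_MASK if x > y else 0 for x, y in zip(a, b)]


def select(mask, if_true, if_false):
    """Hypothetical helper: (mask AND t) OR (NOT mask AND f), lane by lane."""
    return [(m & t) | (~m & LANE_MASK & f)
            for m, t, f in zip(mask, if_true, if_false)]


a = [5, 1, 7, 3]
b = [2, 4, 7, 9]
mask = pcmpgtw(a, b)
print([hex(m) for m in mask])
print(select(mask, a, b))  # lane-wise maximum of a and b
```

Because true is all ones rather than 1, the mask can be ANDed directly with data, which is what makes the 0 / -1 convention convenient.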

Example 1: Calculating Dot Products. Consider calculating $S = \sum_{i=0}^{7} A_i B_i$ using MMX. Assume Ai and Bi are stored as signed 16 bit integers. Assume that the products and sums should be calculated using 32 bits. Assume that all values have two “binary” places.

Example 1: Calculating Dot Products. Storing A and B (64 bit vectors). (Diagram: A and B laid out with subscripts 0 through 7 at byte offsets 0, 2, 4, 6, 8, 10, 12, 14.) We store each Ai and Bi item as a 16 bit integer, 4 per 64 bit data item. Assume each value has 2 binary places.

Example 1: Calculating Dot Products. Multiply and add instruction. (Diagram: adjacent 16 bit products are summed in pairs into 32 bit results, for example 2*3 + 20*40 = 806 and 4*5 + 30*50 = 1520.)

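Example 1: MMX: Calculating Dot Products. (Diagram: a packed multiply and add turns the words 2, 20, 4, 30 and 3, 40, 5, 50 into the doublewords 806 and 1520; a packed add combines pairs of partial sums; a normal add then produces the single total, here 806 + 1520 = 2326.)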

Example 1: Calculating Dot Products. Approximate algorithm: (1) Load the left half of A into an FP register. (2) Multiply and add by the left half of B (4 words at a time). (3) Shift the products right 2 bits, since products should have only two binary places. (4) Repeat with the right halves of A and B using a different register. (5) Add the second sum to the first (two doublewords at a time). (6) Store the result.

Example 1: Calculating Dot Products. Approximate algorithm (conclusion): Add the two sums together in EAX to get the final sum (1 doubleword at a time).
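The approximate algorithm can be sketched as a scalar Python model. `pmaddwd` borrows the name of the MMX packed multiply and add instruction as a label; `dot8` and its `frac_bits` parameter are hypothetical names introduced here. The model uses unbounded Python integers, so it does not reproduce 32 bit overflow, and it is an illustration of the steps rather than the code Intel counted.

```python
def pmaddwd(a, b):
    """Model of packed multiply and add: 4 word products summed in pairs
    into 2 doubleword results."""
    p = [x * y for x, y in zip(a, b)]
    return [p[0] + p[1], p[2] + p[3]]


def dot8(A, B, frac_bits=2):
    """Dot product of two 8-element fixed-point vectors (2 binary places)."""
    left = pmaddwd(A[:4], B[:4])      # left halves, 4 words at a time
    right = pmaddwd(A[4:], B[4:])     # right halves, in a second register
    # Each product has 4 binary places; shift back down to 2.
    left = [v >> frac_bits for v in left]
    right = [v >> frac_bits for v in right]
    sums = [l + r for l, r in zip(left, right)]  # packed add, 2 doublewords
    return sums[0] + sums[1]                     # final scalar add (EAX)


# The numbers from the multiply and add diagram.
print(pmaddwd([2, 20, 4, 30], [3, 40, 5, 50]))

# Fixed-point check: 1.0 is stored as 4 and 2.0 as 8 (2 binary places).
print(dot8([4] * 8, [8] * 8) / 4)
```

Shifting before the final additions, as the slide's algorithm does, keeps every partial sum at two binary places, at the cost of discarding the low bits of each pair of products.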

Example 1: Calculating Dot Products. Intel claims that standard Pentiums would require 40 instructions to carry this out. Using MMX technology, only 13 instructions are needed. Speed improves by an even greater ratio.

Example 2: 24-bit color video blending. Suppose we are displaying 640 by 480 pixel video that uses 24 bit colors: 8 bits for red, 8 for green, and 8 for blue. Suppose we are currently showing one picture, which we want to fade out and replace by “fading” in a second picture. Suppose that we want to do the fade out/in in 255 steps.

Example 2: 24-bit color video blending. For each step, for each of the 3 colors, and for each of the 640 by 480 pixels, we must calculate: Result_pixel = NewPicture_pixel * (i/255) + OldPicture_pixel * (1 - (i/255)), where “i” is the step counter. This formula must be calculated 640 * 480 * 3 * 255 = 235,008,000 times on 8 bit data!

Example 2: 24-bit color video blending Intel calculates that this requires execution of 1.4 billion instructions on a standard PC even ignoring the calculation of i/255 and (1-i/255) and loop control. With MMX, we can calculate 4 values in parallel. The number of MMX instructions would be 525 million. (Because the multiply instruction only applies to word data, the byte data must be unpacked into words and repacked after the calculation.)
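A rough model of one blending pass over four colour values is sketched below. The function name `blend4`, the rounding rule, and the list representation are all assumptions made for illustration; the slides do not say how the division by 255 is carried out, and real MMX code would use unpack, multiply, and pack instructions rather than Python lists.

```python
# Number of times the blend formula is evaluated, as stated on the slide.
CALCULATIONS = 640 * 480 * 3 * 255


def blend4(new, old, i):
    """Blend four 8 bit colour values at step i (0..255).

    Models the unpack / multiply / repack pattern with integer arithmetic.
    """
    # Unpack: widen each byte to a word so the products fit in 16 bits.
    new_w = [v & 0xFF for v in new]
    old_w = [v & 0xFF for v in old]
    # Weighted sum; at most 255 * 255 = 65025, which fits an unsigned word.
    mixed = [n * i + o * (255 - i) for n, o in zip(new_w, old_w)]
    # Scale back down (rounding to nearest is an assumption) and repack.
    return [min((m + 127) // 255, 0xFF) for m in mixed]


print(CALCULATIONS)
print(blend4([255, 0, 100, 200], [0, 255, 100, 50], 128))
```

Multiplying by i and 255 - i first and dividing once at the end keeps the whole calculation in integers, which is what the word-sized multiply instructions operate on.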

Also included in MMX. Intel increased cache size when MMX was introduced (necessary for SIMD machines). Programs run faster on MMX machines even if the SIMD instructions are not used. Excellent marketing: programs run faster on MMX machines, so people want and buy MMX, and software publishers are encouraged to rewrite programs to take advantage of the new instructions.

Information source: http://www.intel.com/drg/mmx/manuals/overview/index.htm#intro (no longer available); http://developer.intel.com/drg/mmx/manuals/ (no longer available); http://www.intel.com/design/Pentium4/manuals/24547012.pdf (IA-32 Intel Architecture Software Developer’s Manual, vol. 1). This slide show is MMX.PPT.

Your Turn 1. Characterize the kinds of problems where SIMD is helpful. 2. Give examples of problems where SIMD is useful.