Streaming SIMD Extensions CSE 820 Dr. Richard Enbody.


Michigan State University Computer Science and Engineering

Why SSE?
– 3D multimedia
– Floating-point (FP) computation is the heart of 3D geometry
– An increase of x was required in order to have a visually perceptible difference in performance
– Accelerate single-precision FP

Other issues
– Feedback on MMX
– Cache instructions to improve memory accesses

New
– 70 new instructions
– 1 new state

2-Wide vs. 4-Wide SIMD-FP
– 4-wide single-precision FP per clock could be done without significant cost
– Double-cycle existing 64-bit hardware to get x improvements

More functional units?
– Much larger area and timing cost, from increased busses, register file ports, execution hardware, and scheduling complexity

Data Path Width?
– Current was 80 bits
– 256 bits is way too expensive
– Too wide a data path requires extra bandwidth
– 128 bits is a reasonable compromise

Registers
– Couldn't overlap with existing registers: the only 8 original 80-bit registers would yield
  – four 4-wide 128-bit registers, or
  – eight 2-wide 64-bit registers (no gain)
– Do not want to share with MMX
  – complexity
  – structural hazard

New Register Set (State)
– New registers allow concurrency
– The problem of adding a new state was resolved by implementing it earlier, so the O/S could support it before it was needed

SSE Registers

Pentium III Issues
– A 4-wide SIMD operation is held as two 64-bit micro-instructions
– So if instructions alternate between functional units, 4x speed is achievable
– Scalar instructions were included so that scalar and SIMD work could be combined
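The difference between the packed and scalar forms can be sketched with compiler intrinsics. This is a minimal illustration, assuming an x86 compiler that provides `<xmmintrin.h>`; the function names `add_packed` and `add_scalar` are hypothetical and not from the slides.

```c
#include <xmmintrin.h>  /* SSE intrinsics; assumes an x86 target with SSE */

/* Packed add (ADDPS): all four single-precision lanes are added at once. */
void add_packed(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Scalar add (ADDSS): only lane 0 is added; lanes 1..3 are copied from a. */
void add_scalar(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ss(va, vb));
}
```

Both forms use the same register set, which is what lets scalar and SIMD code be mixed freely.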

Memory
– Streaming data may not stay in cache, but you cannot go to memory on each access
– Solution: HINTS with no state change
  – prefetch instruction brings the next data into cache (can specify memory hierarchy level)
  – noncached stores
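A rough sketch of both hints in one loop follows. It assumes an x86 compiler with `<xmmintrin.h>`; the helper name `scale_stream`, the prefetch distance of 16 floats, and the choice of the `_MM_HINT_NTA` level are illustrative assumptions rather than anything prescribed by the slides.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Scale n floats from src into dst (hypothetical helper).
   Assumes n is a multiple of 4 and dst is 16-byte aligned, which the
   non-temporal store (MOVNTPS) requires. */
void scale_stream(float *dst, const float *src, size_t n, float k)
{
    const __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i < n; i += 4) {
        /* Hint only: request data a few iterations ahead. The distance of
           16 floats is an arbitrary choice for illustration. */
        if (i + 16 < n)
            _mm_prefetch((const char *)(src + i + 16), _MM_HINT_NTA);
        __m128 v = _mm_mul_ps(_mm_loadu_ps(src + i), vk);
        _mm_stream_ps(dst + i, v);   /* noncached (non-temporal) store */
    }
}
```

Neither hint changes architectural state: the results are the same with ordinary loads and stores, only the cache behavior differs. A caller that hands `dst` to another agent would normally issue a store fence first.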

Concurrency

Alignment
– Data must be aligned
– Fixing misalignment in hardware costs time, so a misaligned access raises an exception instead
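In practice, code either guarantees 16-byte alignment or checks for it and falls back to the separate unaligned load. A minimal sketch, assuming `<xmmintrin.h>` is available; `is_aligned16` and `sum4` are hypothetical helper names.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Returns nonzero if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Sum four floats at p. Uses the aligned load (MOVAPS) when it is safe and
   the slower unaligned load (MOVUPS) otherwise. Calling _mm_load_ps on a
   misaligned address would fault. */
float sum4(const float *p)
{
    __m128 v = is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
    float t[4];
    _mm_storeu_ps(t, v);
    return (t[0] + t[1]) + (t[2] + t[3]);
}
```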

IEEE compliance
– Two modes
  – IEEE Compliant (slower)
  – Flush-To-Zero (FTZ) (faster)
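The effect of the two modes can be observed by producing a result too small for a normalized float. The sketch below assumes an x86 compiler with `<xmmintrin.h>` and default (masked) floating-point exceptions; the function names are hypothetical.

```c
#include <float.h>
#include <xmmintrin.h>

/* Divide FLT_MIN by 3 in an SSE register. The result is too small to be a
   normalized float, so it underflows. */
float third_of_flt_min(void)
{
    volatile float tiny = FLT_MIN;   /* volatile: prevent constant folding */
    volatile float three = 3.0f;
    __m128 r = _mm_div_ss(_mm_set_ss(tiny), _mm_set_ss(three));
    volatile float out = _mm_cvtss_f32(r);
    return out;
}

/* Run the divide once in each mode and restore the caller's setting. */
void ftz_demo(float *ieee_result, float *ftz_result)
{
    unsigned int saved = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_OFF);
    *ieee_result = third_of_flt_min();   /* nonzero denormal */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    *ftz_result = third_of_flt_min();    /* flushed to zero */
    _MM_SET_FLUSH_ZERO_MODE(saved);
}
```

In IEEE mode the tiny result is kept as a denormal; in FTZ mode the hardware skips that handling and returns zero.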

Packed Operation

Barrier (Fence)
– New light-weight fence (SFENCE) instruction ensures that all stores that precede the fence are observed on the front-side bus before any subsequent stores are completed
– SFENCE is targeted for uses such as writing commands from the processor to the graphics accelerator
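A typical pattern is to write a block of data, fence, and only then write the flag that tells the consumer the data is ready. The sketch below assumes `<xmmintrin.h>` and a C11 compiler; the `cmd_buffer` type and `publish_commands` function are hypothetical and stand in for whatever interface a real device would use.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical command buffer shared with a consumer (for example a device
   or another thread). Names and layout are illustrative only. */
typedef struct {
    _Alignas(16) float commands[16];
    volatile int ready;
} cmd_buffer;

void publish_commands(cmd_buffer *cb, const float *src)
{
    for (size_t i = 0; i < 16; i += 4)
        _mm_stream_ps(cb->commands + i, _mm_loadu_ps(src + i));
    _mm_sfence();     /* all earlier stores are visible before later ones */
    cb->ready = 1;    /* the flag is written only after the fence */
}
```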

Conditional
– The basic single-precision FP comparison instruction (CMP) is similar to existing MMX instruction variants (PCMPEQ, PCMPGT) in that it produces a redundant mask per float of all 1's or all 0's, depending upon the result of the comparison
– The mask is used to implement conditional moves
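The mask-based conditional move can be sketched as a compare followed by AND, AND-NOT, and OR. This assumes `<xmmintrin.h>`; `select_lt` is a hypothetical name.

```c
#include <xmmintrin.h>

/* Per lane: out = (a < b) ? x : y, with no branches. */
void select_lt(const float a[4], const float b[4],
               const float x[4], const float y[4], float out[4])
{
    /* Each lane of mask is all 1s where a < b and all 0s elsewhere. */
    __m128 mask = _mm_cmplt_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 take_x = _mm_and_ps(mask, _mm_loadu_ps(x));     /* x where mask set   */
    __m128 take_y = _mm_andnot_ps(mask, _mm_loadu_ps(y));  /* y where mask clear */
    _mm_storeu_ps(out, _mm_or_ps(take_x, take_y));
}
```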

MIN/MAX CMOV
– The MAX/MIN instructions perform a conditional move in only one instruction by directly using the carry-out from the comparison subtraction to select which source to forward as a result
– Within 3D geometry and rasterization, color clamping is an example that benefits from the use of MINPS/PMIN
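Color clamping with the packed MIN/MAX instructions might look like the following. It is a minimal sketch assuming `<xmmintrin.h>`; `clamp4` is a hypothetical helper.

```c
#include <xmmintrin.h>

/* Clamp each of four color components to the range [lo, hi]. */
void clamp4(const float in[4], float lo, float hi, float out[4])
{
    __m128 v = _mm_loadu_ps(in);
    v = _mm_max_ps(v, _mm_set1_ps(lo));   /* MAXPS: raise values below lo */
    v = _mm_min_ps(v, _mm_set1_ps(hi));   /* MINPS: lower values above hi */
    _mm_storeu_ps(out, v);
}
```

Two instructions replace what would otherwise be a compare, mask, and select sequence per bound.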

MIN/MAX CMOV
– A fundamental component in many speech recognition engines is the evaluation of a Hidden-Markov Model (HMM); this function comprises upwards of 80% of execution time
– The PMIN instruction improves this kernel performance by 33%, giving a 19% application gain

Data Manipulation
– Organizing the display list for an ideal SIMD format is called Structure-of-Arrays (SOA), since the structure contains separate x, y, z, and w arrays
– Instructions which support conversion from Array-of-Structures (AOS) are supplied
– Converting to fit SIMD is better overall than executing AOS code inefficiently
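One way to sketch the AOS-to-SOA conversion is a 4x4 transpose of four vertices, using the `_MM_TRANSPOSE4_PS` macro from `<xmmintrin.h>` (which expands to shuffle and unpack instructions). The `vertex` type and `aos_to_soa4` name are assumptions made for illustration.

```c
#include <xmmintrin.h>

/* One vertex in Array-of-Structures layout. */
typedef struct { float x, y, z, w; } vertex;

/* Convert four AOS vertices into four SOA rows: xs, ys, zs, ws. */
void aos_to_soa4(const vertex v[4], float xs[4], float ys[4],
                 float zs[4], float ws[4])
{
    __m128 r0 = _mm_loadu_ps((const float *)&v[0]);   /* x0 y0 z0 w0 */
    __m128 r1 = _mm_loadu_ps((const float *)&v[1]);   /* x1 y1 z1 w1 */
    __m128 r2 = _mm_loadu_ps((const float *)&v[2]);   /* x2 y2 z2 w2 */
    __m128 r3 = _mm_loadu_ps((const float *)&v[3]);   /* x3 y3 z3 w3 */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);                /* r0 = x0 x1 x2 x3 */
    _mm_storeu_ps(xs, r0);
    _mm_storeu_ps(ys, r1);
    _mm_storeu_ps(zs, r2);
    _mm_storeu_ps(ws, r3);
}
```

After the conversion, one packed instruction processes the same component of four vertices, so no lanes are wasted.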

Reciprocal and Reciprocal Square Root
– Uses:
  – transformation
  – specular lighting
  – geometric normalization
– For a basic geometry pipeline, these instructions can improve overall performance on the order of 15%
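Geometric normalization is a natural fit for the approximate reciprocal square root. The sketch below assumes `<xmmintrin.h>` and SOA input; `normalize4` is a hypothetical name, and the code deliberately omits any refinement step, so results carry only the roughly 12 bits of precision the instruction provides.

```c
#include <xmmintrin.h>

/* Normalize four 3D vectors stored in SOA form, in place.
   Assumes every vector has nonzero length. Uses the approximate
   reciprocal square root (RSQRTPS) with no refinement. */
void normalize4(float x[4], float y[4], float z[4])
{
    __m128 vx = _mm_loadu_ps(x);
    __m128 vy = _mm_loadu_ps(y);
    __m128 vz = _mm_loadu_ps(z);
    /* Squared length of each vector: x*x + y*y + z*z. */
    __m128 len2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(vx, vx),
                                        _mm_mul_ps(vy, vy)),
                             _mm_mul_ps(vz, vz));
    __m128 inv = _mm_rsqrt_ps(len2);   /* approximately 1 / sqrt(len2) */
    _mm_storeu_ps(x, _mm_mul_ps(vx, inv));
    _mm_storeu_ps(y, _mm_mul_ps(vy, inv));
    _mm_storeu_ps(z, _mm_mul_ps(vz, inv));
}
```

The approximation replaces a full-precision divide and square root, which is where the speedup comes from.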

New MMX
– 3D rasterization is greatly improved by unsigned MMX multiply: application-level performance gain of 8%-10%
– Byte-masked write instruction selectively writes directly to memory, bypassing the cache

Packed Average
– Motion compensation is a key component of the MPEG-2 decode pipeline: reconstituting each frame of the output picture stream by interpolating between key frames
– This interpolation primarily consists of averaging operations between pixels from different macroblocks (16x16 pixel unit)
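The averaging operation can be sketched as follows. One assumption to flag: the original SSE instruction works on 64-bit MMX registers, whereas this sketch uses the later 128-bit SSE2 form (`_mm_avg_epu8` from `<emmintrin.h>`) because it is the more widely supported intrinsic. The function names are hypothetical.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: 128-bit integer form of the packed average */

/* Scalar reference: what the packed average computes for each byte.
   The +1 makes ties round up. */
uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Average 16 pixels from two blocks with one packed instruction. */
void avg16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_avg_epu8(va, vb));
}
```

The intermediate sum needs nine bits, which is why doing this with plain byte adds takes several instructions.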

Packed Average Speedup
– The PAVG instruction enabled a 25% kernel speedup on motion compensation in a DVD player
– At the application level: 4%-6% speedup
– The application-level gain can increase to 10% for higher-resolution HDTV formats

Packed Sum of Absolute Differences
– Video encode: 40%-70% of time is spent in motion estimation
– This single instruction replaces on the order of seven MMX instructions in the motion-estimation inner loop
– PSADBW has been found to increase motion-estimation performance by a factor of two
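A sum of absolute differences over one 16-pixel row can be sketched as below. As an assumption, it uses the 128-bit SSE2 form of PSADBW (`_mm_sad_epu8` from `<emmintrin.h>`) rather than the original 64-bit MMX-register form; `sad16` is a hypothetical name.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: 128-bit form of PSADBW */

/* Sum of absolute differences over 16 bytes, for example one row of a
   16x16 macroblock. */
unsigned sad16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Produces two partial sums, one per group of 8 bytes, each stored in
       the low 16 bits of its 64-bit half. */
    __m128i s = _mm_sad_epu8(va, vb);
    unsigned lo = (unsigned)_mm_cvtsi128_si32(s);      /* bytes 0..7  */
    unsigned hi = (unsigned)_mm_extract_epi16(s, 4);   /* bytes 8..15 */
    return lo + hi;
}
```

A motion-estimation loop would call this once per row and accumulate the results to score a candidate block.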

Improvements
– real-time rendering of complex worlds
– real-time video encoding (MPEG-1 & 2)
– DVD decode at 30 frames per second
– 1M-pixel HDTV format decode
– home video editing
– reduced speech error rates

Cost
– 10% increase in die size
– Similar to MMX cost