Intel Pentium 4 ENCM 515 - 2002 Jonathan Bienert Tyson Marchuk.

Presentation transcript:

Intel Pentium 4
ENCM 515
Jonathan Bienert
Tyson Marchuk

Overview:
Product review
Specialized architectural features (NetBurst)
SIMD instruction capabilities (MMX, SSE2)
SHARC 2106x comparison

Intel Pentium 4
Reworked micro-architecture for high-bandwidth applications
–Internet audio and streaming video, image processing, video content creation, speech, 3D, CAD, games, multi-media, and multi-tasking user environments
These are DSP-intensive applications!
–What about uses other than in a PC?

Hardware Features (NetBurst micro-architecture)
Hyper pipelined technology
Advanced dynamic execution
Cache (data, L1, L2)
Rapid ALU execution engines
400 MHz bus
OOE (out-of-order execution)
Microcode ROM

Hyper Pipeline
20-stage pipeline!!!
Breaks down complex CISC instructions
–sub-stages mimic RISC
–faster execution

Filling the pipeline...
Review of next 126 instructions to be executed
Branch prediction
–if mispredicted, must flush the 20-stage pipeline!!!
–branch target buffer (BTB)
–4K branch history table (BHT)
–assembly instruction hints

Cache
8KB Data Cache
L1 Execution Trace Cache
–12K of previous micro-instructions stored
–saves having to translate
L2 Advanced Transfer Cache
–256K for data
–256-bit transfer every cycle allows ~77GB/s data transfer at 2.4GHz
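The ~77GB/s figure follows from the bus width and the clock rate. A minimal sketch of that arithmetic, assuming one full-width transfer per clock and decimal gigabytes (10^9 bytes); the function name is illustrative only:

```python
def l2_bandwidth_gb_per_s(bus_bits: int, clock_hz: float) -> float:
    """Peak transfer rate in GB/s, assuming one full-width transfer per clock."""
    bytes_per_cycle = bus_bits / 8          # 256 bits -> 32 bytes
    return bytes_per_cycle * clock_hz / 1e9


# Values taken from the slide: 256-bit transfers at 2.4 GHz.
bandwidth = l2_bandwidth_gb_per_s(256, 2.4e9)
print(f"{bandwidth:.1f} GB/s")
```

This is a peak figure; sustained bandwidth depends on how well the access pattern keeps the cache busy.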

Rapid ALU Execution Engines
2 ALUs
–allow parallel operations
Many arithmetic operations take 1/2 cycle
–each 2X ALU can have 2 operations per cycle

Software Features:
Multimedia Extensions (MMX)
–8 MMX registers
Streaming SIMD Extensions (SSE2)
–8 SSE/SSE2 registers
Standard x86 Registers
–EAX, EBX, ECX, EDX, ESI, etc.
–Register renaming to over 100

MMX (Multimedia Extensions)
Accelerated performance through SIMD
–multimedia, communication, internet applications
64-bit packed INTEGER data
–signed/unsigned

SSE2 (Streaming SIMD Extensions 2)
Accelerates a broad range of applications
–video, speech, image and photo processing, encryption, financial, engineering, and scientific applications
128-bit SIMD instruction formats
–4 single precision FP values
–2 double precision FP values
–16 byte values
–8 word values
–4 double word values
–2 quad word values
–1 128-bit integer value
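The element counts in this list are simply the 128-bit register width divided by the element width. A small sketch of that relation; the dictionary and function names are illustrative and not part of any Intel API:

```python
XMM_BITS = 128  # width of one SSE/SSE2 register

# Element widths in bits for the packed formats named on the slide.
ELEMENT_BITS = {
    "single precision FP": 32,
    "double precision FP": 64,
    "byte": 8,
    "word": 16,
    "double word": 32,
    "quad word": 64,
}


def lanes(element_bits: int, register_bits: int = XMM_BITS) -> int:
    """Number of packed elements that fit in one register."""
    return register_bits // element_bits


lane_counts = {name: lanes(bits) for name, bits in ELEMENT_BITS.items()}
for name, count in lane_counts.items():
    print(f"{count:2d} x {name}")
```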

SIMD Example (16-tap FIR filter - Real numbers)
Applications for real FIR filters
–general purpose filters in image processing, audio, and communication algorithms
Will utilize SSE2 SIMD instruction set

Thinking about SIMD
SSE2 instruction format is 128 bits
128-bit SSE2 registers
Many data formats! What precision do we want?
Let's use 32-bit floating point for coefficients, input, output
–4 data sets x 32-bit = 128 bits

Parallelizing
Requires many single multiplications (coefficients x inputs), then adding the results for output!
Multiplications… then need to perform additions...

Using SSE2 format
Can hold 4 elements of an array (of 32-bit data) in each 128-bit register
4 single precision floating point ops per cycle (32-bit)

Additions...
In both registers, now have 4 32-bit results
–First add the results into an accumulator register
4 single precision floating point ops per cycle (32-bit)

Additions...
In a register, now have 4 32-bit results
–however, NO SSE2 instruction to add these 4!
–But can use other instructions
Some BIT INTERTWINING… then add
–This will give results for several output values!
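The multiply, accumulate, and final add steps can be modelled in plain Python, with each 128-bit register represented as a list of four floats. This is only a sketch of the idea, not real SSE2 code: the helper names are made up for illustration, the shuffle pattern is one possible way to fold four lanes, and it computes a single output rather than the several-outputs-at-once scheme the slide alludes to.

```python
LANES = 4  # four 32-bit floats per 128-bit register


def packed_mul(a, b):
    """Lane-wise multiply, modelling one packed single-precision multiply."""
    return [x * y for x, y in zip(a, b)]


def packed_add(a, b):
    """Lane-wise add, modelling one packed single-precision add."""
    return [x + y for x, y in zip(a, b)]


def horizontal_sum(reg):
    """Fold 4 lanes into one value using two shuffle-then-add steps."""
    swapped = [reg[2], reg[3], reg[0], reg[1]]           # swap upper/lower halves
    pairs = packed_add(reg, swapped)                     # [r0+r2, r1+r3, r0+r2, r1+r3]
    rotated = [pairs[1], pairs[0], pairs[3], pairs[2]]   # swap neighbouring lanes
    return packed_add(pairs, rotated)[0]                 # every lane holds the total


def fir_output(coeffs, window):
    """One FIR output: sum of coeffs[k] * window[k], four taps per step."""
    assert len(coeffs) == len(window) and len(coeffs) % LANES == 0
    acc = [0.0] * LANES                                  # accumulator register
    for i in range(0, len(coeffs), LANES):
        products = packed_mul(coeffs[i:i + LANES], window[i:i + LANES])
        acc = packed_add(acc, products)
    return horizontal_sum(acc)


coeffs = [1.0 / 16] * 16                  # illustrative 16-tap moving average
window = [float(n) for n in range(16)]    # illustrative input samples
y = fir_output(coeffs, window)
print(y)
```

For 16 taps the loop issues four packed multiplies and four packed adds, and the lane fold happens only once at the end, which is why the missing "add these 4" instruction costs little per output.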

ADI SHARC 21k vs. P4: Disadvantages
Slower clock speed (40MHz vs 2400MHz)
Fewer opportunities for parallelism (5 vs 11)
Much less memory (Cache and System)
–Limited algorithm applicability
–Limited applications
Older (Less support – compiler)
–1994 vs 2001

ADI SHARC 21k vs. P4: Advantages
Hardware loops
Easier to program for optimal speed
Cheaper
Lower power consumption
Runs cooler

FIR Performance
Hard to obtain P4 performance numbers
Can estimate based on 2 FP multiplies per clock, clock rate, and the assumption that the pipeline can be kept full.
–2 * 2.4GHz ~ 4.8 billion multiplies per second
–If ~4 multiplies per element & samples/s
–FIR length > ~25k taps
SHARC => ~200 taps (Lab 4)
Factor of ~125x
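The estimate can be reproduced as a short calculation. The sample rate did not survive in this transcript, so the 48,000 samples/s used here is purely an assumption, chosen because it makes the arithmetic land on the quoted ~25k taps; the function and constant names are illustrative.

```python
def max_fir_taps(multiplies_per_s: float,
                 multiplies_per_tap: float,
                 sample_rate_hz: float) -> float:
    """Longest FIR that fits the multiply budget at a given sample rate."""
    return multiplies_per_s / (multiplies_per_tap * sample_rate_hz)


P4_MULTIPLIES_PER_S = 2 * 2.4e9     # 2 FP multiplies per clock at 2.4 GHz (slide)
MULTIPLIES_PER_TAP = 4              # "~4 multiplies per element" (slide)
ASSUMED_SAMPLE_RATE_HZ = 48_000     # ASSUMPTION: not stated in the transcript
SHARC_TAPS = 200                    # SHARC result quoted from Lab 4 (slide)

p4_taps = max_fir_taps(P4_MULTIPLIES_PER_S, MULTIPLIES_PER_TAP, ASSUMED_SAMPLE_RATE_HZ)
speedup = p4_taps / SHARC_TAPS
print(f"P4: ~{p4_taps:.0f} taps, ~{speedup:.0f}x the SHARC")
```

Like the slide, this assumes the pipeline is kept full, so it is an upper bound rather than a measurement.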

IIR Performance
Hard to obtain P4 performance numbers
No hardware circular buffers
Does have BTB, BHT, etc.
Prefetches ~256 bytes ahead of current position in code.

FFT Performance
Hard to obtain P4 performance numbers
Prime95 uses FFT to calculate Lucas-Lehmer test for Mersenne Primes
–Involves FFT, squaring and iFFT, etc.
256k points on P4 2.3GHz ~ ms
Compare to SHARC 2048 point FFT ~0.37ms
If SHARC could do 256k, 46.25ms (But…)
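The 46.25ms figure matches a straight linear scaling of the 2048-point time, reading "256k" as 256,000 points. The sketch below reproduces that number and, as an assumption that is not in the slides, also shows what an N log2 N scaling (the usual operation count of a radix-2 FFT) would predict. Function names are illustrative.

```python
import math

SHARC_FFT_POINTS = 2048
SHARC_FFT_MS = 0.37        # SHARC 2048-point FFT time (slide)
TARGET_POINTS = 256_000    # "256k" read as 256,000, which reproduces 46.25 ms


def scale_linear(ms: float, n_from: int, n_to: int) -> float:
    """Scale a measured time in proportion to the number of points."""
    return ms * n_to / n_from


def scale_n_log_n(ms: float, n_from: int, n_to: int) -> float:
    """Scale a measured time in proportion to N * log2(N) (assumed FFT cost)."""
    return ms * (n_to * math.log2(n_to)) / (n_from * math.log2(n_from))


linear_ms = scale_linear(SHARC_FFT_MS, SHARC_FFT_POINTS, TARGET_POINTS)
n_log_n_ms = scale_n_log_n(SHARC_FFT_MS, SHARC_FFT_POINTS, TARGET_POINTS)
print(f"linear: {linear_ms:.2f} ms, N log N: {n_log_n_ms:.2f} ms")
```

Neither estimate accounts for whether a 256k-point data set would fit in SHARC memory at all.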

Optimization Example
Hard to optimize Pentium 4 assembly
Example of multiplying by a constant, 10
Taken mainly from:

Multiplying by 10
Slowest way:
–IMUL EAX, 10
Usually optimal way (Visual C++ 6.0)
–LEA EAX, [EAX+EAX*4]
–SHL EAX, 1
–Shift – Add – Shift
–On most x86 processors takes 2 cycles
–Pentium MMX and before 3 cycles
–On Pentium 4 takes 6 cycles!

Multiplying by 10
Optimal for Pentium 4
–LEA ECX, [EAX + EAX]
–LEA EAX, [ECX+EAX*8]
–On most x86 still takes 2 cycles
–On Pentium 4 takes ~3 cycles (OOE, µops)
–But on older processors (Pentium MMX and before) this now takes 4 cycles!

Multiplying by 10
Best generic case
–LEA EAX, [EAX + EAX*4]
–ADD EAX, EAX
–On most x86 still takes 2 cycles
–On older processors (Pentium MMX and before) this now takes 3 cycles again
–On Pentium 4 this takes 4 cycles
Obviously really hard to optimize
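All three instruction sequences compute the same value; only their timing differs between processors. The sketch below models the register arithmetic with 32-bit wraparound to check that equivalence. It says nothing about cycle counts, and the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF  # EAX and ECX are 32-bit registers


def lea(base: int, index: int = 0, scale: int = 1) -> int:
    """Model LEA dst, [base + index*scale] with 32-bit wraparound."""
    assert scale in (1, 2, 4, 8)   # the only scale factors x86 addressing allows
    return (base + index * scale) & MASK32


def times10_lea_shl(eax: int) -> int:
    """LEA EAX, [EAX+EAX*4] then SHL EAX, 1: (x*5) << 1."""
    eax = lea(eax, eax, 4)
    return (eax << 1) & MASK32


def times10_two_lea(eax: int) -> int:
    """LEA ECX, [EAX+EAX] then LEA EAX, [ECX+EAX*8]: x*2 + x*8."""
    ecx = lea(eax, eax, 1)
    return lea(ecx, eax, 8)


def times10_lea_add(eax: int) -> int:
    """LEA EAX, [EAX+EAX*4] then ADD EAX, EAX: x*5 + x*5."""
    eax = lea(eax, eax, 4)
    return (eax + eax) & MASK32


for x in (0, 1, 7, 12345, 0x7FFFFFFF, 0xFFFFFFFF):
    expected = (x * 10) & MASK32   # what IMUL EAX, 10 leaves in EAX
    assert times10_lea_shl(x) == expected
    assert times10_two_lea(x) == expected
    assert times10_lea_add(x) == expected
```

The two-LEA variant has no dependency between computing x*2 and x*8 beyond the final add, which is consistent with the slide's note that out-of-order execution helps it on the Pentium 4.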

REFERENCES
Intel application note: AP Real and Complex Filter Using Streaming SIMD Extensions
graphics from: /p4-01.html