Comparison of Two Processors

Comparison of Two Processors: P-3 vs. Athlon

Objectives: Similarities, Architectural Differences, Head to Head Comparisons, Questions. Road map: start at a high level, then go into detail on the differences.

Similarities: Microarchitecture; 4 FP Operations per Cycle; Caching, Pre-fetching, and Streaming Controls; Multiple Processor Support.

Both microarchitectures are three-issue, out-of-order designs with long pipelines. Both can execute four 32-bit FP operations per clock cycle. The primary difference between 3DNow! and ISSE is that 3DNow! performs single-precision SIMD on registers holding two parallel values and has two pipelines, while ISSE uses registers holding four parallel values with only one pipeline. Either way, both peak at 4 SIMD FP operations per cycle.
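The peak-throughput claim is simple arithmetic: values per SIMD register times the number of pipelines. A minimal sketch of that calculation follows; the function and variable names are illustrative and not taken from the presentation.

```python
def peak_fp_ops_per_cycle(values_per_register: int, pipelines: int) -> int:
    """Peak single-precision FP operations per clock for a SIMD unit."""
    return values_per_register * pipelines


# 3DNow!: two single-precision values per register, two pipelines.
threednow_peak = peak_fp_ops_per_cycle(values_per_register=2, pipelines=2)

# ISSE: four single-precision values per register, one pipeline.
isse_peak = peak_fp_ops_per_cycle(values_per_register=4, pipelines=1)

print(threednow_peak, isse_peak)
```

These are peak figures only; sustained throughput depends on instruction mix and memory behavior.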

Athlon’s Architecture: ISA, Instruction Stream, Data Stream

ISA: Athlon's 3DNow! adds 5 new DSP instructions, for 45 new instructions in total.

Athlon's 3DNow! implements 19 new MMX/SSE and cache control instructions that are identical to the corresponding P-3 new media instructions. This intelligent move allows Athlon to compete with P-3 in video encoding and decoding and in multimedia instructions. These instructions do not, however, bring Athlon up to SSE's SIMD FP standard. But given Intel's half-wide implementation of SIMD FP, Athlon can still provide stiff competition for P-3. In addition to these 19, AMD added 5 new DSP instructions to improve the performance of applications such as soft modems and Dolby AC-3 and MP3 audio codecs. Including other instructions, AMD totals 45 new instructions compared to P-3's 71.

Instruction Stream: Instruction Decode, Instruction Control Unit, Execution Units, Branch Prediction

Instruction Decode: Athlon takes 16 bytes from the I-cache and can produce 3 micro-ops per cycle using the three x86 decoders in the direct path; a fourth decoder handles complex instructions in the vector path. The decoded instructions are stored in the 72-entry ROB. In comparison, P-3 has 2 simple decoders and 1 complex decoder, with only a 20-entry reservation station queue and a 40-entry ROB.

Instruction Control Unit: Up to 72 instructions can be in flight at any given time, a result of the 72-entry ROB. From here the ICU takes over. The ICU controls instruction issue, register renaming, and out-of-order execution of integer instructions. It has two distinct 24-entry register files. The Future File holds the current state of the processor and is updated as instructions complete execution. The results of execution are stored in the ROB and are retired to the Architectural File in program order. On an exception, the processor can be quickly restored to the correct architectural state using a broadside copy of the Architectural File into the Future File, hence the 24 ports. The advantage of the Future File is that it reduces the lookup time and complexity associated with a classical ROB.
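The interplay of ROB, Future File, and Architectural File can be sketched in a few lines. This is a simplified model under stated assumptions, not AMD's implementation: the class and method names are hypothetical, register values are plain integers, and an exception simply flushes every unretired instruction.

```python
from collections import deque

NUM_REGS = 24  # per the text: two 24-entry register files


class FutureFileCore:
    """Toy model of a Future File / Architectural File pair with a ROB."""

    def __init__(self):
        self.arch_file = [0] * NUM_REGS    # committed, in-order state
        self.future_file = [0] * NUM_REGS  # most recent completed results
        self.rob = deque()                 # in-flight entries, program order
        self.latest_writer = {}            # dest register -> youngest entry

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.rob.append(entry)
        self.latest_writer[dest] = entry
        return entry

    def complete(self, entry, value):
        entry["value"] = value
        entry["done"] = True
        # Only the youngest writer of a register may update the Future File,
        # so an older instruction finishing late cannot clobber newer data.
        if self.latest_writer.get(entry["dest"]) is entry:
            self.future_file[entry["dest"]] = value

    def retire(self):
        # Retire strictly in program order, stopping at the first
        # instruction that has not finished executing.
        while self.rob and self.rob[0]["done"]:
            entry = self.rob.popleft()
            self.arch_file[entry["dest"]] = entry["value"]

    def exception(self):
        # Broadside copy: discard speculative state in one step.
        self.rob.clear()
        self.latest_writer.clear()
        self.future_file = list(self.arch_file)
```

Because operands are read straight from the Future File, a consumer does not have to search the ROB for the newest value of a register, which is the lookup saving the slide refers to.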

Execution Units
FP units: Athlon 3, P-3 1
Integer units: Athlon 3
Address calculation units: Athlon 3, P-3 2

Athlon has a 10-stage integer pipe and a 15-stage FP pipe; all pipes are out-of-order and superscalar, each with 1-cycle throughput, vs. P-3's 17-stage integer pipe and 30-stage FP pipe.

Branch Prediction: Athlon has a 4096-entry BHT of 2-bit saturating prediction counters. The BHT is accessed gshare-style, using an 8-bit global history register hashed with 4 bits of the PC. Athlon also has a 4K Branch Target Address Cache integrated with the instruction cache. For each 16-byte fetch quantum, the I-cache keeps two branch target addresses and a 2-bit selector that indicates whether the branch is sequential or provided by the return stack. Athlon has a 12-entry return stack to take advantage of procedure call and return branches, and an average hit rate of 95%. In comparison, P-3 has a 512-entry Branch Target Buffer with a 4K BHT and a 4-entry return stack, also with a 95% hit rate.
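A global-history-indexed table of 2-bit saturating counters can be sketched as follows. The slide does not give the exact hash, so this sketch makes two loud assumptions: the 8 history bits are concatenated with 4 PC bits (8 + 4 = 12 bits, which exactly indexes 4096 entries) rather than XORed as in textbook gshare, and the PC bits used are the four just above the 16-byte fetch offset. All names are illustrative.

```python
class GlobalHistoryPredictor:
    ENTRIES = 4096     # BHT size from the text
    HISTORY_BITS = 8   # global history register width from the text
    PC_BITS = 4        # PC bits mixed into the index, from the text

    def __init__(self):
        # 2-bit counters: 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * self.ENTRIES  # start weakly not-taken
        self.history = 0

    def _index(self, pc: int) -> int:
        # ASSUMPTION: use PC bits [7:4] and concatenate with the history.
        pc_part = (pc >> 4) & ((1 << self.PC_BITS) - 1)
        return (self.history << self.PC_BITS) | pc_part

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0
        # Shift the outcome into the global history register.
        mask = (1 << self.HISTORY_BITS) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

A branch that is always taken needs several updates before it is predicted correctly, because each new history pattern selects a fresh counter until the history register fills with ones.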

Data Stream: Cache, Bus Architecture

Cache
L1 cache: Athlon 128KB, P-3 32KB
L2 cache: Athlon 256KB (on chip); P-3 256KB (on chip) or 512KB (external)
Also covered: cache controller, fill buffers, write-back buffers

L1 cache: Athlon has four times the L1 cache of the P-3 (128KB vs. 32KB), split into a 64KB data cache and a 64KB instruction cache, each 2-way set associative. AMD has a dedicated snoop port to keep system coherency traffic from interfering with application performance; the P-3 has none. The Athlon L1 also supports concurrent accesses by two 64-bit loads or stores.

L2 cache: The on-chip L2 cache is 16-way set associative in the Athlon; in comparison, the P-3 has an 8-way set associative cache.

Controller: The P-3 cache controller sits as a separate component on the module, while AMD has integrated its controller into the die.

Athlon has a multi-level 512-entry TLB (Translation Look-aside Buffer) and 8-bit ECC (error correction code). Athlon supports 64-byte cache line transfers, twice the size of P-3's. Athlon's cache architecture is the first to incorporate a system-based MOESI (Modified, Owned, Exclusive, Shared, Invalid) protocol for multiprocessing support. Fill and bus buffers: Athlon has 8 of each vs. 6 and 4 for P-3.
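To make the cache geometry concrete, the sketch below derives the number of sets from size, associativity, and line size, and splits an address into tag, index, and offset. It assumes the 64KB, 2-way L1 caches use the 64-byte line size mentioned for Athlon cache line transfers; the function names are illustrative, and the sketch ignores whether the real cache is virtually or physically indexed.

```python
def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def split_address(addr: int, sets: int, line_bytes: int):
    """Return (tag, set index, byte offset) for an address."""
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# One Athlon L1 cache as described: 64KB, 2-way, assumed 64-byte lines.
athlon_l1_sets = num_sets(64 * 1024, ways=2, line_bytes=64)

# The 256KB, 16-way on-chip L2, under the same line-size assumption.
athlon_l2_sets = num_sets(256 * 1024, ways=16, line_bytes=64)

print(athlon_l1_sets, athlon_l2_sets)
print(split_address(0x12345, athlon_l1_sets, 64))
```

Higher associativity means fewer sets for a given capacity, which is why the 16-way L2 ends up with fewer sets than the much smaller 2-way L1.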

Bus Architecture: System Bus, Address Size, Multiprocessor Support.

Athlon uses Digital Equipment Corporation's EV6 bus protocol, which is designed primarily for bursty traffic. The bus has 13 address channels per direction (26 address channels in total) and 64 data channels. The address channels into the processor are also needed for snooping. The address size is 43 bits for the Athlon and 36 bits for P-3, so the Athlon can address up to 8 terabytes while the P-3 can address only 64 GB (about 68.7 billion bytes). Athlon uses a point-to-point bus, while the P-3 uses a shared bus. In the P-3, the processors must share the bandwidth and channels. With Athlon, each processor has its own path to the chipset and hence the full bandwidth. This, however, complicates board design for multiprocessor systems. To address this problem, AMD is using Digital's Tsunami dual-processor chipset.
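The address-space figures follow directly from the address widths: an n-bit address reaches 2^n bytes. A short check, with an illustrative function name:

```python
def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a byte-addressed bus of the given width."""
    return 1 << address_bits


athlon_bytes = addressable_bytes(43)
p3_bytes = addressable_bytes(36)

TIB = 1 << 40  # binary terabyte
GIB = 1 << 30  # binary gigabyte

print(athlon_bytes // TIB, "TiB for a 43-bit address")
print(p3_bytes // GIB, "GiB for a 36-bit address")
print(round(p3_bytes / 1e9, 1), "billion bytes for a 36-bit address")
```

In binary units a 36-bit address covers 64 GiB; expressed in decimal units the same space is about 68.7 billion bytes.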

Data Stream: Bus Speed, Peak Bandwidth, Outstanding Transactions, System Clock
Bus speed: Athlon 200-266 MHz vs. P-3 100-166 MHz
Peak bandwidth: Athlon 1.6-2.1 GB/sec vs. P-3 533-1060 MB/sec
Outstanding transactions: Athlon 24 per processor vs. P-3 4-8 per processor
System clock: Athlon source synchronous vs. P-3 common clock
Multiprocessing: Athlon point-to-point vs. P-3 shared
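The Athlon peak-bandwidth figures can be reproduced from the bus speed and the 64 data channels mentioned earlier, assuming one 8-byte transfer per bus clock at the quoted rate. The function name and that transfer assumption are illustrative; the sketch leaves the P-3 out because the presentation does not state its data bus width.

```python
def peak_bandwidth_mb_per_s(bus_mhz: int, data_bits: int = 64) -> int:
    """Peak transfer rate in MB/sec: transfers per second times bytes each."""
    bytes_per_transfer = data_bits // 8
    return bus_mhz * bytes_per_transfer


athlon_low = peak_bandwidth_mb_per_s(200)
athlon_high = peak_bandwidth_mb_per_s(266)

print(athlon_low, "MB/sec at 200 MHz")
print(athlon_high, "MB/sec at 266 MHz")
```

The results, 1600 and 2128 MB/sec, round to the 1.6 and 2.1 GB/sec quoted for Athlon.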

Head to Head: Integer Performance, Floating Point Performance, Multimedia, 3DNow! Implementation, Application Level

Benchmarked Specs: the hardware and software used to perform the benchmarks.

Integer Performance: Athlon leads by 8-17%.

Floating Point Performance: Athlon leads by 7-11%.

Multimedia Performance: Athlon outperforms by 5-25%.

3DNow! vs. SSE Implementation: Athlon is the winner by 7-41%.

Application Level Performance: Athlon leads by 4-15%.

Conclusion: GO ATHLON!! Athlon is superior in every sense of the word. It outperformed the P-3 in every benchmark category, by margins ranging from 4% to 41%.