Comparing Intel's Core with AMD's K8 Microarchitecture. IS 3313, December 14th.

Why is the Core Better at Prefetching and Caching?
- 3 prefetchers per core: 2 data, 1 instruction
- 2 prefetchers for the shared L2 cache
- Eight prefetchers active in a Core 2 Duo CPU (2 cores x 3, plus 2 for the L2)
- Demand loads ("demand bandwidth") get priority over data prefetches
- Data prefetch uses the store port for the tag lookup. Why? There are more loads than stores

Cache Comparison

The Memory Subsystem
- K8 has a bigger L1 cache (2 x 64 KB), but Core's 8-way 32 KB cache will have a hit rate close to that of a 2-way 64 KB cache
- K8's on-die memory controller lowers the latency to RAM considerably
- But Core CPUs have much bigger caches and much smarter prefetching
- Core's L1 cache delivers about twice as much bandwidth, and its L2 cache is about 2.5 times faster than that of the Athlon 64 or Opteron
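The associativity claim above can be explored with a toy model. The sketch below is not a model of either CPU: it is a minimal LRU set-associative cache, and the 64-byte line size, the class name SetAssociativeCache and the synthetic trace are all assumptions made for illustration. It shows only why more ways can offset a smaller capacity when addresses collide in the same set.

```python
from collections import OrderedDict
import random


class SetAssociativeCache:
    """Toy LRU set-associative cache that only counts hits and misses."""

    def __init__(self, size_bytes, ways, line_bytes=64):  # 64-byte lines assumed
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (ways * line_bytes)
        # One OrderedDict per set; key order tracks recency (oldest first).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up one byte address; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)   # evict the least recently used line
        lines[tag] = None
        return False

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def hit_rate_for(size_bytes, ways, trace):
    """Replay a list of byte addresses through a fresh cache."""
    cache = SetAssociativeCache(size_bytes, ways)
    for address in trace:
        cache.access(address)
    return cache.hit_rate()


if __name__ == "__main__":
    # Hypothetical trace: a few hot buffers placed 32 KB apart, so they
    # compete for the same sets, plus some random noise.
    rng = random.Random(0)
    trace = []
    for _ in range(20000):
        if rng.random() < 0.9:
            trace.append(rng.randrange(4) * 32 * 1024 + rng.randrange(4096))
        else:
            trace.append(rng.randrange(1 << 20))
    print("8-way 32 KB:", hit_rate_for(32 * 1024, 8, trace))
    print("2-way 64 KB:", hit_rate_for(64 * 1024, 2, trace))
```

Real hit rates depend entirely on the workload, so the numbers this prints say nothing about actual Core or K8 behaviour.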

Decoding
“In almost every situation, the Core architecture has the advantage. It can decode 4 x86 instructions per cycle, and sometimes 5 thanks to x86 fusion. AMD’s Hammer can do only 3.”

Out of Order Execution
- Core's 96-entry reorder buffer (ROB) is, thanks to macro-op fusion, bigger than the 72-entry macro-op buffer of the K8
- Core uses a central reservation station, while the Athlon uses distributed schedulers
  - A central reservation station has better utilization, while distributed schedulers allow more entries
- Both do 1 branch prediction per cycle
- Core outperforms K8 on 128-bit SSE2/3 processing due to its 3 units
  - K8 decodes 128-bit SSE instructions into two separate 64-bit instructions: Core does this twice as fast
  - Core can do 4 double-precision (64-bit) FP calculations per cycle, while the Athlon 64 can do just 3
- K8 has a small advantage as it has 3 AGUs compared to Core's 2
  - However, the deeper, more flexible out-of-order buffers and the bigger, faster L2 cache of the Core should negate this small advantage in most integer workloads
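As a rough worked example of the floating-point figures above, peak double-precision throughput is simply operations per cycle times clock frequency. The per-cycle figures come from the slide; the 2.4 GHz clock and the function name peak_dp_gflops are hypothetical, chosen only to make the arithmetic concrete.

```python
CORE_DP_FLOPS_PER_CYCLE = 4  # figure quoted on the slide
K8_DP_FLOPS_PER_CYCLE = 3    # figure quoted on the slide


def peak_dp_gflops(flops_per_cycle, clock_ghz):
    """Theoretical peak only: ignores memory stalls, decode limits and so on."""
    return flops_per_cycle * clock_ghz


# Hypothetical equal clock, so the comparison is clock-for-clock.
clock_ghz = 2.4
core_peak = peak_dp_gflops(CORE_DP_FLOPS_PER_CYCLE, clock_ghz)
k8_peak = peak_dp_gflops(K8_DP_FLOPS_PER_CYCLE, clock_ghz)
peak_ratio = core_peak / k8_peak

print(f"Core peak: {core_peak:.1f} GFLOPS, K8 peak: {k8_peak:.1f} GFLOPS")
print(f"Core / K8 peak ratio: {peak_ratio:.2f}")
```

At equal clocks the ratio is 4/3 regardless of the frequency chosen; sustained throughput on real code will be lower for both designs.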

A Tale of Two Cores…

Better Out of Order Execution…
- The K8 Athlon 64 can only move loads before independent ALU operations (ADD etc.)
- Loads cannot be moved ahead much at all to minimize the effect of a cache miss, and other loads cannot be used to keep the CPU busy if a load has to wait for a store to finish
- The K8 has some load/store reordering, but it happens much later in the pipeline and is less flexible than in the Core architecture
Vs.
- Core's approach to determining whether a load and a store share the same address is called memory disambiguation
- Core therefore permits loads to move ahead of stores, giving a big performance boost
- Intel claims up to a 40% performance boost in some instances; however, a 10-20% increase in performance is possible with the fast L1 and L2 caches
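The idea of letting loads pass stores can be sketched as a small predictor. This is an illustrative toy, not Intel's implementation: the per-load saturating counter, the threshold of 3, and the names AliasPredictor and run_load are assumptions. A load is hoisted ahead of older stores only once it has a history of not aliasing with them; a wrong guess costs a replay.

```python
class AliasPredictor:
    """Per-load saturating counter; speculate only after conflict-free runs."""

    def __init__(self, threshold=3):  # threshold is an arbitrary assumption
        self.threshold = threshold
        self.counters = {}            # load PC -> confidence counter

    def may_speculate(self, load_pc):
        return self.counters.get(load_pc, 0) >= self.threshold

    def update(self, load_pc, aliased):
        if aliased:
            self.counters[load_pc] = 0  # lose all confidence on a conflict
        else:
            current = self.counters.get(load_pc, 0)
            self.counters[load_pc] = min(self.threshold, current + 1)


def run_load(predictor, load_pc, load_addr, older_store_addrs):
    """Classify one dynamic load that has older, not yet executed stores.

    Returns "waited" (conservative order kept), "hoisted" (moved ahead of
    the stores, correctly) or "replayed" (moved ahead, but a store wrote
    the same address, so the load must be redone).
    """
    aliased = load_addr in older_store_addrs
    if predictor.may_speculate(load_pc):
        outcome = "replayed" if aliased else "hoisted"
    else:
        outcome = "waited"
    predictor.update(load_pc, aliased)
    return outcome


if __name__ == "__main__":
    predictor = AliasPredictor()
    # The same load (hypothetical PC 0x40) runs in a loop; older stores go
    # to a different address until the last iteration.
    for stores in ([200], [200], [200], [200], [100]):
        print(run_load(predictor, 0x40, 100, stores))
```

In this model a load that never conflicts is hoisted every time after a short warm-up, which is the case where reordering pays off, while a single conflict sends it back to waiting.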

HyperThreading and Integrated Memory Controller
- There is no Simultaneous Multithreading (SMT), or HyperThreading, in the Core architecture
  - SMT can offer up to a 40% performance boost in server applications
  - However, TLP is being addressed by increasing the number of cores on-die: e.g. the 65 nm Tigerton is two Woodcrests in one package, giving 4 cores
- The integrated memory controller (IMC) was not adopted, as the transistors were better spent on the 4 MB shared cache

Conclusions
- Compared to the AMD K8, Intel's Core is simply a wider, more efficient and more out-of-order CPU
- Memory disambiguation enables increases in ILP and, together with the massive bandwidth of the L1 and L2 caches, delivers close to 33% more performance, clock-for-clock
- AMD could enhance its SSE/SIMD power by increasing the width of each execution unit, or by simply implementing more of them in the out-of-order FP pipeline, and could improve the bandwidth of the two caches further
- If AMD adopts a more flexible approach to reordering of loads, even without memory disambiguation, a 5% increase in IPC is possible
- Core may provide a couple of free lunch vouchers to programmers building single-threaded applications…for now!