Intel Core2 GHz Q6700 L2 Cache 8 Mbytes (4MB per pair) L1 Cache: (128 KB Instruction +128KB Data at the core level???) L3 Cache: None? CPU.

Presentation transcript:

Intel Core2 Q6700, 2.66 GHz
L2 Cache: 8 MB (4 MB per pair of cores)
L1 Cache: 128 KB instruction + 128 KB data in total (32 KB + 32 KB per core)
L3 Cache: none
CPU Frequency: 2.66 GHz
Bus Speed: GHz (FSB = Front Side Bus) (Multiplier = 10?)
Code Name: Kentsfield (Xeon or not? Not on my machine?)
Sam Williams diagram: Clovertowns are marketed as Xeons

One thread (8.87/2.66 = 3.33 flops/cycle)
>> maxNumCompThreads(1);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
(8.48/8.87 = 0.95)

Two threads (17.11/2.66 = 6.43 flops/cycle)
>> maxNumCompThreads(2);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=5000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
( / = 0.86)

Four threads (29.56/2.66 = 11.1 flops/cycle)
>> maxNumCompThreads(4);
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=1000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=2000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=4000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
>> n=6000; a=randn(n); tic, a*a; t=toc; (2*n^3)/(t*1e9)
ans =
/29.11 = 0.82

Summary
Threads = 1/2/4
Maximum Gflops: 8.87/17.11/29.11
Maximum flops/cycle: 3.33/6.43/11.1
Maximum flops/cycle/thread: 3.33/3.21/2.78
Minimum (n=1000) / Maximum (n=5000 or 6000): 0.95/0.86/0.82
All indicative of an ability to do 2 mults and 2 adds (4 double precision flops) per core per cycle, but not enough memory bandwidth to keep the processors going at full capacity.
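The summary arithmetic can be reproduced with a short sketch. It assumes the per-slide maxima of 8.87, 17.11 and 29.56 Gflop/s (the summary line lists 29.11 for four threads, but 29.56 is the value that yields 11.1 flops/cycle) and assumes a double precision peak of 4 flops per core per cycle. The function and variable names are illustrative only.

```python
CLOCK_GHZ = 2.66              # CPU frequency from the spec slide
PEAK_FLOPS_PER_CYCLE = 4.0    # assumption: double precision peak per core


def flops_per_cycle(gflops, clock_ghz=CLOCK_GHZ):
    """Gflop/s divided by GHz gives flops per clock cycle."""
    return gflops / clock_ghz


def summarize(max_gflops_by_threads):
    """Return per-thread-count rates derived from the measured maxima."""
    rows = {}
    for threads, gflops in max_gflops_by_threads.items():
        per_cycle = flops_per_cycle(gflops)
        per_thread = per_cycle / threads
        rows[threads] = {
            "flops_per_cycle": per_cycle,
            "flops_per_cycle_per_thread": per_thread,
            "fraction_of_peak": per_thread / PEAK_FLOPS_PER_CYCLE,
        }
    return rows


# Maxima quoted in the slide headers (threads -> Gflop/s).
measured = {1: 8.87, 2: 17.11, 4: 29.56}
summary = summarize(measured)

for threads, row in summary.items():
    print(
        f"{threads} thread(s): "
        f"{row['flops_per_cycle']:.2f} flops/cycle, "
        f"{row['flops_per_cycle_per_thread']:.2f} per thread, "
        f"{row['fraction_of_peak']:.0%} of assumed peak"
    )
```

The per-thread rate falling as threads are added is the pattern the slide attributes to limited memory bandwidth.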

Matrix Add
>> n=5000; a=randn(n,n); tic, c=a+0; t=toc; (2.66*1e9*t)/(2*n^2)
ans =
>> maxNumCompThreads(4);
>> n=5000; a=randn(n,n); tic, c=a+0; t=toc; (2.66*1e9*t)/(2*n^2)
ans =
Conclusion: Takes about 12 cycles per read and write independent of operations, i.e. in one cycle we have (1/12) of 8 bytes moving. In one second we have (2.66*1e9)*(1/12)*8 bytes ≈ 1.77 GB/second (seems slow!)
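As a check on the conversion from cycles per access to bandwidth, here is a minimal sketch. It assumes, as the MATLAB expression does, that `c = a + 0` reads n*n doubles and writes n*n doubles; the elapsed time used below is hypothetical, back-computed so that it corresponds to exactly 12 cycles per access.

```python
CLOCK_HZ = 2.66e9      # CPU frequency from the spec slide
BYTES_PER_DOUBLE = 8


def cycles_per_access(seconds, n, clock_hz=CLOCK_HZ):
    """Cycles spent per read or write for c = a + 0 on an n-by-n matrix."""
    accesses = 2 * n * n          # n*n reads plus n*n writes
    return clock_hz * seconds / accesses


def bandwidth_bytes_per_second(cycles, clock_hz=CLOCK_HZ):
    """One 8-byte word moves every `cycles` clock cycles."""
    return clock_hz / cycles * BYTES_PER_DOUBLE


n = 5000
# Hypothetical elapsed time chosen to match the slide's "about 12 cycles".
t_hypothetical = 12 * 2 * n * n / CLOCK_HZ

cycles = cycles_per_access(t_hypothetical, n)
bandwidth = bandwidth_bytes_per_second(cycles)
print(f"{cycles:.1f} cycles per access, {bandwidth / 1e9:.2f} GB/s")
```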

One can try a model
Cycles = (reads/writes)*12 + (flops)/(4*p*efficiency)
But good luck! (Not sure if this accounts for all that is going on, and maybe one shouldn't decouple the memory starvation from the efficiency. You can see what you can do if you like. I'm disappointed this is so non-predictive.)
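The model can be written as a small function for experimenting. This is only a sketch: it assumes p is the number of threads, that 4 is the peak flops per core per cycle, and that 12 is the measured cycles per read or write from the matrix-add slide. The access count of 3*n^2 for a matrix product (each matrix touched once) and the efficiency of 0.83 are illustrative guesses, not values from the slides.

```python
CLOCK_HZ = 2.66e9


def predicted_cycles(reads_writes, flops, p, efficiency,
                     cycles_per_access=12.0, peak_flops_per_cycle=4.0):
    """Cycles = (reads/writes)*12 + flops/(4*p*efficiency)."""
    memory_cycles = reads_writes * cycles_per_access
    compute_cycles = flops / (peak_flops_per_cycle * p * efficiency)
    return memory_cycles + compute_cycles


def predicted_gflops(n, p, efficiency):
    """Predicted rate for an n-by-n matrix product under the model."""
    flops = 2 * n ** 3
    reads_writes = 3 * n * n      # assumption: a, a and the result touched once
    seconds = predicted_cycles(reads_writes, flops, p, efficiency) / CLOCK_HZ
    return flops / (seconds * 1e9)


for p in (1, 2, 4):
    print(f"p={p}: {predicted_gflops(1000, p, 0.83):.2f} Gflop/s predicted")
```

Because the memory term does not shrink with p, the model predicts diminishing returns from extra threads, though, as the slide says, it may not account for everything that is going on.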

/attachments/ /a38- mattson.pdf
As a second point of comparison, consider Intel® Core™ 2 Quad processor CPU running at 2.66 GHz with a thermal design power of 95W (model number Q6700) [Intel2008]. This CPU was manufactured using the same 65 nm process technology as was used for the 80-core Terascale processor. A Core™ 2 core includes two 128 bit wide SIMD FPUs that support the SSE3 instructions, each of which can retire up to 4 single precision floating point operations per cycle. Hence, the peak performance of this quad core CPU is:
4 core * 8 flop/core * 2.66 GHz = 85 single precision GFLOPS
This translates to 0.9 GFLOP/Watt, making the 80-core Terascale processors (19.4 GFLOP/W at TFLOP) over 20 times more power efficient than a more traditional “big core” multicore CPU.
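The peak-rate and power-efficiency arithmetic in the quoted passage can be checked directly. All inputs below come from the quote itself (4 cores, 8 single precision flops per core per cycle, 2.66 GHz, 95 W, 19.4 GFLOP/W for the Terascale chip); only the variable names are made up.

```python
cores = 4
flops_per_core_per_cycle = 8      # two 128-bit units x 4 single precision lanes
clock_ghz = 2.66
tdp_watts = 95
terascale_gflop_per_watt = 19.4   # figure quoted for the 80-core chip

peak_gflops = cores * flops_per_core_per_cycle * clock_ghz
gflop_per_watt = peak_gflops / tdp_watts
efficiency_ratio = terascale_gflop_per_watt / gflop_per_watt

print(f"peak: {peak_gflops:.2f} single precision GFLOPS")
print(f"efficiency: {gflop_per_watt:.2f} GFLOP/W")
print(f"Terascale advantage: {efficiency_ratio:.1f}x")
```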

Wikipedia: The Kentsfields comprise two separate silicon dies (each equivalent to a single Core 2 Duo) on one MCM. [30] This results in lower costs but a lesser share of the bandwidth from each of the CPUs to the northbridge than if the dies were each to sit in separate sockets, as is the case for example with the AMD Quad FX platform.

Wikipedia: The multiple cores of the Kentsfield most benefit applications that can easily be broken into a small number of parallel threads (such as audio and video transcoding, data compression, video editing, 3D rendering and ray-tracing). To take a specific example, multi-threaded games such as Crysis and Gears of War which must perform multiple simultaneous tasks such as AI, audio and physics benefit from the quad-core CPUs. [35] In such cases, the processing performance may increase relative to that of a single-CPU system by a factor approaching the number of CPUs. This should, however, be considered an upper limit as it presupposes the user-level software is well-threaded. To return to the above example, some tests have demonstrated that Crysis fails to take advantage of more than two cores at any given time. [36] On the other hand, the impact of this issue on broader system performance can be significantly reduced on systems which frequently handle numerous unrelated simultaneous tasks such as multi-user environments or desktops which execute background processes while the user is active. There is still, however, some overhead involved in coordinating execution of multiple processes or threads and scheduling them on multiple CPUs which scales with the number of threads/CPUs. Finally, on the hardware level there exists the possibility of bottlenecks arising from the sharing of memory and/or I/O bandwidth between processors.
I read this as: you might hopefully get 4-fold speedups, but some people say you might only get 2, and it all depends, and nobody really seems to know for sure.

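Theoretical Memory Bandwidth
(Clock Frequency) * (Data Path Width) * (Transfers per clock cycle)
(1.066 GHz) * (8 bytes?????) * (4)???
The 4 might be two transfers during clock rise and two during clock fall (“quad-pumped”?). This would be about 34 GB/sec. However, if 1.066 GHz is already the quad-pumped transfer rate (a 266 MHz bus clock times 4, which matches 2.66 GHz with Multiplier = 10), the factor of 4 should not be applied a second time and the figure is 1.066 GHz * 8 bytes ≈ 8.5 GB/sec.
Sam Williams says 10.6 or 21.3 GB/sec on Clovertown. I see 1.7??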
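The bandwidth formula is easy to evaluate under both readings of the 1.066 GHz figure. This sketch assumes an 8-byte data path; whether 1.066 GHz is the bus clock or the already quad-pumped transfer rate is exactly the open question on the slide, so both cases are computed. The measured value reuses the 12 cycles per access from the matrix-add slide.

```python
def theoretical_bandwidth(clock_hz, width_bytes, transfers_per_clock):
    """(Clock Frequency) * (Data Path Width) * (Transfers per clock cycle)."""
    return clock_hz * width_bytes * transfers_per_clock


# Reading 1: the formula exactly as written on the slide.
as_written = theoretical_bandwidth(1.066e9, 8, 4)

# Reading 2 (assumption): 1.066 GHz already includes the 4 transfers per clock.
already_quad_pumped = theoretical_bandwidth(1.066e9, 8, 1)

# Measured: one 8-byte word every 12 cycles at 2.66 GHz.
measured = 2.66e9 / 12 * 8

for label, value in [("as written", as_written),
                     ("already quad-pumped", already_quad_pumped),
                     ("measured", measured)]:
    print(f"{label}: {value / 1e9:.2f} GB/s")
print(f"measured / already quad-pumped: {measured / already_quad_pumped:.2f}")
```

Under either reading the measured rate is well below the theoretical one, which is the puzzle the slide ends on.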

SSE: Streaming SIMD Extensions
Cores have 128-bit registers (eight of them in 32-bit mode, sixteen in 64-bit mode) that allow four single precision or two double precision operations per instruction.
See especially packed add (ADDPS) and packed multiply (MULPS).
See: /strmsimd/simd.htm
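To make the lane counts concrete, here is a small emulation of what a packed instruction does. It is not SSE code: the `addps` and `mulps` functions below are hypothetical Python stand-ins that only mimic the elementwise, four-lane, single precision behaviour of the ADDPS and MULPS instructions.

```python
import struct

REGISTER_BITS = 128


def lanes(element_bits):
    """Number of elements that fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def to_f32(x):
    """Round a Python float to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def addps(a, b):
    """Emulated packed add: four single precision sums at once."""
    assert len(a) == len(b) == lanes(32)
    return [to_f32(to_f32(x) + to_f32(y)) for x, y in zip(a, b)]


def mulps(a, b):
    """Emulated packed multiply: four single precision products at once."""
    assert len(a) == len(b) == lanes(32)
    return [to_f32(to_f32(x) * to_f32(y)) for x, y in zip(a, b)]


print(lanes(32), "single precision lanes,", lanes(64), "double precision lanes")
print(addps([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
print(mulps([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 2.0]))
```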