CPE 731 Advanced Computer Architecture Multiprocessor Introduction


CPE 731 Advanced Computer Architecture: Multiprocessor Introduction
Dr. Gheith Abandah
Adapted from the slides of Prof. David Patterson, University of California, Berkeley (CS252 S05)

Outline
MP Motivation
SISD vs. SIMD vs. MIMD
Centralized vs. Distributed Memory
Challenges to Parallel Programming
Conclusion

Uniprocessor Performance (SPECint)
[Figure: SPECint uniprocessor performance over time, from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; performance since 2002 trails a continuation of the earlier trend by roughly 3X.]
VAX: 25%/year, 1978 to 1986
RISC + x86: 52%/year, 1986 to 2002
RISC + x86: ??%/year, 2002 to present

Déjà vu all over again?
“… today’s processors … are nearing an impasse as technologies approach the speed of light…” – David Mitchell, The Transputer: The Time Is Now (1989)
The Transputer had bad timing (uniprocessor performance kept climbing), so procrastination was rewarded: 2X sequential performance every 1.5 years
“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” – Paul Otellini, President, Intel (2005)
All microprocessor companies have switched to MP (2X CPUs every 2 years), so procrastination is now penalized: 2X sequential performance every 5 years

Manufacturer/Year      AMD/’05   Intel/’06   IBM/’04   Sun/’05
Processors/chip            2          2          2         8
Threads/Processor          1          2          2         4
Threads/chip               2          4          4        32

Other Factors Favoring Multiprocessors
Growth in data-intensive applications: databases, file servers, …
Growing interest in servers and server performance
Increasing desktop performance is less important (outside of graphics)
Improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural TLP
Advantage of leveraging design investment by replication rather than a unique design

Outline
MP Motivation
SISD vs. SIMD vs. MIMD
Centralized vs. Distributed Memory
Challenges to Parallel Programming
Conclusion

Flynn’s Taxonomy
Flynn classified machines by their data and control streams in 1966: M.J. Flynn, “Very High-Speed Computing Systems,” Proc. of the IEEE, vol. 54, pp. 1901–1909, Dec. 1966.
Single Instruction, Single Data (SISD): conventional uniprocessor
Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
Multiple Instruction, Single Data (MISD): (????)
Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers
SIMD → Data-Level Parallelism
MIMD → Thread-Level Parallelism
MIMD is popular because it is
Flexible: N programs or 1 multithreaded program
Cost-effective: uses the same MPU as in desktops
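As a concrete, simplified illustration that is not part of the original slides, the C sketch below contrasts the two styles of parallelism the taxonomy maps onto modern terms: a data-parallel loop that a vectorizing compiler can turn into SIMD instructions, and an explicitly threaded, MIMD-style version of the same computation using POSIX threads. The file name, thread count, and array size are arbitrary choices for the example.

```c
/* Hedged sketch: data-level vs. thread-level parallelism in C.
 * Compile with: cc -O2 -pthread flynn_sketch.c                    */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static float a[N], b[N], c[N];

/* SIMD-style data-level parallelism: one operation applied to many
 * elements; an optimizing compiler can vectorize this loop.        */
static void vector_add(void) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* MIMD-style thread-level parallelism: independent instruction
 * streams, each working on its own chunk of the data.              */
#define NTHREADS 4
static void *worker(void *arg) {
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;
    for (long i = lo; i < hi; i++)
        c[i] = a[i] + b[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    vector_add();                       /* DLP version */

    pthread_t th[NTHREADS];             /* TLP version */
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```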

Back to Basics
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Architecture = Computer Architecture + Communication Architecture
Two classes of multiprocessors with respect to memory:
1. Centralized-memory multiprocessor: fewer than a few dozen processor chips (and fewer than 100 cores) in 2006, small enough to share a single, centralized memory
2. Physically distributed-memory multiprocessor: a larger number of chips and cores than case 1; bandwidth demands force memory to be distributed among the processors

Outline
MP Motivation
SISD vs. SIMD vs. MIMD
Centralized vs. Distributed Memory
Challenges to Parallel Programming
Conclusion

Centralized vs. Distributed Memory
[Diagrams: processors P1 … Pn, each with a cache ($), connected through an interconnection network. In the centralized-memory organization the memory sits on the far side of the network and is shared by all processors; in the distributed-memory organization a memory (Mem) is attached to each processor node. Scale grows from the centralized to the distributed organization.]

Centralized Memory Multiprocessor
Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
Large caches → a single memory can satisfy the memory demands of a small number of processors
Can scale to a few dozen processors by using a switch and many memory banks
Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

Distributed Memory Multiprocessor
Pro: cost-effective way to scale memory bandwidth, if most accesses are to local memory
Pro: reduces the latency of local memory accesses
Con: communicating data between processors is more complex
Con: software must change to take advantage of the increased memory bandwidth

2 Models for Communication and Memory Architecture
1. Communication occurs by explicitly passing messages among the processors: message-passing multicomputers
2. Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
UMA (Uniform Memory Access time) for shared-address, centralized-memory MPs
NUMA (Non-Uniform Memory Access time) for shared-address, distributed-memory MPs
In the past there was confusion over whether “sharing” means sharing physical memory (symmetric MP) or sharing an address space
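To make the two models concrete, here is a minimal shared-memory sketch in C (mine, not from the slides): one thread hands a result to another simply by storing to a shared variable, with a mutex and condition variable providing the ordering. On a message-passing multicomputer the same exchange would be an explicit send/receive pair, as indicated in the closing comment; the variable names are invented for the example.

```c
/* Shared-memory communication sketch: producer stores a value,
 * consumer loads it; ordinary loads/stores plus synchronization.
 * Compile with: cc -O2 -pthread shared_mem_sketch.c              */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int    data_valid = 0;
static double shared_result;       /* communicated via the shared address space */

static void *producer(void *arg) {
    (void)arg;
    double result = 3.14159;       /* stand-in for real work */
    pthread_mutex_lock(&lock);
    shared_result = result;        /* communication is just a store ... */
    data_valid = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&lock);
    while (!data_valid)
        pthread_cond_wait(&ready, &lock);
    double r = shared_result;      /* ... and the matching load */
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    printf("received %f\n", r);
    /* Message-passing version (different machine model, not shown here):
     * rank 0: MPI_Send(&result, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
     * rank 1: MPI_Recv(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE); */
    return 0;
}
```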

Outline
MP Motivation
SISD vs. SIMD vs. MIMD
Centralized vs. Distributed Memory
Challenges to Parallel Programming
Conclusion

Challenges of Parallel Processing
First challenge: the fraction of the program that is inherently sequential
Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
10%
5%
1%
<1%

Amdahl’s Law Answers
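The equations on this slide did not survive the transcript; the following worked derivation applies Amdahl’s Law to the question on the previous slide.

\[
\text{Speedup}_{\text{overall}} \;=\; \frac{1}{(1 - \text{Fraction}_{\text{parallel}}) + \dfrac{\text{Fraction}_{\text{parallel}}}{100}} \;=\; 80
\]
\[
\Rightarrow\; \text{Fraction}_{\text{parallel}} \;=\; \frac{1 - 1/80}{1 - 1/100} \;=\; \frac{0.9875}{0.99} \;\approx\; 0.9975
\]

So only about 0.25% of the original program can be sequential, i.e., the answer is <1%.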

Challenges of Parallel Processing
Second challenge: long latency to remote memory
Suppose a 32-CPU MP with a 2 GHz clock, 200 ns remote memory access, all local accesses hitting in the memory hierarchy, and a base CPI of 0.5. (A remote access costs 200 ns / 0.5 ns per cycle = 400 clock cycles.)
What is the performance impact if 0.2% of instructions involve a remote access?
1.5X
2.0X
2.5X

CPI Equation
CPI = Base CPI + Remote request rate × Remote request cost
CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
With no communication the machine is 1.3/0.5 = 2.6× faster than when 0.2% of instructions involve a remote access
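As a small, hedged helper (the function and file names are mine, not from the slides), the same CPI model can be evaluated for arbitrary remote-access rates:

```c
/* Effective CPI with a given rate of long-latency remote accesses:
 * CPI = base CPI + remote rate * remote cost (in cycles).          */
#include <stdio.h>

static double effective_cpi(double base_cpi, double remote_rate,
                            double remote_ns, double clock_ghz) {
    double cycle_ns = 1.0 / clock_ghz;            /* 0.5 ns at 2 GHz  */
    double remote_cycles = remote_ns / cycle_ns;  /* 400 cycles here  */
    return base_cpi + remote_rate * remote_cycles;
}

int main(void) {
    double cpi = effective_cpi(0.5, 0.002, 200.0, 2.0);
    printf("effective CPI = %.2f, slowdown = %.2fx\n", cpi, cpi / 0.5);
    /* prints: effective CPI = 1.30, slowdown = 2.60x */
    return 0;
}
```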

Challenges of Parallel Processing
Limited application parallelism → addressed primarily via new algorithms that have better parallel performance
Long remote-latency impact → addressed both by the architect and by the programmer
For example, reduce the frequency of remote accesses either by
Caching shared data (HW)
Restructuring the data layout to make more accesses local (SW); a sketch of this software approach follows
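A hedged illustration of the software side, not taken from the slides: each thread both initializes (“first touch”) and later processes its own contiguous block of the array, so that on a NUMA system that uses first-touch page placement, and with threads that stay on their home nodes, most accesses go to local memory. The names, thread count, and array size are arbitrary.

```c
/* Data-layout restructuring for locality: each thread first touches and
 * then reuses the same contiguous block, keeping its pages node-local
 * under first-touch placement.
 * Compile with: cc -O2 -pthread locality_sketch.c                       */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS 4
#define N (1 << 22)

static double *data;
static double partial[NTHREADS];

static void *work(void *arg) {
    long t = (long)arg;
    long lo = t * (N / NTHREADS), hi = lo + N / NTHREADS;

    for (long i = lo; i < hi; i++)     /* first touch: pages placed near this thread */
        data[i] = (double)i;

    double sum = 0.0;                  /* later accesses hit the same local pages */
    for (long i = lo; i < hi; i++)
        sum += data[i];
    partial[t] = sum;
    return NULL;
}

int main(void) {
    data = malloc((size_t)N * sizeof *data);
    if (!data) return 1;

    pthread_t th[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, work, (void *)t);

    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(th[t], NULL);
        total += partial[t];
    }
    printf("sum = %.0f\n", total);
    free(data);
    return 0;
}
```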

And in Conclusion
“End” of uniprocessor speedup ⇒ multiprocessors
Parallelism challenges: % parallelizable, long latency to remote memory
Centralized vs. distributed memory: small MPs vs. lower local latency and larger bandwidth for larger MPs
Message passing vs. shared address space
Uniform memory access time vs. non-uniform memory access time