1 Burroughs B5500 multiprocessor. These machines were designed to support high-level languages (HLLs) such as Algol. They used a stack architecture, but part of the stack was also addressable as registers.

2 COMP 740: Computer Architecture and Implementation
Montek Singh
Thu, April 2, 2009
Topic: Multiprocessors I

3 Uniprocessor Performance (SPECint)
[Figure: uniprocessor performance growth over time]
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

4 Déjà vu all over again?
- "… today's processors … are nearing an impasse as technologies approach the speed of light …" (David Mitchell, The Transputer: The Time Is Now, 1989)
  - The Transputer had bad timing (uniprocessor performance was still climbing)
  - Procrastination rewarded: 2X sequential perf. / 1.5 years
- "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing." (Paul Otellini, President, Intel, 2005)
  - All microprocessor companies switched to MP (2X CPUs / 2 yrs)
  - Procrastination penalized: 2X sequential perf. / 5 yrs

  Manufacturer/Year    AMD/'05  Intel/'06  IBM/'04  Sun/'05
  Processors/chip         2         2         2        8
  Threads/Processor       1         2         2        4
  Threads/chip            2         4         4       32

5 Other Factors → Multiprocessors
- Growth in data-intensive applications: databases, file servers, …
- Growing interest in servers and server performance
- Increasing desktop performance is less important (outside of graphics)
- Improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural TLP
- Advantage of leveraging design investment by replication rather than a unique design

6 Flynn's Taxonomy
- Flynn classified machines by their instruction and data streams in 1966:
  - Single Instruction, Single Data (SISD): conventional uniprocessor
  - Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
  - Multiple Instruction, Single Data (MISD): no commercial examples
  - Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers
- SIMD → data-level parallelism; MIMD → thread-level parallelism
- MIMD is popular because it is:
  - Flexible: N programs, or 1 multithreaded program
  - Cost-effective: uses the same MPU as the desktop
- M.J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
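
To make the SIMD/MIMD distinction concrete, here is a minimal C sketch (mine, not from the lecture; all names are illustrative): the plain loop exposes data-level parallelism that a vector unit or vectorizing compiler can execute in lockstep, while the pthreads version expresses thread-level parallelism as independent instruction streams.

```c
/* DLP vs. TLP: a minimal sketch, not from the lecture.
 * Compile with: cc -O2 -pthread flynn_sketch.c */
#include <pthread.h>
#include <stdio.h>

#define N 1024
#define NTHREADS 4

static float a[N], b[N], c[N];

/* Data-level parallelism (SIMD style): one operation applied across
 * many data elements; a vector unit can run the iterations in lockstep. */
static void vector_add(void) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

/* Thread-level parallelism (MIMD style): independent instruction
 * streams, each working on its own slice of the data. */
static void *worker(void *arg) {
    int id = *(int *)arg;
    int chunk = N / NTHREADS;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        c[i] = a[i] + b[i];
    return NULL;
}

int main(void) {
    vector_add();                       /* SIMD-friendly form */

    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("c[0] = %f\n", c[0]);
    return 0;
}
```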

7 Back to Basics
- Parallel Architecture = Computer Architecture + Communication Architecture
- Two classes of multiprocessors with respect to memory:
  1. Centralized-Memory Multiprocessor
     - Fewer than a few dozen processor chips (and fewer than 100 cores) in 2006
     - Small enough to share a single, centralized memory
  2. Physically Distributed-Memory Multiprocessor
     - Larger number of chips and cores than class 1
     - Bandwidth demands → memory distributed among the processors

8 Centralized vs. Distributed Memory
[Figure: block diagrams of a centralized-memory organization and a distributed-memory organization]

9 Centralized Memory Multiprocessor
- Also called symmetric multiprocessors (SMPs) because the single main memory has a symmetric relationship to all processors
- Large caches and a single memory can satisfy the memory demands of a small number of processors
- Can scale to a few dozen processors by using a switch instead of a bus, and many memory banks
- Scaling beyond that is technically conceivable, but becomes less attractive as the number of processors sharing the centralized memory increases

10 Distributed Memory Multiprocessor
- Pros:
  - Cost-effective way to scale memory bandwidth, if most accesses are to local memory
  - Reduces latency of local memory accesses
- Cons:
  - Communicating data between processors is more complex
  - Software must change to take advantage of the increased memory bandwidth

11 Two Models for Communication and Memory Architecture
1. Communication occurs explicitly, by passing messages among the processors: message-passing multiprocessors
2. Communication occurs implicitly, through a shared address space (via loads and stores): shared-memory multiprocessors, which are either:
   - UMA (Uniform Memory Access time): shared address space, centralized memory
   - NUMA (Non-Uniform Memory Access time): shared address space, distributed memory
- Note: in the past there was confusion over whether "sharing" meant sharing physical memory (symmetric MP) or sharing the address space
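
A minimal C sketch of the two models (illustrative, not from the lecture; names are mine): the shared-memory half communicates through an ordinary store and load to a shared variable, while the message-passing half sends the value explicitly, with a POSIX pipe standing in for the interconnect.

```c
/* Shared memory vs. message passing: a minimal sketch.
 * Compile with: cc -pthread comm_models.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Model 2: shared memory. Communication is implicit, via loads and
 * stores to a shared location in one address space. */
static int shared_value;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    shared_value = 42;                        /* an ordinary store */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Model 1: message passing. Communication is explicit: a send and a
 * matching receive; a pipe stands in for the interconnect here. */
static void message_passing(void) {
    int fds[2];
    if (pipe(fds) != 0) return;
    int msg = 42;
    write(fds[1], &msg, sizeof msg);          /* explicit send */
    int received = 0;
    read(fds[0], &received, sizeof received); /* explicit receive */
    printf("message passing: got %d\n", received);
    close(fds[0]);
    close(fds[1]);
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    pthread_join(t, NULL);
    printf("shared memory:   got %d\n", shared_value); /* an ordinary load */
    message_passing();
    return 0;
}
```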

12 Challenges of Parallel Processing
- First challenge is Amdahl's Law: what percentage of the program is inherently sequential?
- Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
  a. 10%
  b. 5%
  c. 1%
  d. <1%

13 Amdahl’s Law Answers
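
Working it with Amdahl's Law, where $F_{\text{seq}}$ is the sequential fraction and the remaining $1 - F_{\text{seq}}$ is sped up by the 100 processors:

$$ 80 = \frac{1}{F_{\text{seq}} + \dfrac{1 - F_{\text{seq}}}{100}} \quad\Rightarrow\quad F_{\text{seq}} + \frac{1 - F_{\text{seq}}}{100} = \frac{1}{80} = 0.0125 $$

$$ 100\,F_{\text{seq}} + (1 - F_{\text{seq}}) = 1.25 \quad\Rightarrow\quad 99\,F_{\text{seq}} = 0.25 \quad\Rightarrow\quad F_{\text{seq}} \approx 0.25\% $$

So only about 0.25% of the original program can be sequential: answer (d), less than 1%.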

14 Challenges of Parallel Processing
- Second challenge: long latency to remote memory
- Suppose a 32-CPU MP at 2 GHz with 200 ns remote memory access, where all local accesses hit in the memory hierarchy and the base CPI is 0.5. (At 2 GHz the cycle time is 0.5 ns, so a remote access costs 200/0.5 = 400 clock cycles.)
- What is the performance impact if 0.2% of instructions involve a remote access?
  a. 1.5X
  b. 2.0X
  c. 2.5X

15 CPI Equation
- CPI = Base CPI + Remote request rate × Remote request cost
- CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
- The machine with no communication is 1.3/0.5 = 2.6 times faster than one where 0.2% of instructions involve a remote access, so the answer is (c), roughly 2.5X

16 Challenges of Parallel Processing
1. Application parallelism: addressed primarily via new algorithms with better parallel performance
2. Long remote-latency impact: reduce the frequency of remote accesses, for example by
   - Caching shared data (HW)
   - Restructuring the data layout to make more accesses local (SW)
- We'll look at reducing latency via caches

17 T1 ("Niagara")
- Target: commercial server applications
  - High thread-level parallelism (TLP): large numbers of parallel client requests
  - Low instruction-level parallelism (ILP): high cache miss rates, many unpredictable branches, frequent load-load dependencies
- Power, cooling, and space are major concerns for data centers
- Metric: Performance/Watt/Sq. Ft.
- Approach: multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2

18 T1 Architecture
[Figure: T1 block diagram]
- Also ships with 6 or 4 processors

19 T1 Pipeline
- Single-issue, in-order, 6-deep pipeline: F, S, D, E, M, W
- 3-clock delays for loads and branches
- Shared units: L1, L2, TLB

20 T1 Fine-Grained Multithreading
- Each core:
  - supports four threads
  - has its own level-one caches (16 KB instruction and 8 KB data)
  - switches to a new thread on each clock cycle
- Idle threads (waiting on a pipeline delay or cache miss) are bypassed in the scheduling
- The processor is idle only when all 4 threads are idle or stalled
- Both loads and branches incur a 3-cycle delay that can only be hidden by other threads
- A single set of floating-point functional units is shared by all 8 cores; floating-point performance was not a focus for T1
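
As a toy illustration of this policy (my sketch, not from the slides; the one-issue-slot core and fixed 3-cycle delay are simplifying assumptions), the loop below issues each cycle from the next ready thread in round-robin order, bypassing stalled threads, so a load delay is hidden whenever another thread is ready:

```c
/* Toy cycle-by-cycle model of T1-style fine-grained multithreading.
 * Compile with: cc fgmt_sketch.c */
#include <stdio.h>

#define THREADS 4
#define CYCLES  8

int main(void) {
    int stall[THREADS] = {0, 3, 0, 0};  /* thread 1 starts stalled */
    int next = 0;                       /* round-robin pointer     */

    for (int cycle = 0; cycle < CYCLES; cycle++) {
        int issued = -1;
        /* Scan round-robin; bypass any thread that is still stalled. */
        for (int i = 0; i < THREADS; i++) {
            int t = (next + i) % THREADS;
            if (stall[t] == 0) { issued = t; break; }
        }
        if (issued >= 0) {
            printf("cycle %d: issue from thread %d\n", cycle, issued);
            if (cycle == 0 && issued == 0)
                stall[0] = 3;           /* pretend this was a load */
            next = (issued + 1) % THREADS;
        } else {
            printf("cycle %d: all threads stalled, core idle\n", cycle);
        }
        for (int t = 0; t < THREADS; t++)
            if (stall[t] > 0) stall[t]--;  /* delays drain each cycle */
    }
    return 0;
}
```

In the printed trace, threads 2 and 3 issue during the cycles thread 0 spends waiting on its simulated load, matching the slide's point that the 3-cycle delay can only be hidden by other threads.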

21 Conclusion
- Parallelism challenges: % parallelizable, long latency to remote memory
- Centralized vs. distributed memory: small MPs vs. lower latency and larger bandwidth for larger MPs
- Message passing vs. shared address space: uniform vs. non-uniform memory access time
- Caches are critical
- Next: review of caching (App. C), and methods to ensure cache consistency in SMPs