
1 Burroughs B5500 multiprocessor. These machines were designed to support HLLs, such as Algol. They used a stack architecture, but part of the stack was also addressable as registers.

2 COMP 740: Computer Architecture and Implementation
Montek Singh
Thu, April 2, 2009
Topic: Multiprocessors I

3 Uniprocessor Performance (SPECint)
[Figure: SPECint uniprocessor performance growth over time; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006]
- VAX: 25%/year, 1978 to 1986
- RISC + x86: 52%/year, 1986 to 2002
- RISC + x86: ??%/year, 2002 to present (the chart's "3X" annotation marks the gap from the earlier 52%/year trend)

4 Déjà vu all over again?
- "… today's processors … are nearing an impasse as technologies approach the speed of light …" David Mitchell, The Transputer: The Time Is Now (1989)
  - Transputer had bad timing (uniprocessor performance kept climbing)
  - Procrastination rewarded: 2X sequential perf. / 1.5 years
- "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing." Paul Otellini, President, Intel (2005)
  - All microprocessor companies switch to MP (2X CPUs / 2 yrs)
  - Procrastination penalized: 2X sequential perf. / 5 yrs

Manufacturer/Year   AMD/'05   Intel/'06   IBM/'04   Sun/'05
Processors/chip         2         2          2         8
Threads/Processor       1         2          2         4
Threads/chip            2         4          4        32

5 Other Factors → Multiprocessors
- Growth in data-intensive applications: databases, file servers, …
- Growing interest in servers and server performance
- Increasing desktop performance is less important (outside of graphics)
- Improved understanding of how to use multiprocessors effectively, especially in servers, where there is significant natural TLP
- Advantage of leveraging design investment by replication rather than unique design

6 Flynn's Taxonomy
- Flynn classified machines by data and control streams in 1966
  - SIMD → Data-Level Parallelism
  - MIMD → Thread-Level Parallelism
- MIMD popular because:
  - Flexible: N programs, or 1 multithreaded program
  - Cost-effective: same MPU in desktop and MIMD machine
- The four categories:
  - Single Instruction, Single Data (SISD): uniprocessor
  - Single Instruction, Multiple Data (SIMD): single PC; vector machines, CM-2
  - Multiple Instruction, Single Data (MISD): no commercial examples
  - Multiple Instruction, Multiple Data (MIMD): clusters, SMP servers
- M.J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.

7 Back to Basics
- Parallel Architecture = Computer Architecture + Communication Architecture
- 2 classes of multiprocessors WRT memory:
  1. Centralized-Memory Multiprocessor
     - At most a few dozen processor chips (and < 100 cores) in 2006
     - Small enough to share a single, centralized memory
  2. Physically Distributed-Memory Multiprocessor
     - Larger number of chips and cores than class 1
     - BW demands → memory distributed among processors

8 Centralized vs. Distributed Memory
[Figure: block diagrams contrasting the centralized (shared) memory organization with the distributed memory organization]

9 Centralized Memory Multiprocessor
- Also called symmetric multiprocessors (SMPs), because the single main memory has a symmetric relationship to all processors
- Large caches and a single memory can satisfy the memory demands of a small number of processors
  - Can scale to a few dozen processors by using a switch instead of a bus, and many memory banks
  - Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases

10 Distributed Memory Multiprocessor
- Pros:
  - Cost-effective way to scale memory bandwidth, if most accesses are to local memory
  - Reduces latency of local memory accesses
- Cons:
  - Communicating data between processors is more complex
  - Must change software to take advantage of the increased memory BW

11 2 Models for Communication and Memory Architecture
1. Communication occurs explicitly, by passing messages among the processors: message-passing multiprocessors
2. Communication occurs implicitly, through a shared address space (via loads and stores): shared-memory multiprocessors, either:
   - UMA (Uniform Memory Access time): shared address space with centralized memory
   - NUMA (Non-Uniform Memory Access time): shared address space with distributed memory
- Note: in the past there was confusion over whether "sharing" means sharing physical memory (symmetric MP) or sharing the address space
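To make the distinction concrete, here is a minimal sketch (my own illustration, not from the slides) using Python's standard multiprocessing module: a Queue plays the role of an explicit message channel, while a shared Value plays the role of a location in a shared address space. Real multiprocessors implement these models in hardware; this only shows the programming-model difference.

```python
# Sketch: message passing vs. shared memory, via Python's multiprocessing.
from multiprocessing import Process, Queue, Value

def mp_worker(q):
    q.put(42)                      # communication is EXPLICIT: send a message

def sm_worker(v):
    with v.get_lock():
        v.value = 42               # communication is IMPLICIT: an ordinary store

if __name__ == "__main__":
    # Message passing: the value travels through an explicit channel.
    q = Queue()
    p = Process(target=mp_worker, args=(q,))
    p.start()
    print("message received:", q.get())   # explicit receive
    p.join()

    # Shared memory: both processes address the same location.
    v = Value("i", 0)
    p = Process(target=sm_worker, args=(v,))
    p.start()
    p.join()
    print("shared value read:", v.value)  # an ordinary load observes the store
```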

12 Challenges of Parallel Processing
- First challenge is Amdahl's Law: what % of the program is inherently sequential?
- Suppose we want an 80X speedup from 100 processors. What fraction of the original program can be sequential?
  a. 10%
  b. 5%
  c. 1%
  d. <1%

13 Amdahl's Law Answers
- Speedup = 1 / ((1 − Fraction_parallel) + Fraction_parallel/100) = 80
- Solving: 80 − 80 × Fraction_parallel + 0.8 × Fraction_parallel = 1, so 79.2 × Fraction_parallel = 79 and Fraction_parallel = 79/79.2 ≈ 99.75%
- At most about 0.25% of the original program can be sequential: answer d
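A quick check of this arithmetic (a sketch added here for convenience; the variable names are mine):

```python
# Check the Amdahl's Law answer: speedup = 1 / ((1 - f) + f/p), solved for f.
target_speedup = 80.0
processors = 100.0

f = (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / processors)
print(f"parallel fraction   = {f:.4%}")        # ~99.75%
print(f"sequential fraction = {1 - f:.4%}")    # ~0.25%  -> answer (d), <1%
```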

14 Challenges of Parallel Processing
- Second challenge: long latency to remote memory
- Suppose a 32-CPU MP, 2 GHz clock, 200 ns remote memory access, all local accesses hit in the memory hierarchy, and a base CPI of 0.5. (Remote access = 200 ns / 0.5 ns per cycle = 400 clock cycles.)
- What is the performance impact if 0.2% of instructions involve a remote access?
  a. 1.5X
  b. 2.0X
  c. 2.5X

15 CPI Equation
- CPI = Base CPI + Remote request rate × Remote request cost
- CPI = 0.5 + 0.2% × 400 = 0.5 + 0.8 = 1.3
- The machine with no remote communication is 1.3/0.5 = 2.6 times faster than the one where 0.2% of instructions involve a remote access, so answer c (2.5X) is closest
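The same calculation as a small script (my own sketch of the slide's numbers):

```python
# Check of the CPI calculation above.
clock_ghz   = 2.0
base_cpi    = 0.5
remote_ns   = 200.0
remote_rate = 0.002                      # 0.2% of instructions

cycle_ns      = 1.0 / clock_ghz          # 0.5 ns per cycle
remote_cycles = remote_ns / cycle_ns     # 400 cycles per remote access

cpi = base_cpi + remote_rate * remote_cycles
print(f"effective CPI = {cpi}")                      # 1.3
print(f"slowdown vs. all-local = {cpi/base_cpi}X")   # 2.6X
```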

16 Challenges of Parallel Processing
1. Application parallelism: addressed primarily via new algorithms that have better parallel performance
2. Long remote latency impact: for example, reduce the frequency of remote accesses either by
   - Caching shared data (HW)
   - Restructuring the data layout to make more accesses local (SW); see the sketch below
- We'll look at reducing latency via caches
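As an illustration of the software option, here is a toy model (entirely my own, not from the slides): a NUMA machine is modeled as P memory partitions, and we count how often a processor touches data owned by another node under two layouts. The owner functions and counts are illustrative assumptions.

```python
# Toy model: restructuring data layout so that most accesses are local.
P = 4          # assumed number of NUMA nodes
N = 1024       # assumed array size
data = list(range(N))

def owner_round_robin(i):        # naive layout: element i lives on node i % P
    return i % P

def owner_blocked(i):            # restructured layout: contiguous block per node
    return i // (N // P)

for owner in (owner_round_robin, owner_blocked):
    remote = 0
    for node in range(P):
        lo, hi = node * (N // P), (node + 1) * (N // P)
        for i in range(lo, hi):  # node 'node' computes on its block of indices
            if owner(i) != node: # access to data owned elsewhere = remote
                remote += 1
    print(f"{owner.__name__}: {remote}/{N} remote accesses")
# round-robin layout: 768/1024 remote; blocked layout: 0/1024 remote
```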

17 T1 ("Niagara")
- Target: commercial server applications
  - High thread-level parallelism (TLP): large numbers of parallel client requests
  - Low instruction-level parallelism (ILP): high cache miss rates, many unpredictable branches, frequent load-load dependencies
- Power, cooling, and space are major concerns for data centers
- Metric: Performance/Watt/Sq. Ft.
- Approach: multicore, fine-grained multithreading, simple pipeline, small L1 caches, shared L2

18 T1 Architecture
[Figure: T1 chip block diagram]
- Also ships with 6 or 4 processors

19 T1 Pipeline
- Single-issue, in-order, 6-deep pipeline: F, S, D, E, M, W
- 3 clock delays for loads & branches
- Shared units: L1, L2, TLB

20 T1 Fine-Grained Multithreading
- Each core:
  - supports four threads
  - has its own level-one caches (16 KB instruction and 8 KB data)
  - switches to a new thread on each clock cycle
- Idle threads (waiting due to a pipeline delay or cache miss) are bypassed in the scheduling
- Processor is idle only when all 4 threads are idle or stalled
  - Both loads and branches incur a 3-cycle delay that can only be hidden by other threads (see the scheduling sketch below)
- A single set of floating-point functional units is shared by all 8 cores; floating-point performance was not a focus for T1
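To show how bypassing stalled threads hides the load delay, here is a toy simulation (my own sketch, not Sun's actual scheduler; the load probability and the simplification that a load stalls its own thread for exactly 3 cycles are assumptions):

```python
# Toy model of fine-grained multithreading: round-robin issue over 4
# threads, skipping any thread stalled on a 3-cycle load delay.
import random
random.seed(1)

THREADS, CYCLES, LOAD_DELAY = 4, 1000, 3
LOAD_PROB = 0.25                 # assumed fraction of instructions that are loads
stall_until = [0] * THREADS      # cycle at which each thread becomes ready again
issued = idle = 0
last = -1

for cycle in range(CYCLES):
    # pick the next ready thread after the one that issued last cycle
    for offset in range(1, THREADS + 1):
        t = (last + offset) % THREADS
        if stall_until[t] <= cycle:            # skip (bypass) stalled threads
            issued += 1
            if random.random() < LOAD_PROB:    # a load stalls this thread
                stall_until[t] = cycle + LOAD_DELAY
            last = t
            break
    else:
        idle += 1                              # all 4 threads stalled

print(f"utilization: {issued / CYCLES:.0%}, idle cycles: {idle}")
```

With four threads to rotate through, the core almost always finds a ready thread, which is exactly the effect the slide describes.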

21 Conclusion
- Parallelism challenges: % parallelizable, long latency to remote memory
- Centralized vs. distributed memory: small MP vs. lower latency and larger BW for larger MP
- Message passing vs. shared address space: uniform vs. non-uniform memory access time
- Caches are critical
- Next:
  - Review of caching (App. C)
  - Methods to ensure cache consistency in SMPs

