
1 Computer Organization CS224 Fall 2012 Lesson 52

2 Introduction  Goal: connecting multiple computers to get higher performance l Multiprocessors l Scalability, availability, power efficiency  Job-level (process-level) parallelism l High throughput for independent jobs  Parallel processing program l Single program run on multiple processors  Multicore microprocessors l Chips with multiple processors (cores) §9.1 Introduction

3 Types of Parallelism  Data-Level Parallelism (DLP)  Thread-Level Parallelism (TLP)  Instruction-Level Parallelism (ILP): pipelining (slide figure: each form of parallelism illustrated along a time axis)

4 Hardware and Software  Hardware l Serial: e.g., Pentium 4 l Parallel: e.g., quad-core Xeon e5345  Software l Sequential: e.g., matrix multiplication l Concurrent: e.g., operating system  Sequential/concurrent software can run on serial/parallel hardware l Challenge: making effective use of parallel hardware

5 What We’ve Already Covered  §2.11: Parallelism and Instructions l Synchronization  §3.6: Parallelism and Computer Arithmetic l Associativity  §4.10: Parallelism and Advanced Instruction-Level Parallelism  §5.8: Parallelism and Memory Hierarchies l Cache Coherence (actually, we skipped this)  §6.9: Parallelism and I/O: l Redundant Arrays of Inexpensive Disks

6 Parallel Programming  Parallel software is the problem  Need to get significant performance improvement l Otherwise, just use a faster uniprocessor, since it’s easier!  Difficulties l Partitioning l Coordination l Communications overhead §7.2 The Difficulty of Creating Parallel Processing Programs

7 Amdahl’s Law  Sequential part can limit speedup  Example: 100 processors, 90× speedup? l T_new = T_parallelizable/100 + T_sequential l Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90 l Solving: F_parallelizable = 0.999  Need sequential part to be 0.1% of original time (99.9% needs to be parallelizable)  Obviously, less-than-expected speedups are common!
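
To make the 90× example concrete, here is a minimal C sketch (mine, not from the slides; the function name amdahl_speedup is illustrative) that evaluates Amdahl's Law for a given parallelizable fraction and processor count:

    #include <stdio.h>

    /* Amdahl's Law: speedup = 1 / ((1 - f) + f/p), where f is the fraction
       of original execution time that can be parallelized and p is the
       number of processors. */
    static double amdahl_speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    int main(void) {
        /* The slide's target: 90x speedup on 100 processors. */
        printf("f = 0.999 -> speedup = %.1f\n", amdahl_speedup(0.999, 100)); /* about 91  */
        printf("f = 0.990 -> speedup = %.1f\n", amdahl_speedup(0.990, 100)); /* about 50  */
        printf("f = 0.900 -> speedup = %.1f\n", amdahl_speedup(0.900, 100)); /* about 9.2 */
        return 0;
    }

With f = 0.999 the model reaches roughly the 90× target; dropping to 99% parallelizable already caps the speedup near 50×, no matter how many processors are added.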

8 Scaling Example  Workload: sum of 10 scalars, and 10 × 10 matrix sum l Speed up from 10 to 100 processors  Single processor: Time = (10 + 100) × t_add  10 processors l Time = 10 × t_add + 100/10 × t_add = 20 × t_add l Speedup = 110/20 = 5.5 (55% of potential)  100 processors l Time = 10 × t_add + 100/100 × t_add = 11 × t_add l Speedup = 110/11 = 10 (10% of potential)  Assumes load can be balanced across processors

9 Scaling Example (cont)  What if matrix size is 100 × 100?  Single processor: Time = (10 + 10000) × t_add  10 processors l Time = 10 × t_add + 10000/10 × t_add = 1010 × t_add l Speedup = 10010/1010 = 9.9 (99% of potential)  100 processors l Time = 10 × t_add + 10000/100 × t_add = 110 × t_add l Speedup = 10010/110 = 91 (91% of potential)  Assuming load is balanced

10 Strong vs Weak Scaling  Strong scaling: problem size fixed l As in previous example  Weak scaling: problem size proportional to number of processors l 10 processors, 10 × 10 matrix - Time = 20 × t_add l 100 processors, 32 × 32 matrix (≈1000 elements) - Time = 10 × t_add + 1000/100 × t_add = 20 × t_add l Constant performance in this example
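
The arithmetic on slides 8-10 follows one pattern: 10 sequential scalar additions plus an n-element matrix sum divided across p processors. A small C sketch (illustrative, not from the slides) reproduces those numbers, assuming the load balances perfectly:

    #include <stdio.h>

    /* Time model from the slides, in units of t_add: the 10 scalar additions
       run sequentially, and the n-element matrix sum is divided evenly across
       p processors. */
    static double time_units(int n, int p) {
        return 10.0 + (double)n / p;
    }

    static double speedup(int n, int p) {
        return time_units(n, 1) / time_units(n, p);
    }

    int main(void) {
        /* Strong scaling: matrix size fixed, processor count grows. */
        printf("10x10 matrix,   10 procs : speedup %.1f\n", speedup(100, 10));    /* 5.5  */
        printf("10x10 matrix,   100 procs: speedup %.1f\n", speedup(100, 100));   /* 10.0 */
        printf("100x100 matrix, 10 procs : speedup %.1f\n", speedup(10000, 10));  /* 9.9  */
        printf("100x100 matrix, 100 procs: speedup %.1f\n", speedup(10000, 100)); /* 91.0 */
        /* Weak scaling: matrix grows with the processor count; time stays constant. */
        printf("weak, 10 procs : %.0f x t_add\n", time_units(100, 10));    /* 20 */
        printf("weak, 100 procs: %.0f x t_add\n", time_units(1000, 100));  /* 20 */
        return 0;
    }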

11 Shared Memory  SMP: shared memory multiprocessor l Hardware provides single physical address space for all processors l Synchronize shared variables using locks l Memory access time -UMA (uniform) vs. NUMA (nonuniform) §7.3 Shared Memory Multiprocessors
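
As a concrete illustration of "synchronize shared variables using locks", here is a minimal sketch, not from the slides and assuming POSIX threads, in which a mutex protects a shared counter:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                                  /* shared variable */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* guards counter */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* acquire the lock before touching shared data */
            counter++;
            pthread_mutex_unlock(&lock);  /* release it so other threads can proceed */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* 400000 with locking; unpredictable without it */
        return 0;
    }

Compile with -pthread. The code is equally correct on UMA and NUMA machines; the UMA/NUMA distinction affects how fast the lock and the shared data can be reached, not whether the program works.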

12 Example: Sum Reduction  Sum 100,000 numbers on a 100-processor UMA machine l Each processor has ID: 0 ≤ Pn ≤ 99 l Partition: 1000 numbers per processor l Initial summation on each processor:
    sum[Pn] = 0;
    for (i = 1000*Pn; i < 1000*(Pn+1); i = i + 1)
      sum[Pn] = sum[Pn] + A[i];
 Now need to add these partial sums l Reduction: divide and conquer l Half the processors add pairs, then a quarter, … l Need to synchronize between reduction steps

13 Example: Sum Reduction
    half = 100;
    repeat
      synch();                        /* barrier: wait for all processors to finish the previous step */
      if (half%2 != 0 && Pn == 0)
        sum[0] = sum[0] + sum[half-1];
        /* Conditional sum needed when half is odd;
           Processor0 gets missing element */
      half = half/2;                  /* dividing line on who sums */
      if (Pn < half)
        sum[Pn] = sum[Pn] + sum[Pn+half];
    until (half == 1);
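
Below is a runnable approximation of the reduction on slides 12 and 13. It is a sketch under assumptions the slides do not make: synch() is realized with a POSIX-threads barrier, 8 threads stand in for the 100 processors, and A is filled with 1.0 so the result is easy to check.

    #include <pthread.h>
    #include <stdio.h>

    #define P 8            /* "processors" (threads); the slides use 100 */
    #define N 100000       /* numbers to sum; divisible by P here */

    static double A[N];
    static double sum[P];              /* per-processor partial sums */
    static pthread_barrier_t barrier;  /* plays the role of synch() */

    static void *reduce(void *arg) {
        int Pn = (int)(long)arg;

        /* Initial summation: each processor sums its own 1/P of A. */
        sum[Pn] = 0.0;
        for (int i = (N / P) * Pn; i < (N / P) * (Pn + 1); i++)
            sum[Pn] += A[i];

        /* Tree reduction, following the slide's pseudocode. */
        int half = P;
        do {
            pthread_barrier_wait(&barrier);      /* synch() */
            if (half % 2 != 0 && Pn == 0)
                sum[0] += sum[half - 1];         /* odd half: P0 picks up the leftover element */
            half = half / 2;                     /* dividing line on who sums */
            if (Pn < half)
                sum[Pn] += sum[Pn + half];
        } while (half > 1);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) A[i] = 1.0;  /* known answer: N */
        pthread_barrier_init(&barrier, NULL, P);

        pthread_t t[P];
        for (long p = 0; p < P; p++) pthread_create(&t[p], NULL, reduce, (void *)p);
        for (int p = 0; p < P; p++) pthread_join(t[p], NULL);

        printf("total = %.0f\n", sum[0]);        /* prints 100000 */
        pthread_barrier_destroy(&barrier);
        return 0;
    }

Compile with -pthread. Because 8 is a power of two, the odd-half fix-up never fires in this run, but it is kept to match the slide's pseudocode for processor counts like 100.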

