1 Principles of Parallel Programming, First Edition, by Calvin Lin and Lawrence Snyder. Chapter 2: Understanding Parallel Computers. Copyright © 2009 Pearson Education, Inc., publishing as Pearson Addison-Wesley.

2 Hardware
–Hardware changes quickly
–Our designs need to be hardware independent
–One concern is stale cache lines
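
In practice, stale cache lines most often bite as false sharing: two cores repeatedly write variables that happen to sit on the same cache line, so each write invalidates the other core's cached copy. The sketch below is not from the slides; the thread count, iteration count, and the 64-byte line size are assumptions. It pads each thread's counter onto its own line so neither thread's writes make the other's copy stale.

```c
/* Minimal false-sharing sketch: each thread increments its own counter.
   The padding keeps the two counters on separate cache lines (64 bytes is
   an assumed line size; the real value is hardware dependent). */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

struct padded { long value; char pad[64 - sizeof(long)]; };
static struct padded counters[2];

static void *worker(void *arg) {
    long id = (long)arg;
    for (long i = 0; i < ITERS; i++)
        counters[id].value++;        /* touches only this thread's line */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}
```

Removing the pad field keeps the program correct but typically makes it several times slower on a multicore machine, which is exactly the hardware-dependent behavior the slide warns about.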

3 Figure 2.2 Logical organization of the AMD Dual Core Opteron. The processors address a private L2 cache; memory consistency is provided by the System Request Interface; HyperTransport technology connects to RAM and, possibly, other Opteron chips.

4 Figure 2.3

5 Figure 2.4 Sun Fire E25K. Eighteen boards are connected with crossbars for address, data, and response; each board contains four UltraSPARC IV Cu processors; the snoopy buses are shown as dashed lines.

6 Figure 2.5 Crossbar switch connecting four nodes. Notice the output and input channels; crossing wires do not connect unless a connection is shown. Each pair of nodes is directly connected by setting one of the open circles.

7 Heterogeneous chip design
–A general-purpose processor performs the hard-to-parallelize portion of the algorithm
–Attached processors perform the compute-intensive portion of the computation
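
A hypothetical sketch of that split (not from the book): the host keeps the irregular, data-dependent control logic, while the regular, data-parallel kernel is the piece that would be handed to attached processors (SPEs, a GPU, or an FPGA) through the vendor's API. Here the host simply calls the kernel in the accelerator's place.

```c
#include <stdio.h>

#define N 1024

/* Compute-intensive, regular work: the offload candidate. */
static void kernel(const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i] + 1.0f;   /* same operation on every element */
}

int main(void) {
    static float in[N], out[N];
    for (int i = 0; i < N; i++)
        in[i] = (float)i;

    /* Hard-to-parallelize portion: data-dependent decisions stay on the
       general-purpose processor. */
    int chunks = (N > 512) ? 2 : 1;
    int per = N / chunks;

    /* On a heterogeneous chip each chunk would be dispatched to an attached
       processor; this loop stands in for those dispatches. */
    for (int c = 0; c < chunks; c++)
        kernel(in + c * per, out + c * per, per);

    printf("out[10] = %.1f\n", out[10]);
    return 0;
}
```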

8 Examples
–Graphics processing units (GPUs)
–Field-programmable gate arrays (FPGAs)
–Cell processor, designed for video games
 –8 specialized cores, the synergistic processing elements (SPEs)
 –High communication bandwidth among processors
 –Does not provide coherent memory to the SPEs
–The Cell's designers chose performance over programmer convenience

9 Figure 2.6 Architecture of the Cell processor. The architecture is designed to move data: the high-speed I/O controllers have a capacity of 76.8 GB/s; each of the two channels to RAM runs at 12.8 GB/s; the EIB is theoretically capable of 204.8 GB/s.

10 Clusters
–Gigabit Ethernet
–Myrinet: lower protocol overhead and better throughput
–Quadrics: company formed in 1996; in 2003, 6 of the 10 fastest computers were based on its interconnect
–InfiniBand: the most common interconnect in supercomputers
–Fibre Channel: connects data storage

11 Blade servers
–A stripped-down server computer with a modular design, optimized to minimize the use of physical space and energy
–Whereas a standard rack-mount server can function with (at least) a power cord and a network cable, a blade server has many such components removed, relying on the blade enclosure to supply power, cooling, networking, and management

12 Supercomputers
–BlueGene/L
 –65,536 dual-core processors
 –700 MHz clock (relatively slow)

13 Figure 2.7 Logical organization of a BlueGene/L node.

14 Figure 2.8 BlueGene/L communication networks: (a) 3D torus for standard interprocessor data transfer; (b) collective network for fast evaluation of reductions.

15 How do these models differ?
–Shared address space available to all processors: Core Duo, Dual Core Opteron, Sun Fire E25K
–Distributed address space: HP cluster, BlueGene/L

16 Distributed vs. shared memory
–Shared memory seems easier and more natural
–But delays grow as more processors are added
–Coherence issues increase memory-reference time
–As a result, message passing can end up being the easier approach
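
The contrast is easiest to see in code. With a shared address space, any thread can simply read a variable another thread wrote (and the hardware quietly pays the coherence cost); with a distributed address space, the value has to be sent explicitly. The sketch below uses MPI, which is an assumption of this example rather than something the slides introduce.

```c
/* Message passing across a distributed address space: rank 1 cannot read
   rank 0's variable directly, so the value travels in an explicit message. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                  /* lives in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);       /* now a local copy */
    }

    MPI_Finalize();
    return 0;
}
```

Run with at least two processes (for example, mpirun -np 2 ./a.out); the explicitness is the point: every non-local transfer is visible in the source.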

17 Flynn's taxonomy
–SISD: single instruction, single data
–SIMD: single instruction, multiple data (as in the Cell's SPEs)
–MISD: multiple instruction, single data
–MIMD: multiple instruction, multiple data
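
A small illustration of the SIMD idea, one instruction producing several results at once. The x86 SSE intrinsics used here are an assumption of the example (other instruction sets have their own equivalents); they are not something the slides require.

```c
#include <xmmintrin.h>   /* SSE intrinsics */
#include <stdio.h>

int main(void) {
    float a[4] = {1, 2, 3, 4};
    float b[4] = {10, 20, 30, 40};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load four floats */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* one add instruction, four results */
    _mm_storeu_ps(c, vc);

    for (int i = 0; i < 4; i++)
        printf("%.0f ", c[i]);        /* 11 22 33 44 */
    printf("\n");
    return 0;
}
```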

18 Computing model: sequential
–The von Neumann architecture (the random access machine) lets us store instructions and data without concern for the details of the hardware
–This simplifies talking about programs

19 Figure 2.9 Two searching computations: (a) linear search, (b) binary search.
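
The figure itself is not reproduced in this transcript, so here is a minimal sketch of the two computations it refers to. Both return the index of key in an int array, or -1 if it is absent; binary search additionally assumes the array is sorted.

```c
/* (a) Linear search: O(n) comparisons, no ordering assumed. */
int linear_search(const int *a, int n, int key) {
    for (int i = 0; i < n; i++)
        if (a[i] == key)
            return i;
    return -1;
}

/* (b) Binary search: O(log n) comparisons on a sorted array. */
int binary_search(const int *a, int n, int key) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;  /* avoids overflow of lo + hi */
        if (a[mid] == key)
            return mid;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```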

20 PRAM model (parallel random access machine)
–A single, unbounded shared memory
–Processors follow their own threads of control
–They execute in lock step
–Ignores communication costs (not good)
–Does not guide programmers to the best solutions
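
A hypothetical illustration of reasoning in the PRAM style (the processor count and the problem are assumptions, not from the book): P processors sum 2P numbers in log2(P) + 1 lock-step rounds, and every shared-memory access is treated as costing one unit of time. The inner loops over p stand in for P processors acting simultaneously.

```c
#include <stdio.h>

#define P 8                      /* number of PRAM processors (assumed) */

int main(void) {
    int x[2 * P];
    for (int i = 0; i < 2 * P; i++)
        x[i] = i + 1;            /* 1..16, so the sum should be 136 */

    /* Round 0: every processor p adds one pair in the same step. */
    for (int p = 0; p < P; p++)
        x[p] = x[2 * p] + x[2 * p + 1];

    /* Later rounds: half as many processors stay active each step. */
    for (int active = P / 2; active >= 1; active /= 2)
        for (int p = 0; p < active; p++)
            x[p] = x[2 * p] + x[2 * p + 1];

    printf("sum = %d\n", x[0]);  /* prints 136 after log2(8) + 1 = 4 rounds */
    return 0;
}
```

The model makes this look almost free; on a real machine the shared reads in each round are exactly the non-local references whose cost the PRAM ignores.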

21 Candidate Type Architecture (CTA)
–Two types of memory references
 –Inexpensive local references
 –Expensive non-local references

22 CTA properties
–P processors, each sequentially executing local instructions
–Local memory access time is what it would be for a sequential computer
–Non-local memory access is 2 to 5 orders of magnitude longer
–A node has 1 or 2 active network transfers at any given time
–A global controller handles basic operations such as initiation and synchronization

23 Figure 2.10

24 Figure 2.11 Common topologies used for interconnection networks: (a) 2-D torus, (b) binary 3-cube (see Exercise 8), (c) fat tree, (d) omega network.

25 Figure 2.11 (cont.)

26 Figure 2.11 (cont.)

27 Table 2.1 Estimates for λ for common architectures; speeds generally do not include congestion or other traffic delays.

28 The properties of the CTA lead to the locality rule: maximize the number of local memory references and minimize the number of non-local memory references.
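
A hypothetical back-of-the-envelope calculation (the function and the sample counts are illustrative, not from the book) shows why the rule matters: with each non-local reference costing λ local references, even a small fraction of non-local traffic can dominate the predicted running time.

```c
#include <stdio.h>

/* Predicted time in units of one local memory reference: local references
   cost 1 each, non-local references cost lambda each. */
static double cta_time(double local_refs, double nonlocal_refs, double lambda) {
    return local_refs + lambda * nonlocal_refs;
}

int main(void) {
    double lambda = 1000.0;  /* assumed: 3 orders of magnitude, per the CTA */

    /* Two hypothetical versions of the same computation. */
    double mostly_local  = cta_time(1e6, 1e3, lambda);  /* 0.1% non-local */
    double mostly_remote = cta_time(1e6, 1e5, lambda);  /* 10%  non-local */

    printf("mostly local : %.2e units\n", mostly_local);   /* ~2e6 */
    printf("mostly remote: %.2e units\n", mostly_remote);  /* ~1e8 */
    return 0;
}
```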

29 Example: P processors each need a random number
–Solution 1: one processor generates the next random number
–Solution 2: send the seed to each processor; each generates its own number
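
Solution 2 follows the locality rule: after the seed arrives, every reference is local. The sketch below is only illustrative; the generator and the way the rank is mixed into the seed are assumptions, not the book's method.

```c
#include <stdio.h>

/* One step of a simple linear congruential generator
   (constants from Numerical Recipes). */
static unsigned int lcg_next(unsigned int state) {
    return state * 1664525u + 1013904223u;
}

/* What each processor would run, given the broadcast seed and its rank. */
static unsigned int local_random(unsigned int seed, int rank) {
    unsigned int state = seed ^ ((unsigned int)rank * 2654435761u); /* mix in rank */
    return lcg_next(lcg_next(state));
}

int main(void) {
    unsigned int seed = 12345u;            /* assumed to be sent to everyone once */
    for (int rank = 0; rank < 4; rank++)   /* stands in for P = 4 processors */
        printf("processor %d -> %u\n", rank, local_random(seed, rank));
    return 0;
}
```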

30 CTA is a good model
–It should scale
–It abstracts the general features of MIMD machines

