Slide 1: 4 Models of Parallel Processing
Topics in This Chapter
4.1 Development of Early Models
4.2 SIMD versus MIMD Architectures
4.3 Global versus Distributed Memory
4.4 The PRAM Shared-Memory Model
4.5 Distributed-Memory or Graph Models
4.6 Circuit Model and Physical Realizations
A.Broumandnia, Broumandnia@gmail.com

Slide 2: 4.1 Development of Early Models
Associative processing (AP) was perhaps the earliest form of parallel processing. It relies on associative or content-addressable memories (AMs, CAMs), which allow memory cells to be accessed based on their contents rather than their physical locations within the memory array. Early associative memories provided two basic capabilities:
1. Masked search: looking for a particular bit pattern in selected fields of all memory words and marking those for which a match is indicated.
2. Parallel write: storing a given bit pattern into selected fields of all memory words that have been previously marked.
These two basic capabilities, along with simple logical operations on mark vectors, suffice for programming sophisticated searches or even parallel arithmetic operations.
[Figure: memory array with comparison logic, driven by comparand and mask registers]
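The two basic capabilities above can be sketched in software. This is a minimal illustration, not the hardware design: words are modeled as Python integers, and the field layout, word width, and function names are assumptions for the example.

```python
def masked_search(words, comparand, mask):
    """Mark every word matching the comparand in the masked bit positions."""
    return [(w & mask) == (comparand & mask) for w in words]

def parallel_write(words, pattern, mask, marks):
    """Store the pattern's masked bits into every previously marked word."""
    return [(w & ~mask) | (pattern & mask) if m else w
            for w, m in zip(words, marks)]

words = [0b1011, 0b0011, 0b1111, 0b0100]
# Mark all words whose low 2 bits are 11:
marks = masked_search(words, comparand=0b0011, mask=0b0011)
# Clear the high 2 bits of every marked word:
words = parallel_write(words, pattern=0b0000, mask=0b1100, marks=marks)
```

Chaining these two primitives, with logical operations on the mark vector between steps, is exactly how more sophisticated searches are programmed on such machines.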

Slide 3: 4.1 Development of Early Models
Fig. 4.1 The Flynn-Johnson classification of computer systems.

Slide 4: 4.1 Development of Early Models
The SISD class encompasses standard uniprocessor systems, including those that employ pipelining, out-of-order execution, multiple instruction issue, and several functional units to achieve higher performance. Figure 4.2 shows an example parallel processor with the MISD architecture. A single data stream enters the machine, which consists of five processors. Various transformations are performed on each data item before it is passed on to the next processor(s). Successive data items can go through different transformations, either because of data-dependent conditional statements in the instruction streams (control-driven) or because of special control tags carried along with the data (data-driven). The MISD organization can thus be viewed as a flexible or high-level pipeline with multiple paths and programmable stages.
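The data-driven variant described above can be sketched as a pipeline in which each item carries a control tag that selects the transformation applied at every stage. The stages, tags, and transformations here are illustrative assumptions, not the machine of Fig. 4.2.

```python
def stage(transforms):
    """A programmable stage: the item's tag selects which transform runs."""
    def run(item):
        tag, value = item
        return tag, transforms[tag](value)
    return run

# Two stages, each offering a different transformation per tag:
pipeline = [
    stage({"a": lambda x: x + 1, "b": lambda x: x * 2}),
    stage({"a": lambda x: x * 10, "b": lambda x: x - 3}),
]

def process(item):
    for s in pipeline:
        item = s(item)
    return item

# Successive data items take different paths through the same stages:
results = [process(("a", 1)), process(("b", 1))]
```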

Slide 5: 4.2 SIMD versus MIMD Architectures
Most early parallel machines had SIMD designs. SIMD implies that a central unit fetches and interprets the instructions and then broadcasts appropriate control signals to a number of processors operating in lockstep. Within the SIMD category, two fundamental design choices exist:
1. Synchronous versus asynchronous SIMD.
2. Custom- versus commodity-chip SIMD.
Most modern parallel machines have MIMD designs. Within the MIMD class, three fundamental issues or design choices are subjects of ongoing debate in the research community:
1. MPP: massively or moderately parallel processor.
2. Tightly versus loosely coupled MIMD.
3. Explicit message passing versus virtual shared memory.

Slide 6: 4.3 Global versus Distributed Memory
Within the MIMD class of parallel processors, memory can be global or distributed. Global memory may be visualized as being in a central location where all processors can access it with equal ease. Figure 4.3 shows a possible hardware organization for a global-memory parallel processor.

Slide 7: 4.3 Global versus Distributed Memory
Processors can access memory through a special processor-to-memory network. A global-memory multiprocessor is characterized by the type and number p of processors, the capacity and number m of memory modules, and the network architecture. Examples for both the processor-to-memory and processor-to-processor networks include:
1. Crossbar switch: O(pm) complexity, and thus quite costly for highly parallel systems.
2. Single or multiple buses (the latter with complete or partial connectivity).
3. Multistage interconnection network (MIN): cheaper than (1), with more bandwidth than (2).

Slide 8: 4.3 Global versus Distributed Memory
One approach to reducing the amount of data that must pass through the processor-to-memory interconnection network is to use a private cache memory of reasonable size within each processor (Fig. 4.4). Caches reduce this traffic because of the locality of memory accesses: once a block has been brought into a processor's cache, subsequent accesses to it are satisfied locally without involving the network.
Challenge: cache coherence.

Slide 9: 4.3 Global versus Distributed Memory
However, the use of multiple caches gives rise to the cache coherence problem: multiple copies of data in the main memory and in various caches may become inconsistent. Here we need a more sophisticated approach, examples of which include:
1. Do not cache shared data at all, or allow only a single cache copy. If the volume of shared data is small and access to it infrequent, these policies work quite well.
2. Do not cache "writeable" shared data, or allow only a single cache copy. Read-only shared data can be placed in multiple caches with no complication.
3. Use a cache coherence protocol. This approach may introduce a nontrivial consistency enforcement overhead, depending on the coherence protocol used, but removes the above restrictions.
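Option 3 can be illustrated with a minimal write-invalidate sketch: before a write proceeds, all other cached copies of the location are invalidated, so a later read in another cache misses and fetches the fresh value. This is a toy model under simplifying assumptions (write-through, a dict per cache), not a full MSI/MESI protocol.

```python
class Bus:
    """Broadcast medium snooped by all caches."""
    def __init__(self):
        self.caches = []

    def invalidate(self, writer, addr):
        for c in self.caches:
            if c is not writer:
                c.lines.pop(addr, None)   # drop any other cached copy

class Cache:
    def __init__(self, bus, memory):
        self.lines = {}
        self.bus = bus
        self.memory = memory
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:        # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.invalidate(self, addr)   # kill stale copies first
        self.lines[addr] = value
        self.memory[addr] = value         # write-through for simplicity

memory = {0: 10}
bus = Bus()
c1, c2 = Cache(bus, memory), Cache(bus, memory)
c1.read(0); c2.read(0)   # both caches now hold address 0
c1.write(0, 99)          # c2's copy is invalidated, so c2 re-fetches 99
```

The "consistency enforcement overhead" mentioned above corresponds here to the invalidation broadcast on every write to a shared location.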

Slide 10: 4.3 Global versus Distributed Memory
Distributed-memory architectures can be conceptually viewed as in Fig. 4.5. A collection of p processors, each with its own private memory, communicates through an interconnection network. It is possible to view Fig. 4.5 as a special case of Fig. 4.4 in which the global-memory modules have been removed altogether; the fact that processors and (cache) memories appear in different orders is immaterial. This has led to the name all-cache or cache-only memory architecture (COMA) for such machines.
Some terminology:
UMA: uniform memory access (global shared memory)
NUMA: nonuniform memory access (distributed shared memory)
COMA: cache-only memory architecture

Slide 11: 4.4 The PRAM Shared-Memory Model
The theoretical model used for conventional or sequential computers (the SISD class) is known as the random-access machine (RAM). The parallel version of RAM, the PRAM (pronounced "pea-ram"), constitutes an abstract model of the class of global-memory parallel processors. The abstraction consists of ignoring the details of the processor-to-memory interconnection network and taking the view that each processor can access any memory location in each machine cycle, independent of what other processors are doing. Thus, for example, PRAM algorithms might involve statements like "for 0 ≤ i < p, Processor i adds the contents of memory location 2i + 1 to memory location 2i."
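The quoted PRAM statement can be sketched with one thread per processor over a shared list. The value of p and the memory contents are illustrative; since processor i writes only location 2i and merely reads 2i + 1, the processors touch disjoint locations and no conflict arises.

```python
from threading import Thread

p = 4
shared = list(range(2 * p))   # shared memory: [0, 1, 2, ..., 7]

def processor(i):
    # "Processor i adds the contents of location 2i + 1 to location 2i"
    shared[2 * i] += shared[2 * i + 1]

threads = [Thread(target=processor, args=(i,)) for i in range(p)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Each even location now holds the sum of its original pair.
```

This is the first step of a parallel reduction: repeating it with strides 2, 4, ... sums the whole array in O(log p) PRAM cycles.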

Slide 12: 4.4 The PRAM Shared-Memory Model
Even though the global-memory architecture was introduced as a subclass of the MIMD class, the abstract PRAM model depicted in Fig. 4.6 can be SIMD or MIMD. In the SIMD variant, all processors obey the same instruction in each machine cycle.

Slide 13: 4.4 The PRAM Shared-Memory Model
In view of the direct and independent access to every memory location allowed for each processor, the PRAM model depicted in Fig. 4.6 is highly theoretical. Fig. 4.7 shows a PRAM with some hardware details.
PRAM cycle:
1. All processors read memory locations of their choosing.
2. All processors compute one step independently.
3. All processors store results into memory locations of their choosing.

Slide 14: 4.5 Distributed-Memory or Graph Models
Given the internal processor and memory structures in each node, a distributed-memory architecture is characterized primarily by the network used to interconnect the nodes. This network is usually represented as a graph, with vertices corresponding to processor-memory nodes and edges corresponding to communication links. If communication links are unidirectional, then directed edges are used. Undirected edges imply bidirectional communication, although not necessarily in both directions at once.

Slide 15: 4.5 Distributed-Memory or Graph Models
Important parameters of an interconnection network include:
1. Network diameter: the longest of the shortest paths between various pairs of nodes.
2. Bisection (band)width: the smallest number (total capacity) of links that need to be cut in order to divide the network into two subnetworks of half the size.
3. Vertex or node degree: the number of communication ports required of each node.
Table 4.2 lists these three parameters for some of the commonly used interconnection networks; examples of some of these networks appear in Fig. 4.8.
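The first and third parameters can be computed by brute force for any small network, which is a handy way to check the closed-form entries in Table 4.2. As an assumed example, the sketch below builds an 8-node ring (1D torus) as an adjacency dict and verifies diameter = k/2 and degree = 2.

```python
from collections import deque

def diameter(adj):
    """Longest shortest path over all node pairs, via BFS from each node."""
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(bfs(v) for v in adj)

k = 8
ring = {i: [(i - 1) % k, (i + 1) % k] for i in range(k)}

deg = max(len(neighbors) for neighbors in ring.values())  # node degree
diam = diameter(ring)                                     # network diameter
# Bisection width is harder to brute-force (it is a min-cut over balanced
# partitions); for a ring, any split into two halves cuts exactly 2 links.
```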

Slide 16: Fig. 4.8 The sea of interconnection networks.

Slide 17: Some Interconnection Networks (Table 4.2)
------------------------------------------------------------------------------------------
Network name(s)            Number of nodes     Network     Bisection      Node       Local
                                               diameter    width          degree     links?
------------------------------------------------------------------------------------------
1D mesh (linear array)     k                   k - 1       1              2          Yes
1D torus (ring, loop)      k                   k/2         2              2          Yes
2D mesh                    k^2                 2k - 2      k              4          Yes
2D torus (k-ary 2-cube)    k^2                 k           2k             4          Yes(1)
3D mesh                    k^3                 3k - 3      k^2            6          Yes
3D torus (k-ary 3-cube)    k^3                 3k/2        2k^2           6          Yes(1)
Pyramid                    (4k^2 - 1)/3        2 log2 k    2k             9          No
Binary tree                2^l - 1             2l - 2      1              3          No
4-ary hypertree            2^l (2^(l+1) - 1)   2l          2^(l+1)        6          No
Butterfly                  2^l (l + 1)         2l          2^l            4          No
Hypercube                  2^l                 l           2^(l-1)        l          No
Cube-connected cycles      2^l l               2l          2^(l-1)        3          No
Shuffle-exchange           2^l                 2l - 1      >= 2^(l-1)/l   4 unidir.  No
De Bruijn                  2^l                 l           2^l / l        4 unidir.  No
------------------------------------------------------------------------------------------
(1) With folded layout.

Slide 18: 4.5 Distributed-Memory or Graph Models
Whereas direct interconnection networks of the types shown in Table 4.2 or Fig. 4.8 have led to many important classes of parallel processors, bus-based architectures still dominate small-scale parallel machines. Because a single bus can quickly become a performance bottleneck as the number of processors increases, a variety of multiple-bus architectures and hierarchical schemes (Fig. 4.9) are available for reducing bus traffic by taking advantage of the locality of communication within small clusters of processors.

