
1 CS 258 Parallel Computer Architecture Lecture 2 Convergence of Parallel Architectures January 25, 2002 Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs258

2 1/25/02 John Kubiatowicz Slide 2.2 Recap of Lecture 1 Parallel Comp. Architecture driven by familiar technological and economic forces –application/platform cycle, but focused on the most demanding applications –hardware/software learning curve More attractive than ever because the ‘best’ building block - the microprocessor - is also the fastest building block History of microprocessor architecture is parallelism –translates area and density into performance The Future is higher levels of parallelism –Parallel Architecture concepts apply at many levels –Communication also on exponential curve => Quantitative Engineering approach (Figure: cycle of New Applications, More Performance, Speedup)

3 1/25/02 John Kubiatowicz Slide 2.3 Architecture History (Figure: Application Software and System Software layered above SIMD, Message Passing, Shared Memory, Dataflow, and Systolic Arrays) Parallel architectures tied closely to programming models –Divergent architectures, with no predictable pattern of growth –Mid-80s renaissance

4 1/25/02 John Kubiatowicz Slide 2.4 Plan for Today Look at major programming models –where did they come from? –The 80s architectural renaissance! –What do they provide? –How have they converged? Extract general structure and fundamental issues (Figure: SIMD, Message Passing, Shared Memory, Dataflow, and Systolic Arrays converging to a Generic Architecture)

5 1/25/02 John Kubiatowicz Slide 2.5 Programming Model Conceptualization of the machine that the programmer uses in coding applications –How parts cooperate and coordinate their activities –Specifies communication and synchronization operations Multiprogramming –no communication or synch. at program level Shared address space –like a bulletin board Message passing –like letters or phone calls, explicit point to point Data parallel: –more regimented, global actions on data –Implemented with shared address space or message passing

6 1/25/02 John Kubiatowicz Slide 2.6 Shared Memory => Shared Addr. Space Range of addresses shared by all processors –All communication is implicit (through memory) –Want to communicate a bunch of info? Pass a pointer. Programming is “straightforward” –Generalization of multithreaded programming (Figure: processors connected to a shared memory)

7 1/25/02 John Kubiatowicz Slide 2.7 Historical Development “Mainframe” approach –Motivated by multiprogramming –Extends crossbar used for Mem and I/O –Processor cost-limited => crossbar –Bandwidth scales with p –High incremental cost »use multistage instead “Minicomputer” approach –Almost all microprocessor systems have bus –Motivated by multiprogramming, TP –Used heavily for parallel computing –Called symmetric multiprocessor (SMP) –Latency larger than for uniprocessor –Bus is bandwidth bottleneck »caching is key: coherence problem –Low incremental cost

8 1/25/02 John Kubiatowicz Slide 2.8 Adding Processing Capacity Memory capacity increased by adding modules I/O by controllers and devices Add processors for processing! – For higher-throughput multiprogramming, or parallel programs

9 1/25/02 John Kubiatowicz Slide 2.9 Shared Physical Memory Any processor can directly reference any location –Communication operation is load/store –Special operations for synchronization Any I/O controller - any memory Operating system can run on any processor, or all. – OS uses shared memory to coordinate What about application processes?

10 1/25/02 John Kubiatowicz Slide 2.10 Shared Virtual Address Space Process = address space plus thread of control Virtual-to-physical mapping can be established so that processes share portions of address space. –User-kernel or multiple processes Multiple threads of control in one address space. –Popular approach to structuring OS’s –Now standard application capability (ex: POSIX threads) Writes to shared address visible to other threads – Natural extension of the uniprocessor model – conventional memory operations for communication – special atomic operations for synchronization » also load/stores
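
To make the model concrete, here is a minimal sketch (mine, not the slides') of communication through a shared address space using POSIX threads: an ordinary variable carries the data, and a C11 atomic flag plays the role of the special synchronization operation the slide mentions. All names and the choice of <stdatomic.h> are illustrative assumptions.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Shared address space: both threads see the same 'data' and 'ready'. */
    static int data;                 /* communicated with ordinary stores/loads          */
    static atomic_int ready = 0;     /* special atomic operation used for synchronization */

    static void *producer(void *arg) {
        data = 42;                                            /* conventional memory operation */
        atomic_store_explicit(&ready, 1, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg) {
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                                                 /* spin until the write is visible */
        printf("consumer read %d\n", data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Note that no data is ever copied between address spaces: the consumer simply loads the location the producer stored to, which is exactly the "pass a pointer" style of communication described on slide 2.6.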

11 1/25/02 John Kubiatowicz Slide 2.11 Structured Shared Address Space Ad hoc parallelism used in system code Most parallel applications have structured SAS Same program on each processor –shared variable X means the same thing to each thread

12 1/25/02 John Kubiatowicz Slide 2.12 Cache Coherence Problem Caches are aliases for memory locations Does every processor eventually see the new value? Tightly related: Cache Consistency –In what order do writes appear to other processors? Buses make this easy: every processor can snoop on every write –Essential feature: Broadcast (Figure: two caches holding the same location’s value; after one processor writes, does the other’s next read miss and fetch the new value, or return the stale cached copy? What about write-through?)
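
As a toy illustration of the bus-based solution (my own sketch, not the slide's), the following models write-through caches with snoopy invalidation: every write updates memory and is broadcast on the bus, every other cache that holds the line snoops the write and drops its copy, so the next read misses and fetches the new value. Structure and names (CacheLine, cache_write, and so on) are hypothetical.

    #include <stdbool.h>
    #include <stdio.h>

    #define NPROC 2
    #define NADDR 8

    static int memory[NADDR];

    typedef struct { int addr; int value; bool valid; } CacheLine;  /* one-line cache per processor */
    static CacheLine cache[NPROC];

    /* Write-through with invalidate: update memory, broadcast on the bus,
       and every other cache holding the address drops its copy.            */
    static void cache_write(int proc, int addr, int value) {
        memory[addr] = value;
        cache[proc] = (CacheLine){ addr, value, true };
        for (int p = 0; p < NPROC; p++)
            if (p != proc && cache[p].valid && cache[p].addr == addr)
                cache[p].valid = false;                   /* snooped invalidation */
    }

    static int cache_read(int proc, int addr) {
        if (cache[proc].valid && cache[proc].addr == addr)
            return cache[proc].value;                     /* hit */
        cache[proc] = (CacheLine){ addr, memory[addr], true };  /* miss: refill from memory */
        return cache[proc].value;
    }

    int main(void) {
        cache_read(0, 3); cache_read(1, 3);   /* both caches now hold location 3      */
        cache_write(0, 3, 7);                 /* P0 writes; P1's copy is invalidated   */
        printf("P1 reads %d\n", cache_read(1, 3));  /* prints 7, not a stale value     */
        return 0;
    }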

13 1/25/02 John Kubiatowicz Slide 2.13 Engineering: Intel Pentium Pro Quad –All coherence and multiprocessing glue in processor module –Highly integrated, targeted at high volume –Low latency and bandwidth

14 1/25/02 John Kubiatowicz Slide 2.14 Engineering: SUN Enterprise Proc + mem card - I/O card –16 cards of either type –All memory accessed over bus, so symmetric –Higher bandwidth, higher latency bus

15 1/25/02 John Kubiatowicz Slide 2.15 Scaling Up –Problem is interconnect: cost (crossbar) or bandwidth (bus) –Dance-hall: bandwidth still scalable, but lower cost than crossbar »latencies to memory uniform, but uniformly large –Distributed memory or non-uniform memory access (NUMA) »Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response) –Caching shared (particularly nonlocal) data? (Figures: “dance hall” organization with processors and caches on one side of the network and all memories on the other, vs. distributed-memory organization with memory at each node; Omega network and general network)

16 1/25/02 John Kubiatowicz Slide 2.16 Stanford DASH Clusters of 4 processors share 2nd-level cache Up to 16 clusters tied together with 2-dim mesh 16-bit directory associated with every memory line –Each memory line has home cluster that contains DRAM –The 16-bit vector says which clusters (if any) have read copies –Only one writer permitted at a time Never got more than 12 clusters (48 processors) working at one time: asynchronous network problems! (Figure: four processors with L1 caches sharing an L2 cache)

17 1/25/02 John Kubiatowicz Slide 2.17 The MIT Alewife Multiprocessor Cache-coherent Shared Memory –Partially in Software! –Limited Directory + software overflow User-level Message-Passing Rapid Context-Switching 2-dimensional Asynchronous network One node/board Got 32 processors (+ I/O boards) working

18 1/25/02 John Kubiatowicz Slide 2.18 Engineering: Cray T3E –Scale up to 1024 processors, 480MB/s links –Memory controller generates request message for non-local references –No hardware mechanism for coherence »SGI Origin etc. provide this

19 1/25/02 John Kubiatowicz Slide 2.19 What is underlying Shared Memory? Packet-switched networks better utilize available link bandwidth than circuit-switched networks So, the network passes messages around! (Figure: generic organization, with processors, caches, and memories connected by a network)

20 1/25/02 John Kubiatowicz Slide 2.20 Message Passing Architectures Complete computer as building block, including I/O –Communication via Explicit I/O operations Programming model –direct access only to private address space (local memory), –communication via explicit messages (send/receive) High-level block diagram –Communication integration? »Mem, I/O, LAN, Cluster –Easier to build and scale than SAS Programming model more removed from basic hardware operations –Library or OS intervention (Figure: complete computers, each with processor, cache, and memory, connected by a network)

21 1/25/02 John Kubiatowicz Slide 2.21 Message-Passing Abstraction –Send specifies buffer to be transmitted and receiving process –Recv specifies sending process and application storage to receive into –Memory to memory copy, but need to name processes –Optional tag on send and matching rule on receive –User process names local data and entities in process/tag space too –In simplest form, the send/recv match achieves pairwise synch event »Other variants too –Many overheads: copying, buffer management, protection (Figure: process P executes Send(X, Q, t) and process Q executes Receive(Y, P, t); the tag t matches the send to the receive, copying from address X in P’s local address space to address Y in Q’s)
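
The slides do not name a concrete library, but the same abstraction is what MPI's point-to-point operations provide. Here is a minimal sketch of the P/Q exchange in the figure; rank numbers, the tag value, and variable names are my own choices, and it assumes the program is launched with at least two ranks.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, x = 0, y = 0, tag = 7;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* "process P": names its local data X and the receiver */
            x = 42;
            MPI_Send(&x, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* "process Q": names its local buffer Y and the sender  */
            MPI_Recv(&y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", y);  /* memory-to-memory copy done; tags matched */
        }
        MPI_Finalize();
        return 0;
    }

Running it with something like mpirun -np 2 exercises the pairwise match: the receive names the sender and tag, so the two calls rendezvous exactly as in the slide's diagram.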

22 1/25/02 John Kubiatowicz Slide 2.22 Evolution of Message-Passing Machines Early machines: FIFO on each link –HW close to programming model –synchronous ops –topology central (hypercube algorithms) Caltech Cosmic Cube (Seitz, CACM, Jan. 1985)

23 1/25/02 John Kubiatowicz Slide 2.23 MIT J-Machine (Jelly-bean machine) 3-dimensional network topology –Non-adaptive, e-cube routing –Hardware routing –Maximize density of communication 64 nodes/board, 1024 nodes total Low-powered processors –Message passing instructions –Associative array primitives to aid in synthesizing shared-address space Extremely fine-grained communication –Hardware-supported Active Messages

24 1/25/02 John Kubiatowicz Slide 2.24 Diminishing Role of Topology? Shift to general links –DMA, enabling non-blocking ops »Buffered by system at destination until recv –Store&forward routing Fault-tolerant, multi-path routing: Diminishing role of topology –Any-to-any pipelined routing –node-network interface dominates communication time –Simplifies programming –Allows richer design space »grids vs hypercubes Store&forward: H x (T0 + n/B) vs. pipelined: T0 + H x D + n/B (H = hops, T0 = startup overhead, n = message size, B = link bandwidth, D = per-hop routing delay) Intel iPSC/1 -> iPSC/2 -> iPSC/860
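
A quick numeric sketch of the two cost models above; the parameter values are invented purely for illustration.

    #include <stdio.h>

    /* Store-and-forward: the whole message is received at each of H hops.    */
    /* Pipelined (cut-through): only a small per-hop routing delay D is paid. */
    static double store_and_forward(double H, double T0, double n, double B) {
        return H * (T0 + n / B);
    }
    static double pipelined(double H, double T0, double n, double B, double D) {
        return T0 + H * D + n / B;
    }

    int main(void) {
        double H = 10, T0 = 50e-6, n = 4096, B = 100e6, D = 1e-6;  /* illustrative values */
        printf("store&forward: %.1f us\n", store_and_forward(H, T0, n, B) * 1e6);
        printf("pipelined:     %.1f us\n", pipelined(H, T0, n, B, D) * 1e6);
        /* With these numbers: 10 * (50 + 40.96) is about 910 us, versus          */
        /* 50 + 10 + 40.96 which is about 101 us, so pipelined latency is nearly  */
        /* independent of the hop count H, which is why topology matters less.    */
        return 0;
    }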

25 1/25/02 John Kubiatowicz Slide 2.25 Example Intel Paragon

26 1/25/02 John Kubiatowicz Slide 2.26 Building on the mainstream: IBM SP-2 Made out of essentially complete RS6000 workstations Network interface integrated in I/O bus (bandwidth limited by I/O bus)

27 1/25/02 John Kubiatowicz Slide 2.27 Berkeley NOW 100 Sun Ultra2 workstations Intelligent network interface –proc + mem Myrinet Network –160 MB/s per link –300 ns per hop

28 1/25/02 John Kubiatowicz Slide 2.28 Data Parallel Systems Programming model –Operations performed in parallel on each element of data structure –Logically single thread of control, performs sequential or parallel steps –Conceptually, a processor associated with each data element Architectural model –Array of many simple, cheap processors with little memory each »Processors don’t sequence through instructions –Attached to a control processor that issues instructions –Specialized and general communication, cheap global synchronization Original motivations – Matches simple differential equation solvers – Centralize high cost of instruction fetch/sequencing

29 1/25/02 John Kubiatowicz Slide 2.29 Application of Data Parallelism –Each PE contains an employee record with his/her salary If salary > 100K then salary = salary * 1.05 else salary = salary * 1.10 –Logically, the whole operation is a single step –Some processors enabled for arithmetic operation, others disabled Other examples: –Finite differences, linear algebra,... –Document searching, graphics, image processing,... Some recent machines: –Thinking Machines CM-1, CM-2 (and CM-5) –Maspar MP-1 and MP-2
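
A minimal sketch of the salary example written as a plain C loop (array contents and size are made up): conceptually each iteration corresponds to one PE, and the if/else plays the role of the enable mask the slide describes, with some PEs enabled for one scale factor and the rest for the other.

    #include <stdio.h>

    #define N 8

    int main(void) {
        /* One "employee record" per conceptual processing element. */
        double salary[N] = { 50e3, 120e3, 80e3, 200e3, 95e3, 101e3, 30e3, 150e3 };

        /* Logically a single step: every element is updated "at once".      */
        /* PEs where the condition holds apply one factor; the others are    */
        /* disabled for that operation and apply the alternative factor.     */
        for (int i = 0; i < N; i++) {
            if (salary[i] > 100e3)
                salary[i] *= 1.05;
            else
                salary[i] *= 1.10;
        }

        for (int i = 0; i < N; i++)
            printf("%.2f\n", salary[i]);
        return 0;
    }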

30 1/25/02 John Kubiatowicz Slide 2.30 Connection Machine (Tucker, IEEE Computer, Aug. 1988)

31 1/25/02 John Kubiatowicz Slide 2.31 Evolution and Convergence SIMD Popular when cost savings of centralized sequencer high –60s when CPU was a cabinet –Replaced by vectors in mid-70s »More flexible w.r.t. memory layout and easier to manage –Revived in mid-80s when 32-bit datapath slices just fit on chip Simple, regular applications have good locality Programming model converges with SPMD (single program multiple data) –need fast global synchronization –Structured global address space, implemented with either SAS or MP

32 1/25/02 John Kubiatowicz Slide 2.32 CM-5 Repackaged SparcStation –4 per board Fat-Tree network Control network for global synchronization

33 1/25/02 John Kubiatowicz Slide 2.33 (Figure: SIMD, Message Passing, Shared Memory, Dataflow, and Systolic Arrays converging to a Generic Architecture)

34 1/25/02 John Kubiatowicz Slide 2.34 Dataflow Architectures Represent computation as a graph of essential dependences –Logical processor at each node, activated by availability of operands –Messages (tokens) carrying tag of next instruction sent to next processor –Tag compared with others in matching store; match fires execution
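
A heavily simplified toy sketch (entirely my own) of the matching-store idea: each token carries the tag of a two-operand instruction, and that instruction fires only when both of its tokens have arrived, regardless of the order or timing of their arrival.

    #include <stdbool.h>
    #include <stdio.h>

    #define NINSTR 4

    /* Matching store: one waiting slot per (two-operand) instruction tag. */
    typedef struct { bool present; double value; } Slot;
    static Slot waiting[NINSTR];

    /* Firing an instruction here just prints; a real machine would execute   */
    /* the opcode and emit result tokens carrying the next instruction's tag. */
    static void fire(int tag, double a, double b) {
        printf("instruction %d fires with operands %g, %g -> %g\n", tag, a, b, a + b);
    }

    /* A token arrives: either it waits for its partner or it completes a match. */
    static void token_arrives(int tag, double value) {
        if (!waiting[tag].present) {
            waiting[tag].present = true;        /* first operand: park it in the store */
            waiting[tag].value = value;
        } else {
            waiting[tag].present = false;       /* second operand: match fires execution */
            fire(tag, waiting[tag].value, value);
        }
    }

    int main(void) {
        token_arrives(2, 1.5);   /* operand order and timing do not matter: */
        token_arrives(0, 4.0);   /* an instruction is activated purely by   */
        token_arrives(2, 2.5);   /* the availability of its operands.       */
        token_arrives(0, 6.0);
        return 0;
    }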

35 1/25/02 John Kubiatowicz Slide 2.35 Evolution and Convergence Key characteristics –Ability to name operations, synchronization, dynamic scheduling Problems –Operations have locality across them, useful to group together –Handling complex data structures like arrays –Complexity of matching store and memory units –Expose too much parallelism (?) Converged to use conventional processors and memory –Support for large, dynamic set of threads to map to processors –Typically shared address space as well –But separation of progr. model from hardware (like data-parallel) Lasting contributions: –Integration of communication with thread (handler) generation –Tightly integrated communication and fine-grained synchronization –Remained useful concept for software (compilers etc.)

36 1/25/02 John Kubiatowicz Slide 2.36 Systolic Architectures VLSI enables inexpensive special-purpose chips –Represent algorithms directly by chips connected in regular pattern –Replace single processor with array of regular processing elements –Orchestrate data flow for high throughput with less memory access Different from pipelining –Nonlinear array structure, multidirectional data flow, each PE may have (small) local instruction and data memory Different from SIMD: each PE may do something different

37 1/25/02 John Kubiatowicz Slide 2.37 Systolic Arrays (contd.) –Practical realizations (e.g. iWARP) use quite general processors »Enable variety of algorithms on same hardware –But dedicated interconnect channels »Data transfer directly from register to register across channel –Specialized, and same problems as SIMD »General purpose systems work well for same algorithms (locality etc.) Example: Systolic array for 1-D convolution
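
The convolution figure itself is not reproduced here; instead, a small software sketch (my own, with made-up filter weights and inputs) of the systolic idea: the weights stay resident in the PEs, inputs and partial sums move register-to-register every cycle with no memory traffic, and once the pipeline fills a finished output emerges from the last PE each cycle.

    #include <stdio.h>

    #define K 3                 /* filter taps = number of PEs */
    #define N 8                 /* input length */

    int main(void) {
        double w[K] = { 0.25, 0.5, 0.25 };          /* each weight is fixed in one PE   */
        double x[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };   /* inputs stream through the array  */

        /* Per-PE registers: one x register, and a two-stage partial-sum delay  */
        /* (ya, yb), so partial sums move at half the speed of the inputs and   */
        /* each output y[i] picks up w[p] * x[i+p] as it drifts down the array. */
        double xr[K] = {0}, ya[K] = {0}, yb[K] = {0};
        double nxr[K], nya[K], nyb[K];

        for (int t = 0; t < N + 2 * (K - 1); t++) {       /* one iteration = one cycle */
            double xin = (t < N) ? x[t] : 0.0;
            for (int p = 0; p < K; p++) {
                double x_from_left = (p == 0) ? xin : xr[p - 1];
                double y_from_left = (p == 0) ? 0.0 : yb[p - 1];  /* new partial sum enters as 0 */
                nxr[p] = x_from_left;                             /* register-to-register transfer */
                nya[p] = y_from_left + w[p] * x_from_left;        /* multiply-accumulate            */
                nyb[p] = ya[p];                                   /* extra delay stage              */
            }
            for (int p = 0; p < K; p++) { xr[p] = nxr[p]; ya[p] = nya[p]; yb[p] = nyb[p]; }

            int i = t - 2 * (K - 1);              /* output y[i] leaves the last PE this cycle */
            if (i >= 0 && i <= N - K)
                printf("y[%d] = %g\n", i, ya[K - 1]);  /* y[i] = sum over p of w[p] * x[i+p]   */
        }
        return 0;
    }

The two-stage partial-sum delay is the design choice doing the work: because results move slower than inputs, each partial sum meets successive input values at successive PEs, which is how the regular "pumping" of data replaces repeated memory accesses.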

38 1/25/02 John Kubiatowicz Slide 2.38 Toward Architectural Convergence Evolution and role of software have blurred boundary –Send/recv supported on SAS machines via buffers –Can construct global address space on MP (GA -> P | LA) –Page-based (or finer-grained) shared virtual memory Hardware organization converging too –Tighter NI integration even for MP (low-latency, high-bandwidth) –Hardware SAS passes messages Even clusters of workstations/SMPs are parallel systems –Emergence of fast system area networks (SAN) Programming models distinct, but organizations converging –Nodes connected by general network and communication assists –Implementations also converging, at least in high-end machines
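
One way to read the "(GA -> P | LA)" note above: a global address is simply a processor (node) number concatenated with a local address, which software or the communication assist decodes and, for non-local references, turns into a read-request/read-response exchange with the owning node. A minimal decoding sketch, with an arbitrary 32-bit split chosen purely for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define LOCAL_BITS 32   /* illustrative split: low bits = local address, high bits = node */

    static inline unsigned ga_node(uint64_t ga)  { return (unsigned)(ga >> LOCAL_BITS); }
    static inline uint64_t ga_local(uint64_t ga) { return ga & ((1ULL << LOCAL_BITS) - 1); }

    /* A reference to a non-local portion of the shared address space becomes */
    /* a message exchange with the owning node; a local reference is just an  */
    /* ordinary memory access.                                                */
    int main(void) {
        uint64_t ga = ((uint64_t)5 << LOCAL_BITS) | 0x1000;   /* node 5, local offset 0x1000 */
        printf("node %u, local address 0x%llx\n",
               ga_node(ga), (unsigned long long)ga_local(ga));
        return 0;
    }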

39 1/25/02 John Kubiatowicz Slide 2.39 Convergence: Generic Parallel Architecture Node: processor(s), memory system, plus communication assist –Network interface and communication controller Scalable network Convergence allows lots of innovation, within framework –Integration of assist with node, what operations, how efficiently...

40 1/25/02 John Kubiatowicz Slide 2.40 Flynn’s Taxonomy # instruction streams x # data streams –Single Instruction Single Data (SISD) –Single Instruction Multiple Data (SIMD) –Multiple Instruction Single Data (MISD) –Multiple Instruction Multiple Data (MIMD) Everything is MIMD!

41 1/25/02 John Kubiatowicz Slide 2.41 Conclusion Several major types of communication: –Shared Memory –Message Passing –Data-Parallel –Systolic –DataFlow Is communication “Turing-complete”? –Can simulate each of these on top of the other! Many tradeoffs in hardware support Communication is a first-class citizen! –How to perform communication is essential »IS IT IMPLICIT or EXPLICIT? –What to do with communication errors? –Does locality matter???

