
1 Multiprocessors Flynn's classification Vector computers
Topics: pipelining in vector computers, Cray, multiprocessor interconnection, general-purpose multiprocessors, data flow computers.

2 The Big Picture: Where are We Now?
The major issue is this: We’ve taken copies of the contents of main memory and put them in caches closer to the processors. But what happens to those copies if someone else wants to use the main memory data? How do we keep all copies of the data in sync with each other?

3 The Multiprocessor Picture
[Diagram: example Pentium system organization, showing the processor/memory bus, the PCI bus, and the I/O buses.]

4 Why Buy a Multiprocessor?
Multiple users. Multiple applications. Multitasking within an application. Responsiveness and/or throughput.

5 Multiprocessor Architectures
Message-Passing Architectures: a separate address space for each processor; processors communicate via message passing. Shared-Memory Architectures: a single address space shared by all processors; processors communicate by memory reads and writes; organized as SMP or NUMA; cache coherence is an important issue.

6 Message-Passing Architecture
[Diagram: nodes, each containing a processor, cache, and private memory, connected by an interconnection network.]

7 Shared-Memory Architecture
[Diagram: processors 1 through N, each with its own cache, connected through an interconnection network to memories 1 through M.]

8 Shared-Memory Architecture: SMP and NUMA
SMP = Symmetric Multiprocessor. All memory is equally close to all processors; the typical interconnection network is a shared bus. Easier to program, but doesn’t scale to many processors. NUMA = Non-Uniform Memory Access. Each memory is closer to some processors than others; a.k.a. “Distributed Shared Memory”. The typical interconnection is a grid or hypercube. Harder to program, but scales to more processors.

9 Shared Memory Multiprocessor
[Diagram: several processors, each with registers and caches, sharing one memory, disk, and other I/O through a chipset.] Memory: centralized, with uniform memory access time (“UMA”) and a bus interconnect to memory and I/O. Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro.

10 Shared Memory Multiprocessor
Several processors share one address space: conceptually a shared memory, though it is often implemented just like a multicomputer, with the address space distributed over private memories. Communication is implicit: reads and writes to shared memory locations. Synchronization is via shared memory locations: spin-waiting for a location to become non-zero, or barriers. [Diagram: conceptual model of processors P connected by a network/bus to memory M.]
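To make the spin-waiting idea concrete, here is a minimal C sketch using C11 atomics and POSIX threads; the flag and thread names are illustrative, not from the slides.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int flag = 0;   /* shared memory location used for synchronization */
    int shared_data = 0;   /* data published via the flag */

    void *producer(void *arg) {
        (void)arg;
        shared_data = 42;                                      /* write the data first */
        atomic_store_explicit(&flag, 1, memory_order_release); /* then publish it */
        return NULL;
    }

    void *consumer(void *arg) {
        (void)arg;
        /* spin-wait until the shared location becomes non-zero */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;
        printf("consumer saw %d\n", shared_data);  /* guaranteed to see 42 */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }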

11 Message Passing Multicomputers
Computers (nodes) connected by a network, with a fast network interface supporting send, receive, and barrier operations. The nodes are no different from a regular PC or workstation. A cluster is a set of conventional workstations or PCs with a fast network (“cluster computing”). Examples: Berkeley NOW, IBM SP2. [Diagram: nodes, each with a processor P and memory M, connected by a network.]
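A minimal message-passing sketch in C using the standard MPI send/receive/barrier operations (this assumes an MPI implementation such as MPICH or Open MPI is available; the variable names are illustrative):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* explicit send: there is no shared address space between nodes */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Barrier(MPI_COMM_WORLD);  /* all nodes synchronize here */
        MPI_Finalize();
        return 0;
    }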

12 Large-Scale MP Designs
Memory: distributed, with non-uniform memory access time (“NUMA”) and a scalable interconnect (distributed memory). [Diagram: processors and distributed memories on a scalable interconnection network, with access latencies of 40 and 100 cycles marked at different distances; the interconnect is labeled “Low Latency, High Reliability”.]

13 Shared Memory Architectures
In this section we will understand the issues around: sharing one memory space among several processors, and maintaining coherence among several copies of a data item.

14 The Problem of Cache Coherency
[Diagram: a CPU with cache, memory, and I/O, shown in three states. (a) Cache and memory coherent: A’ = A = 100, B’ = B = 200. (b) The CPU writes A’ = 550 in a write-back cache while memory still holds A = 100, so I/O output of A gives the stale value 100: A’ ≠ A. (c) I/O inputs 440 into memory location B while the cache still holds B’ = 200: B’ ≠ B.]
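Scenario (b) can be mimicked with a toy software model of a write-back cache. This C sketch is purely illustrative: the structures and names are invented for the example and are not part of any real coherence protocol.

    #include <stdio.h>

    /* Toy model: one memory word and a private "cache" copy of it. */
    int memory_A = 100;

    typedef struct { int A; int dirty; } Cache;

    /* CPU write with a write-back policy: update only the local cache copy. */
    void cpu_write(Cache *c, int value) {
        c->A = value;
        c->dirty = 1;          /* memory is now stale until a write-back occurs */
    }

    int main(void) {
        Cache cpu1 = { memory_A, 0 };
        cpu_write(&cpu1, 550); /* CPU writes A' = 550 into its write-back cache */

        /* I/O reads main memory directly and sees the stale value. */
        printf("I/O output of A: %d (cache holds %d)\n", memory_A, cpu1.A);
        return 0;
    }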

15 Some Simple Definitions
Mechanism: Write Back. How it works: write modified data from cache to memory only when necessary. Performance: good, because it doesn’t tie up memory bandwidth. Coherency issues: the various copies can contain different values.
Mechanism: Write Through. How it works: write modified data from cache to memory immediately. Performance: not so good; uses a lot of memory bandwidth. Coherency issues: modified values are always written to memory, so the data always matches.
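A toy C sketch contrasting the two store policies; this is a simplified model with invented structures, not a cycle-accurate cache, but it shows why write-through consumes far more memory bandwidth.

    #include <stdio.h>

    enum Policy { WRITE_THROUGH, WRITE_BACK };

    typedef struct {
        int data, dirty;
        enum Policy policy;
    } CacheLine;

    int memory_writes = 0;   /* counts memory-bandwidth consumption */
    int memory_word  = 0;

    void store(CacheLine *line, int value) {
        line->data = value;
        if (line->policy == WRITE_THROUGH) {
            memory_word = value;     /* every store goes to memory */
            memory_writes++;
        } else {
            line->dirty = 1;         /* defer: memory updated only on eviction */
        }
    }

    void evict(CacheLine *line) {
        if (line->policy == WRITE_BACK && line->dirty) {
            memory_word = line->data;
            memory_writes++;
            line->dirty = 0;
        }
    }

    int main(void) {
        CacheLine wb = {0, 0, WRITE_BACK};
        for (int i = 1; i <= 100; i++) store(&wb, i);   /* 100 stores */
        evict(&wb);
        printf("write-back:    %d memory writes\n", memory_writes);  /* 1 */

        memory_writes = 0;
        CacheLine wt = {0, 0, WRITE_THROUGH};
        for (int i = 1; i <= 100; i++) store(&wt, i);
        printf("write-through: %d memory writes\n", memory_writes);  /* 100 */
        return 0;
    }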

16 What Does Coherency Mean?
Informally: “Any read must return the most recent write.” This is too strict and too difficult to implement. Better: “Any write must eventually be seen by a read,” and all writes are seen in proper order (“serialization”). Two rules ensure this: (1) if P writes x and P1 then reads it, P’s write will be seen by P1 provided the read and write are sufficiently far apart; (2) writes to a single location are serialized: they are seen in one order, and the latest write will be seen. Otherwise a processor could see writes in an illogical order (seeing an older value after a newer value).
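To make rule (2) concrete, here is a small C sketch using C11 atomics; the names are illustrative, and the assertion encodes exactly the serialization property for a single location (a later read of the same location may never return an older value than an earlier read).

    #include <pthread.h>
    #include <stdatomic.h>
    #include <assert.h>

    atomic_int x = 0;   /* single shared location */

    void *writer(void *arg) {
        (void)arg;
        /* Writes to one location happen in a single global order: 1 then 2. */
        atomic_store(&x, 1);
        atomic_store(&x, 2);
        return NULL;
    }

    void *reader(void *arg) {
        (void)arg;
        int first  = atomic_load(&x);
        int second = atomic_load(&x);
        /* Serialization: seeing 2 and then 1 would be an illogical order. */
        assert(!(first == 2 && second == 1));
        return NULL;
    }

    int main(void) {
        pthread_t w, r;
        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }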

17 Vector Computers
Outline: vector processing overview; vector metrics and terms; greater efficiency than superscalar processors; examples: the CRAY-1 (1976, 1979), the first vector-register supercomputer; multimedia extensions to high-performance PC processors; a modern multi-vector-processor supercomputer, the NEC ESS; design features of vector supercomputers; conclusions.

18 Vector Programming Model
[Diagram: scalar registers r0 through r15; vector registers v0 through v15, each holding elements [0] through [VLRMAX-1]; and a vector length register VLR. A vector arithmetic instruction such as ADDV v3, v1, v2 adds elements [0] through [VLR-1] of v1 and v2 into v3. A vector load/store instruction such as LV v1, r1, r2 moves data between memory and vector register v1, with the base address in r1 and the stride in r2.]

19 Vector Code Example

# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      LI R4, 64
loop: L.D F0, 0(R1)
      L.D F2, 0(R2)
      ADD.D F4, F2, F0
      S.D F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ R4, loop

# Vector Code
      LI VLR, 64
      LV V1, R1
      LV V2, R2
      ADDV.D V3, V1, V2
      SV V3, R3

20 Vector Arithmetic Execution
Use a deep pipeline (=> fast clock) to execute the element operations. Control of the deep pipeline is simple because the elements in a vector are independent (=> no hazards!). [Diagram: a six-stage multiply pipeline computing V3 <- V1 * V2.]
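The independence that makes deep pipelining safe is visible in the element loop itself. In this illustrative C sketch, no iteration reads a value produced by another, so a compiler or vector unit can keep all of them in flight at once:

    #include <stddef.h>

    /* Element-wise multiply: v3[i] depends only on v1[i] and v2[i].
       'restrict' asserts the arrays don't overlap, so there are no
       cross-iteration hazards and the operations can be fully pipelined. */
    void vmul(double *restrict v3, const double *restrict v1,
              const double *restrict v2, size_t n) {
        for (size_t i = 0; i < n; i++)
            v3[i] = v1[i] * v2[i];
    }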

21 Vector Instruction Set Advantages
Compact: one short instruction encodes N operations => N× FLOP bandwidth. Expressive: it tells the hardware that these N operations are independent, use the same functional unit, access disjoint registers, access registers in the same pattern as previous instructions, and either access a contiguous block of memory (unit-stride load/store) or access memory in a known pattern (strided load/store). Scalable: the same object code can run on more parallel pipelines, or lanes.
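Unit-stride versus strided access can be seen in this C sketch (the array and function names are illustrative): a stride-1 walk touches a contiguous block, while a column walk through a row-major matrix touches memory with a known, fixed stride, which is exactly what a strided vector load expresses.

    #include <stddef.h>

    #define N 64

    /* Unit stride: a[0], a[1], a[2], ... are contiguous in memory. */
    double sum_unit_stride(const double a[N]) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += a[i];
        return s;
    }

    /* Known stride: walking one column of a row-major N x N matrix touches
       every N-th word, i.e. stride N. A strided vector load (LV with a
       stride register) expresses exactly this pattern. */
    double sum_column(const double m[N][N], size_t col) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            s += m[i][col];
        return s;
    }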

22 Properties of Vector Processors
Each result is independent of the previous result => long pipelines and a high clock rate, with the compiler ensuring there are no dependencies. Vector instructions access memory with a known pattern => a highly interleaved memory amortizes the latency over 64-plus elements => no (data) caches required! (but an instruction cache is still used). Reduces branches and branch problems in pipelines. A single vector instruction implies lots of work (≈ a whole loop) => fewer instruction fetches.

23 Supercomputers
Definitions of a supercomputer: the fastest machine in the world at a given task; a device to turn a compute-bound problem into an I/O-bound problem; any machine costing $30M+; any machine designed by Seymour Cray. The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

24 Supercomputer Applications
Typical application areas: military research (nuclear weapons, cryptography), scientific research, weather forecasting, oil exploration, and industrial design (car crash simulation). All involve huge computations on large data sets. In the 70s-80s, supercomputer ≡ vector machine.

25 Vector Supercomputers
Epitomized by the Cray-1 (1976): a scalar unit plus vector extensions; load/store architecture; vector registers; vector instructions; hardwired control; highly pipelined functional units; interleaved memory system; no data caches; no virtual memory.

26 Cray-1 (1976)
[Photo: the Cray-1 machine.]

27 Cray-1 Architecture
[Diagram: Cray-1 datapath. Eight 64-element vector registers (V0-V7), with vector mask and vector length registers, feed pipelined FP add, FP multiply, and FP reciprocal units. Scalar registers S0-S7 (backed by 64 T registers) feed integer add, logic, shift, and population-count units; address registers A0-A7 (backed by 64 B registers) feed address add and address multiply units. A single-port memory of 16 banks of 64-bit words, plus 8-bit SECDED, delivers 80 MW/sec for data loads/stores and 320 MW/sec for refilling the four instruction buffers. Memory bank cycle: 50 ns; processor cycle: 12.5 ns (80 MHz).]

28 Vector Memory System
Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency. Bank busy time: the number of cycles required between accesses to the same bank. [Diagram: an address generator combines a base and a stride to spread vector-register accesses across memory banks 0 through F.]

29 Vector Instruction Execution
ADDV C, A, B. [Diagram: with one pipelined functional unit, elements C[0], C[1], C[2], ... enter the pipeline one per cycle. With four pipelined functional units (lanes), the elements are interleaved across the lanes: lane 0 handles C[0], C[4], C[8], ..., lane 1 handles C[1], C[5], C[9], ..., and so on, completing four elements per cycle.]
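The lane interleaving can be expressed directly in C; this is a conceptual sketch in which the lane count is a parameter and element i is served by lane i mod NLANES (in hardware the four lane loops run in parallel, not sequentially).

    #define NLANES 4

    /* Conceptual model of ADDV C,A,B on a 4-lane vector unit: each lane
       owns the elements whose index is congruent to its lane number. */
    void addv_lanes(double *c, const double *a, const double *b, int n) {
        for (int lane = 0; lane < NLANES; lane++)      /* hardware: parallel */
            for (int i = lane; i < n; i += NLANES)     /* elements i % 4 == lane */
                c[i] = a[i] + b[i];
    }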

30 History of Microprocessors
1950s: IBM institutes a research program. 1964: release of System/360. Mid-1970s: improved measurement tools demonstrate the shortcomings of CISC, and the 801 RISC microprocessor is developed, led by Joel Birnbaum. 1984: MIPS is developed at Stanford, alongside related projects at Berkeley. 1988: RISC processors have taken over the high end of the workstation market. Early 1990s: IBM’s POWER (Performance Optimization With Enhanced RISC) architecture is introduced with the RISC System/6000; the AIM (Apple, IBM, Motorola) alliance forms, resulting in PowerPC.

31 What is CISC?
A complex instruction set computer (CISC, pronounced like "sisk") is a microprocessor instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction. The philosophy behind it is that hardware is always faster than software, so one should build a powerful instruction set that lets programmers do a lot with short assembly programs. The primary goal of CISC is thus to complete a task in as few lines of assembly as possible. Most common microprocessor designs, such as the Intel 80x86 and the Motorola 68K series, followed the CISC philosophy.

32 Why CISC?
Memory in those days was expensive: a bigger program meant more storage, which meant more money. Hence the need to reduce the number of instructions per program. The instruction count is reduced by packing multiple operations into a single instruction. Multiple operations lead to many different kinds of instructions that access memory, in turn making instruction length variable and fetch-decode-execute time unpredictable, i.e. more complex. Thus the hardware handles the complexity.

33 CISC Philosophy
Use microcode: a simplified microcode instruction set controls the data path logic; this type of implementation is known as a microprogrammed implementation. Build rich instruction sets: a consequence of using a microprogrammed design is that designers could build more functionality into each instruction. Build high-level instruction sets: the logical next step was to build instruction sets that map directly from high-level languages.

34 Characteristics of a CISC design
Register-to-register, register-to-memory, and memory-to-register commands. Multiple addressing modes. Variable-length instructions, where the length often varies according to the addressing mode. Instructions that require multiple clock cycles to execute.

35 Addressing Modes
Immediate; Direct; Indirect; Register; Register Indirect; Displacement (Indexed); Stack.

36 Immediate Addressing
The operand is part of the instruction: operand = address field. E.g. ADD 5 adds 5 to the contents of the accumulator; 5 is the operand. No memory reference is needed to fetch the data, so it is fast, but the range is limited. [Diagram: instruction layout with an opcode followed by the operand itself.]

37 Direct Addressing
The address field contains the address of the operand: effective address (EA) = address field (A). E.g. ADD A adds the contents of cell A to the accumulator: look in memory at address A for the operand. A single memory reference accesses the data, with no additional calculation to work out the effective address, but the address space is limited.

38 Direct Addressing Diagram
[Diagram: the instruction's address field A points directly at the operand in memory.]

39 Indirect Addressing
The memory cell pointed to by the address field contains the address of (a pointer to) the operand: EA = (A). Look in A, find the address (A), and look there for the operand. E.g. ADD (A) adds the contents of the cell pointed to by the contents of A to the accumulator. This gives a large address space, 2^n where n = word length. It may be nested, multilevel, or cascaded, e.g. EA = (((A))). Multiple memory accesses are needed to find the operand, hence it is slower.

40 Indirect Addressing Diagram
[Diagram: the instruction's address field A points to a memory cell holding a pointer, which in turn points to the operand.]

41 CISC Disadvantages
Designers soon realised that the CISC philosophy had its own problems, including: Earlier generations of a processor family were generally contained as a subset in every new version, so the instruction set and chip hardware became more complex with each generation. So that as many instructions as possible could be stored in memory with the least possible wasted space, individual instructions could be of almost any length; this means that different instructions take different amounts of clock time to execute, slowing down the overall performance of the machine. Many specialized instructions aren't used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program. CISC instructions typically set the condition codes as a side effect of the instruction; not only does setting the condition codes take time, but programmers also have to remember to examine the condition code bits before a subsequent instruction changes them.

42 Examples of CISC
Examples of CISC processors: the VAX, the PDP-11, the Motorola 68K family, and Intel x86/Pentium CPUs.

