Multiprocessors. Advanced Computers Architecture, UNIT 4. Topics: Flynn's classification, vector computers, pipelining in vector computers, Cray, multiprocessor interconnection, general-purpose multiprocessors, data flow computers.
The Big Picture: Where Are We Now? The major issue is this: we've taken copies of the contents of main memory and put them in caches closer to the processors. But what happens to those copies if someone else wants to use the main memory data? How do we keep all copies of the data in sync with each other?
The Multiprocessor Picture. Example: Pentium system organization, with a processor/memory bus, a PCI bus, and I/O busses.
CS 284a, 7 October 97, Copyright (c) John Thornley. Why Buy a Multiprocessor? Multiple users. Multiple applications. Multitasking within an application. Responsiveness and/or throughput.
Multiprocessor Architectures. Message-Passing Architectures – Separate address space for each processor. – Processors communicate via message passing. Shared-Memory Architectures – Single address space shared by all processors. – Processors communicate by memory read/write. – SMP or NUMA. – Cache coherence is an important issue.
Message-Passing Architecture (diagram): each node pairs a processor and cache with its own private memory; nodes communicate over an interconnection network.
Shared-Memory Architecture (diagram): processors 1..N, each with a cache, share memories 1..M through an interconnection network.
Shared-Memory Architecture: SMP and NUMA. SMP = Symmetric Multiprocessor – All memory is equally close to all processors. – Typical interconnection network is a shared bus. – Easier to program, but doesn't scale to many processors. NUMA = Non-Uniform Memory Access – Each memory is closer to some processors than to others. – a.k.a. Distributed Shared Memory. – Typical interconnection is a grid or hypercube. – Harder to program, but scales to more processors.
Shared Memory Multiprocessor. Each processor has its own registers and caches; all processors reach memory, disk, and other I/O through a common chipset. Memory: centralized with Uniform Memory Access time (UMA) and bus interconnect, I/O. Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro.
Shared Memory Multiprocessor. Several processors share one address space – conceptually a shared memory – often implemented just like a multicomputer, with the address space distributed over private memories. Communication is implicit – read and write accesses to shared memory locations. Synchronization – via shared memory locations: spin-waiting for non-zero – barriers. Conceptual model: processors P connected to a memory M over a network/bus.
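The "spin waiting for non-zero" idea above can be sketched in a few lines. This is an illustrative sketch only (plain Python variables stand in for shared memory words, and CPython's interpreter lock hides the memory-ordering issues real hardware has); the names `producer`, `consumer`, and `flag` are hypothetical.

```python
import threading

flag = 0   # shared "memory location" used for synchronization
data = 0   # shared data the consumer wants

def producer():
    global flag, data
    data = 42        # write the shared data first
    flag = 1         # then set the flag (the non-zero value the reader spins on)

def consumer(result):
    while flag == 0: # spin-wait: poll the shared location until non-zero
        pass
    result.append(data)

result = []
t1 = threading.Thread(target=consumer, args=(result,))
t2 = threading.Thread(target=producer)
t1.start(); t2.start()
t1.join(); t2.join()
print(result[0])
```

On real shared-memory hardware the same pattern needs atomic or fenced accesses so the write to `data` is visible before the write to `flag`.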
Message Passing Multicomputers. Computers (nodes) connected by a network – Fast network interface: send, receive, barrier – Nodes no different from a regular PC or workstation. Cluster of conventional workstations or PCs with a fast network – cluster computing – Berkeley NOW – IBM SP2. Each node: a processor P with its own memory M, attached to the network.
Large-Scale MP Designs. Memory: distributed with nonuniform memory access time (NUMA) and scalable interconnect (distributed memory). Goals: low latency, high reliability; access latencies on the order of 40 to 100 cycles.
Shared Memory Architectures. In this section we will examine the issues around: sharing one memory space among several processors; maintaining coherence among several copies of a data item.
The Problem of Cache Coherency. Each diagram shows a CPU whose cache holds copies A' and B' of memory locations A and B, plus an I/O device. a) Cache and memory coherent: A' = A, B' = B. b) After the CPU writes A' (output of A gives 100): cache and memory incoherent, A' ≠ A. c) After the I/O device writes memory (input 440 to B): cache and memory incoherent, B' ≠ B.
Some Simple Definitions

Mechanism | How It Works | Performance | Coherency Issues
Write Back | Write modified data from cache to memory only when necessary. | Good, because it doesn't tie up memory bandwidth. | Can have problems with various copies containing different values.
Write Through | Write modified data from cache to memory immediately. | Not so good - uses a lot of memory bandwidth. | Modified values always written to memory; data always matches.
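The two write policies above can be sketched with a toy model. This is an illustrative sketch, not any real protocol: dicts stand in for a cache and main memory, and the helper names (`write_through`, `write_back`, `evict`) are hypothetical.

```python
memory = {"A": 100}  # toy main memory

def write_through(cache, addr, value):
    cache[addr] = value    # update the cached copy...
    memory[addr] = value   # ...and memory immediately: copies always match

def write_back(cache, addr, value):
    cache[addr] = value    # update only the cached copy; memory is now stale
    cache.setdefault("_dirty", set()).add(addr)

def evict(cache, addr):
    if addr in cache.get("_dirty", set()):  # write back only when necessary
        memory[addr] = cache[addr]
        cache["_dirty"].discard(addr)

wt_cache, wb_cache = {}, {}
write_through(wt_cache, "A", 200)
print(memory["A"])   # memory updated immediately

memory["A"] = 100
write_back(wb_cache, "A", 300)
print(memory["A"])   # still the old value: cache and memory incoherent
evict(wb_cache, "A")
print(memory["A"])   # written back on eviction
```

The middle print is exactly the incoherence case from the table: the cached copy and memory disagree until the dirty line is written back.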
What Does Coherency Mean? Informally: – Any read must return the most recent write – Too strict and too difficult to implement. Better: – Any write must eventually be seen by a read – All writes are seen in proper order (serialization). Two rules to ensure this: – If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart – Writes to a single location are serialized: seen in one order. The latest write will be seen; otherwise a processor could see writes in an illogical order (an older value after a newer value).
Vector Computers. Vector processing overview. Vector metrics, terms. Greater efficiency than superscalar processors. Examples – CRAY-1 (1976, 1979), the 1st vector-register supercomputer – Multimedia extensions to high-performance PC processors – Modern multi-vector-processor supercomputer: NEC ESS. Design features of vector supercomputers. Conclusions.
Vector Arithmetic Execution. Use a deep pipeline (=> fast clock) to execute element operations. Control of the deep pipeline is simple because the elements of a vector are independent (=> no hazards!). Example: V3 <- V1 * V2 through a six-stage multiply pipeline.
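What a vector multiply computes can be sketched directly: each result element depends only on the corresponding input elements, which is why the pipeline has no hazards. Plain Python lists stand in for the vector registers V1, V2, V3 (real registers hold 64 elements; four are shown here).

```python
# Vector registers (illustrative 4-element versions of 64-element registers)
V1 = [1.0, 2.0, 3.0, 4.0]
V2 = [10.0, 20.0, 30.0, 40.0]

# V3 <- V1 * V2: one "instruction", N independent element operations
V3 = [a * b for a, b in zip(V1, V2)]
print(V3)  # [10.0, 40.0, 90.0, 160.0]
```

Because element i never reads element j's result, the hardware can stream one element pair into the multiply pipeline every clock.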
Vector Instruction Set Advantages. Compact – one short instruction encodes N operations => N * FLOP bandwidth. Expressive, tells hardware that these N operations: – are independent – use the same functional unit – access disjoint registers – access registers in the same pattern as previous instructions – access a contiguous block of memory (unit-stride load/store) OR access memory in a known pattern (strided load/store). Scalable – can run the same object code on more parallel pipelines, or lanes.
Properties of Vector Processors. Each result is independent of previous results => long pipeline, compiler ensures no dependencies => high clock rate. Vector instructions access memory with a known pattern => highly interleaved memory => amortize memory latency over 64-plus elements => no (data) caches required! (but use an instruction cache). Reduces branches and branch problems in pipelines. A single vector instruction implies lots of work (an entire loop) => fewer instruction fetches.
Supercomputers. Definitions of a supercomputer: the fastest machine in the world at a given task; a device to turn a compute-bound problem into an I/O-bound problem; any machine costing $30M+; any machine designed by Seymour Cray. The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.
Supercomputer Applications. Typical application areas: military research (nuclear weapons, cryptography), scientific research, weather forecasting, oil exploration, industrial design (car crash simulation). All involve huge computations on large data sets. In the 70s-80s, supercomputer = vector machine.
Vector Supercomputers Advanced Computers Architecture, UNIT 4 Epitomized by Cray-1, 1976: Scalar Unit + Vector Extensions Load/Store Architecture Vector Registers Vector Instructions Hardwired Control Highly Pipelined Functional Units Interleaved Memory System No Data Caches No Virtual Memory
Cray-1 (1976) Advanced Computers Architecture, UNIT 4
Cray-1 datapath (diagram): single-port memory, 16 banks of 64-bit words + 8-bit SECDED; 80 MW/sec data load/store; 320 MW/sec instruction-buffer refill. 4 instruction buffers (64-bit x 16) feeding the NIP, LIP, and CIP. 64 T registers and 64 B registers, addressed as (A0) and ((Ah) + jkm). 8 scalar registers S0-S7 and 8 address registers A0-A7. Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul. 8 vector registers V0-V7 of 64 elements each, with vector mask and vector length registers. Memory bank cycle 50 ns; processor cycle 12.5 ns (80 MHz).
Vector Memory System. An address generator produces the access stream Base + i * Stride from a base and a stride, feeding the vector registers from interleaved memory banks. Cray-1: 16 banks, 4-cycle bank busy time, 12-cycle latency. Bank busy time: cycles between accesses to the same bank.
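The bank interleaving above is easy to sketch: with 16 banks, word address a lives in bank a mod 16, so a unit-stride stream touches every bank in turn, while a stride that is a multiple of 16 hammers one bank and stalls on the bank busy time. The helper name `banks_touched` is illustrative.

```python
NUM_BANKS = 16  # Cray-1 bank count

def banks_touched(base, stride, n):
    """Bank hit by each of n accesses in the stream base + i*stride."""
    return [(base + i * stride) % NUM_BANKS for i in range(n)]

unit = banks_touched(0, 1, 16)    # unit stride: each access a new bank
bad  = banks_touched(0, 16, 16)   # stride 16: every access the same bank

print(len(set(unit)))  # distinct banks used -> full bandwidth
print(len(set(bad)))   # one bank -> serialized by the 4-cycle busy time
```

This is why strided load/store performance depends on the stride's relationship to the bank count, not just on the number of elements moved.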
Vector Instruction Execution. ADDV C,A,B (diagram): execution using one pipelined functional unit, with one (A, B) element pair entering the pipeline per cycle, versus execution using four pipelined functional units (lanes), with four element pairs entering per cycle.
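The four-lane case can be sketched as a round-robin distribution of elements: because the element operations are independent, element i can simply go to lane i mod 4 and the lanes never need to communicate. The names `NUM_LANES` and `lanes` are illustrative.

```python
NUM_LANES = 4
A = list(range(8))        # toy 8-element vectors
B = list(range(8, 16))

# Which elements each lane processes (round-robin assignment)
lanes = [[] for _ in range(NUM_LANES)]
for i in range(len(A)):
    lanes[i % NUM_LANES].append(i)

# ADDV C,A,B: the result is the same regardless of lane assignment
C = [A[i] + B[i] for i in range(len(A))]
print(lanes[0])  # elements handled by lane 0
print(C[:4])
```

With four lanes the 64-element vector of the Cray finishes in roughly a quarter of the cycles, with no change to the object code: this is the "scalable" property from the earlier slide.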
History of Microprocessors. IBM instituted a research program. 1964: release of System/360. Mid-1970s: improved measurement tools demonstrated on CISC; a 32-bit RISC microprocessor (801) developed, led by Joel Birnbaum. 1984: MIPS developed at Stanford, as well as projects done at Berkeley. 1988: RISC processors had taken over the high end of the workstation market. Early 1990s: IBM's POWER (Performance Optimization With Enhanced RISC) architecture introduced with the RISC System/6k; AIM (Apple, IBM, Motorola) alliance formed, resulting in PowerPC.
What is CISC….? A complex instruction set computer (CISC, pronounced like "sisk") is a microprocessor instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction. The philosophy behind it is that hardware is always faster than software; therefore one should make a powerful instruction set which provides programmers with assembly instructions that do a lot in short programs. So the primary goal of CISC is to complete a task in as few lines of assembly as possible. Most common microprocessor designs, such as the Intel 80x86 and Motorola 68K series, followed the CISC philosophy.
Memory in those days was expensive: a bigger program -> more storage -> more money. Hence the need to reduce the number of instructions per program. The number of instructions is reduced by having multiple operations within a single instruction. Multiple operations lead to many different kinds of instructions that access memory, in turn making instruction length variable and fetch-decode-execute time unpredictable – making it more complex. Thus the hardware handles the complexity.
CISC philosophy. Use microcode: a simplified microcode instruction set controls the data path logic. This type of implementation is known as a microprogrammed implementation. Build rich instruction sets: a consequence of using a microprogrammed design is that designers could build more functionality into each instruction. Build high-level instruction sets: the logical next step was to build instruction sets which map directly from high-level languages.
Characteristics of a CISC design. Register-to-register, register-to-memory, and memory-to-register commands. Multiple addressing modes. Variable-length instructions, where the length often varies according to the addressing mode. Instructions which require multiple clock cycles to execute.
Addressing Modes Advanced Computers Architecture, UNIT 4 Immediate Direct Indirect Register Register Indirect Displacement (Indexed) Stack
Immediate Addressing. Operand is part of the instruction: operand = address field. e.g. ADD 5: add 5 to the contents of the accumulator; 5 is the operand. No memory reference to fetch data. Fast, but limited range. (Diagram: instruction = opcode + operand.)
Direct Addressing. Address field contains the address of the operand: effective address (EA) = address field (A). e.g. ADD A: add the contents of cell A to the accumulator; look in memory at address A for the operand. Single memory reference to access data. No additional calculations to work out the effective address. Limited address space.
Direct Addressing Diagram: instruction = opcode + address A; A indexes memory to fetch the operand.
Indirect Addressing. The memory cell pointed to by the address field contains the address of (a pointer to) the operand: EA = (A). Look in A, find address (A), and look there for the operand. e.g. ADD (A): add the contents of the cell pointed to by the contents of A to the accumulator. Large address space: 2^n, where n = word length. May be nested, multilevel, cascaded, e.g. EA = (((A))). Multiple memory accesses to find the operand, hence slower.
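The three addressing modes above can be contrasted in a toy simulation: a Python list stands in for memory, and each mode is a function from the instruction's address field to the operand. All names and the memory layout are illustrative.

```python
memory = [0] * 16
memory[5] = 99     # the operand value lives at address 5
memory[3] = 5      # address 3 holds a pointer to address 5

def immediate(field):
    return field                   # operand = address field itself

def direct(field):
    return memory[field]           # EA = A: one memory reference

def indirect(field):
    return memory[memory[field]]   # EA = (A): two memory references

print(immediate(7))   # the field is the operand
print(direct(5))      # fetch memory[5]
print(indirect(3))    # follow the pointer at address 3, then fetch
```

The extra `memory[...]` lookup in `indirect` is exactly the extra memory access that makes indirect addressing slower, and nesting it (EA = (((A)))) adds one lookup per level.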
CISC Disadvantages. Designers soon realised that the CISC philosophy had its own problems, including: Earlier generations of a processor family were generally contained as a subset in every new version, so the instruction set and chip hardware become more complex with each generation. So that as many instructions as possible could be stored in memory with the least possible wasted space, individual instructions could be of almost any length; this means that different instructions take different amounts of clock time to execute, slowing down the overall performance of the machine. Many specialized instructions aren't used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program. CISC instructions typically set the condition codes as a side effect of the instruction. Not only does setting the condition codes take time, but programmers have to remember to examine the condition code bits before a subsequent instruction changes them.
Examples - CISC. Examples of CISC processors are: VAX, PDP-11, the Motorola 68K family, and Intel x86/Pentium CPUs.