
Slide 1: Part VII Advanced Architectures

Slide 2: About This Presentation
This presentation is intended to support the use of the textbook Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005, ISBN 0-19-515455-X. It is updated regularly by the author as part of his teaching of the upper-division course ECE 154, Introduction to Computer Architecture, at the University of California, Santa Barbara. Instructors can use these slides freely in classroom teaching and for other educational purposes. Any other use is strictly prohibited. © Behrooz Parhami
Edition: First. Released: July 2003. Revised: July 2004, July 2005.

Slide 3: VII Advanced Architectures
Performance enhancement beyond what we have seen:
What else can we do at the instruction execution level?
Data parallelism: vector and array processing
Control parallelism: parallel and distributed processing
Topics in This Part:
Chapter 25 Road to Higher Performance
Chapter 26 Vector and Array Processing
Chapter 27 Shared-Memory Multiprocessing
Chapter 28 Distributed Multicomputing

Slide 4: 25 Road to Higher Performance
Review past, current, and future architectural trends:
General-purpose and special-purpose acceleration
Introduction to data and control parallelism
Topics in This Chapter:
25.1 Past and Current Performance Trends
25.2 Performance-Driven ISA Extensions
25.3 Instruction-Level Parallelism
25.4 Speculation and Value Prediction
25.5 Special-Purpose Hardware Accelerators
25.6 Vector, Array, and Parallel Processing

Slide 5: 25.1 Past and Current Performance Trends
Architectural method (improvement factor):
1. Pipelining (and superpipelining): 3-8 (established; previously discussed)
2. Cache memory, 2-3 levels: 2-5 (established; previously discussed)
3. RISC and related ideas: 2-3 (established; previously discussed)
4. Multiple instruction issue (superscalar): 2-3 (established; previously discussed)
5. ISA extensions (e.g., for multimedia): 1-3 (established; previously discussed)
6. Multithreading (super-, hyper-): 2-5 (newer; covered in Part VII)
7. Speculation and value prediction: 2-3 (newer; covered in Part VII)
8. Hardware acceleration: 2-10 (newer; covered in Part VII)
9. Vector and array processing: 2-10 (newer; covered in Part VII)
10. Parallel/distributed computing: 2-1000s (newer; covered in Part VII)
Available computing power ca. 2000: GFLOPS on desktop, TFLOPS in supercomputer center, PFLOPS on drawing board.
Computer performance grew by a factor of about 10,000 between 1980 and 2000: 100 due to faster technology, 100 due to better architecture.

Slide 6: Peak Performance of Supercomputers
[Chart: peak performance grew about 10x every 5 years, from GFLOPS ca. 1980 to PFLOPS ca. 2010; machines shown include Cray X-MP, Cray 2, TMC CM-2, TMC CM-5, Cray T3D, ASCI Red, ASCI White Pacific, and the Earth Simulator.]
Dongarra, J., "Trends in High Performance Computing," Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]

Slide 7: Energy Consumption Is Getting Out of Hand
Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs.

Slide 8: Energy Reduction Methods
Clock gating: turn off clock signals that go to unused parts of a chip
Asynchronous design: doing away with clocking and the attendant energy waste
Voltage reduction: slower, but uses less energy

Slide 9: Examples
Intel XScale processor: runs at 100 MHz to 1 GHz, with 30 mW of power
System on a chip: wearable computers, augmented reality

Slide 10: Examples
Power = k × voltage²: use two low-voltage processors instead of one high-voltage one, hence chip multiprocessors
Cell phones: two processors, one low-power but computationally weak, plus a digital signal processor

Slide 11: 25.2 Performance-Driven ISA Extensions
Adding instructions that do more work per cycle:
Shift-add: replace two instructions with one (e.g., multiply by 5)
Multiply-add: replace two instructions with one (x := c + a × b)
Multiply-accumulate: reduce round-off error (s := s + a × b)
Conditional copy: to avoid some branches (e.g., in if-then-else)
Subword parallelism (for multimedia applications):
Intel MMX (multimedia extension): 64-bit registers can hold multiple integer operands
Intel SSE (streaming SIMD extension): 128-bit registers can hold several floating-point operands
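As a rough illustration, here is what each of these fused operations computes, written as plain C functions (the names are ours, not actual ISA mnemonics; in hardware each is a single instruction):

    #include <stdint.h>

    int32_t shift_add5(int32_t a) { return (a << 2) + a; }  /* multiply by 5 via shift-add */
    int32_t multiply_add(int32_t c, int32_t a, int32_t b) { return c + a * b; }
    /* multiply-accumulate: hardware fuses the two steps, reducing round-off */
    double multiply_accumulate(double s, double a, double b) { return s + a * b; }
    /* conditional copy: a branch-free select */
    int32_t conditional_copy(int cond, int32_t x, int32_t y) { return cond ? x : y; }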

Slide 12: Multimedia Extensions
Workload shifts from data crunching to multimedia
New instructions
Compatibility issues and costs lead to compromises "that prevented the full benefits of such extensions from materializing"

Slide 13: New Instructions
Subword operands: 8-bit pixels, 16-bit audio samples
MMX added 57 instructions to allow parallel subword arithmetic operations

Slide 14: MMX
The eight floating-point registers "double" as the MMX integer register file (a compatibility cost)
A 64-bit register can be viewed as holding a single 64-bit integer or a vector of 2, 4, or 8 narrower integer operands

Slide 15: Parallel Pack and Unpack
Unpack low: given 8-element vectors xxxxabcd and xxxxefgh, the low halves are interleaved to yield aebfcgdh
Packing does the reverse, converting back to narrower elements
Unpacking can be used to increase precision (elements widen)
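A minimal C sketch of the unpack-low operation just described; the element ordering follows the aebfcgdh example, and the array layout (element 0 first, low half in positions 4-7) is an assumption for illustration:

    #include <stdint.h>

    /* Interleave the low halves (abcd and efgh) of two 8-byte vectors. */
    void unpack_low(const uint8_t u[8], const uint8_t v[8], uint8_t out[8]) {
        for (int i = 0; i < 4; i++) {
            out[2*i]     = u[4 + i];   /* a, b, c, d from the first vector */
            out[2*i + 1] = v[4 + i];   /* e, f, g, h from the second vector */
        }
    }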

Slide 16: Saturating Arithmetic
With ordinary wraparound arithmetic, overflow produces a sum that is smaller than either operand, e.g., 0b1100 + 0b1011 = 0b0111 in 4 bits
In an image, this could show up as a bright spot in a dark region
Saturating arithmetic instead clamps at the extremes, using special rules to deal with "infinity"
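A sketch of a saturating add for unsigned 8-bit pixels, the per-element behavior MMX's saturating parallel add provides (plain C, one element at a time):

    #include <stdint.h>

    /* Clamp at 255 instead of wrapping around. In 4-bit wraparound arithmetic,
       0b1100 + 0b1011 = 0b0111; saturation would instead pin the result at the
       maximum representable value. */
    uint8_t sat_add_u8(uint8_t a, uint8_t b) {
        uint16_t sum = (uint16_t)a + b;          /* compute in a wider type */
        return (sum > 255) ? 255 : (uint8_t)sum;
    }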

Slide 17: Intel MMX ISA Extension (Table 25.1)
Class: Copy
- Register copy (32 bits): integer register ↔ MMX register
- Parallel pack (vector of 4, 2; saturate): convert to narrower elements
- Parallel unpack low (8, 4, 2): merge lower halves of 2 vectors
- Parallel unpack high (8, 4, 2): merge upper halves of 2 vectors
Class: Arithmetic
- Parallel add (8, 4, 2; wrap/saturate): add; inhibit carry at element boundaries
- Parallel subtract (8, 4, 2; wrap/saturate): subtract with carry inhibition
- Parallel multiply low (4): multiply, keep the 4 low halves
- Parallel multiply high (4): multiply, keep the 4 high halves
- Parallel multiply-add (4): multiply, add adjacent products
- Parallel compare equal (8, 4, 2): all 1s where equal, else all 0s
- Parallel compare greater (8, 4, 2): all 1s where greater, else all 0s
Class: Shift
- Parallel left shift logical (4, 2, 1): shift left, respect boundaries
- Parallel right shift logical (4, 2, 1): shift right, respect boundaries
- Parallel right shift arithmetic (4, 2): arithmetic shift within each (half)word
Class: Logic
- Parallel AND (1): bitwise dest := src1 AND src2
- Parallel ANDNOT (1): bitwise dest := (NOT src1) AND src2
- Parallel OR (1): bitwise dest := src1 OR src2
- Parallel XOR (1): bitwise dest := src1 XOR src2
Class: Memory access
- Parallel load MMX reg (32 or 64 bits): address given in integer register
- Parallel store MMX reg (32 or 64 bits): address given in integer register
Class: Control
- Empty FP tag bits: required for compatibility
(The vector column gives the number of elements operated on in parallel.)

Slide 18: MMX Multiplication and Multiply-Add
Figure 25.2 Parallel multiplication and multiply-add in MMX.

Slide 19: MMX Parallel Comparisons
Figure 25.3 Parallel comparisons in MMX.

Slide 20: 25.3 ILP Under Ideal Conditions!
Figure 25.4 Available instruction-level parallelism and the speedup due to multiple instruction issue in superscalar processors [John91].

Slide 21: ILP Inefficiencies
Diminishing returns due to:
Slowing of instruction issue caused by extreme circuit complexity
Wasting resources on instructions that will not be executed due to branching

Slide 22: Instruction-Level Parallelism
Figure 25.5 A computation with inherent instruction-level parallelism.

Slide 23: VLIW and EPIC Architectures
VLIW: very long instruction word architecture
EPIC: explicitly parallel instruction computing
Figure 25.6 Hardware organization for IA-64. General and floating-point registers are 64 bits wide; predicates are single-bit registers.

Slide 24: A Better Approach?
Allow the programmer or compiler to specify which operations are capable of being executed at the same time
More reasonable than hiding the concurrency in sequential code and then requiring the hardware to rediscover it

Slide 25: Itanium
Long instructions "bundle" three conventional RISC-type instructions ("syllables") of 41 bits each, plus a 5-bit "template" holding scheduling information (128 bits in all)
The template specifies the instruction types and indicates concurrency

Slide 26: 41-bit Instruction
4-bit major opcode
6-bit predicate specification
31 bits for operands and modifiers: 3 registers can be specified (21 bits)
The results are committed only if the specified predicate is 1: predicated instruction execution
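A C analogue of predicated execution, as a sketch of the idea rather than IA-64 code: a compare sets the predicate, both candidate results are computed unconditionally, and the predicate selects what is committed, so no branch is taken:

    int abs_predicated(int x) {
        int p = (x < 0);      /* compare instruction sets the predicate */
        int neg = -x;         /* executed regardless of the predicate */
        return p ? neg : x;   /* commit the result only where the predicate is 1 */
    }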

Slide 27: Additional Features
Control and data speculation
Software pipelining for loops
Threading: interleaved multithreading, blocked multithreading, simultaneous multithreading

Slide 28: Interleaved Multithreading
Instruction logic picks instructions from the different threads in round-robin fashion
This spaces out the instructions from the same thread, avoiding bubbles due to data or control dependencies

Slide 29: Blocked Multithreading
Deals with only one thread until it hits a snag, e.g., a cache miss
Fills the long stalls of one thread with useful instructions from another thread

Slide 30: Simultaneous Multithreading
The most flexible strategy
Allows mixing and matching of instructions from different threads to make the best use of available resources

Slide 31: Speculation and Value Prediction
Itanium control speculation: one load becomes two, a speculative load (moved earlier) and a checking instruction
We may thus execute a load instruction that would not have been reached, due to a branch or a false predicate
The speculative load keeps a record of any exception rather than raising it immediately

Slide 32: Data Speculation
Allows a load instruction to be moved ahead of a store
When the speculative load is performed, its address is recorded and checked against the addresses of subsequent write operations

Slide 33: Other Forms of Speculation
Issuing instructions from both paths of a branch instruction
Adding microarchitectural registers for temporary values
Value prediction: store the operands and results of previous divide instructions in a memo table

Slide 34: Continued
Instruction reuse
Superspeculation: predict the outcome of a sequence of instructions, taking patterns into account; reported success rates range from 50% to 98%
Predicting the next address in a load such as load (r2)

Slide 35: 25.4 Speculation and Value Prediction
Figure 25.7 Examples of software speculation in IA-64.

Slide 36: Value Prediction
Figure 25.8 Value prediction for multiplication or division via a memo table.

Slide 37: 25.5 Special-Purpose Hardware Accelerators
Figure 25.9 General structure of a processor with configurable hardware accelerators.

Slide 38: Graphics Processors, Network Processors, etc.
Figure 25.10 Simplified block diagram of Toaster2, Cisco Systems' network processor.

Slide 39: Network Processor
Arriving traffic can overwhelm a general-purpose processor
Network processors have specialized instruction sets and other capabilities:
Instructions for bit-mapping, tree search, and cyclic redundancy check calculations
Instruction-level parallelism, microengines

Slide 40: 25.6 Vector, Array, and Parallel Processing
Figure 25.11 The Flynn-Johnson classification of computer systems.

Slide 41: SIMD Architectures
Data parallelism: executing one operation on multiple data streams
Concurrency in time: vector processing
Concurrency in space: array processing
Example to provide context: multiplying a coefficient vector by a data vector (e.g., in filtering), y[i] := c[i] × x[i], 0 ≤ i < n
Sources of performance improvement in vector processing (details in the first half of Chapter 26):
One instruction is fetched and decoded for the entire operation
The multiplications are known to be independent (no checking)
Pipelining/concurrency in memory access as well as in arithmetic
Array processing is similar (details in the second half of Chapter 26)
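As a concrete sketch of the y[i] := c[i] × x[i] example, here is the loop written with Intel SSE intrinsics (the 128-bit registers of Slide 11); it assumes n is a multiple of 4, with a scalar loop handling any remainder:

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Four single-precision multiplies per instruction. */
    void vec_mul(float *y, const float *c, const float *x, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 cv = _mm_loadu_ps(&c[i]);
            __m128 xv = _mm_loadu_ps(&x[i]);
            _mm_storeu_ps(&y[i], _mm_mul_ps(cv, xv));
        }
    }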

Slide 42: MISD Architecture Example
Figure 25.12 Multiple instruction streams operating on a single data stream (MISD).

Slide 43: MIMD Architectures
Control parallelism: executing several instruction streams in parallel
GMSV: shared global memory (symmetric multiprocessors)
DMSV: shared distributed memory (asymmetric multiprocessors)
DMMP: message passing (multicomputers)
(See Figure 27.1, centralized shared memory, and Figure 28.1, distributed memory.)

Slide 44: Amdahl's Law Revisited
Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.
s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)
f = sequential fraction; p = speedup of the rest with p processors
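The formula in C, with one worked value as a sanity check:

    /* Amdahl's law: fraction f is sequential; the rest runs p times as fast. */
    double amdahl_speedup(double f, double p) {
        return 1.0 / (f + (1.0 - f) / p);
    }
    /* Example: f = 0.05, p = 100 gives about 16.8, well below both p = 100
       and the asymptotic limit 1/f = 20. */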

Slide 45: 26 Vector and Array Processing
Single instruction stream operating on multiple data streams
Data parallelism in time = vector processing
Data parallelism in space = array processing
Topics in This Chapter:
26.1 Operations on Vectors
26.2 Vector Processor Implementation
26.3 Vector Processor Performance
26.4 Shared-Control Systems
26.5 Array Processor Implementation
26.6 Array Processor Performance

Slide 46: 26.1 Operations on Vectors
Sequential processor:
for i = 0 to 63 do
  P[i] := W[i] × D[i]
endfor
Vector processor:
load W
load D
P := W × D
store P
By contrast, the following loop is unparallelizable (each iteration depends on the previous one):
for i = 0 to 63 do
  X[i+1] := X[i] + Z[i]
  Y[i+1] := X[i+1] + Y[i]
endfor

Slide 47: First Example
Overlap load and store: pipeline chaining
http://www.pcc.qub.ac.uk/tec/courses/cray/stu-notes/CRAY-completeMIF_3.html

Slide 48: Second Example
for (i = 0; i < 64; i++) {
  A[i] = A[i] + B[i];
  B[i+1] = C[i] + D[i];
}
Data dependency: B[i+1], written in iteration i, is read (as B[i]) in iteration i+1
Exercise

Slide 49: Useful Vector Operations
A = c × B
A = B × B
s = summation(A)

Slide 50: Three Key Features
Deeply pipelined units for high throughput
Plenty of long vector registers
High-bandwidth data transfer paths between memory and vector registers
Also helpful: pipeline chaining, i.e., forwarding result values from one unit to another

Slide 51: 26.2 Vector Processor Implementation
Figure 26.1 Simplified generic structure of a vector processor.

Slide 52: The Memory Bandwidth Problem
The most challenging aspect of vector processor design (textbook p. 495)
No cache; memory is highly interleaved (Section 17.4)
64-256 memory banks are not unusual
For peak performance, a memory bank must not be accessed again until all other banks have been accessed

Slide 53: Continued
No problem for sequential, stride-1 access
Any stride that is relatively prime to the interleaving width is also no problem
Unfortunately, common patterns break this: e.g., a 64 × 64 matrix stored in row-major order and accessed by columns has stride 64, so every access hits the same bank

Slide 54: Continued
Skewed storage scheme: elements of row i are rotated to the right by i positions
The effective stride becomes 65, which is relatively prime to 64
Adds a small overhead due to a more complex address translation
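A sketch of the bank computation, assuming 64 banks and the rotate-row-i-by-i skew just described:

    /* With row-major storage in 64-way interleaved memory, element (i, j) of a
       64 x 64 matrix lands in bank j, so a column access hits one bank 64 times.
       Rotating row i right by i positions moves (i, j) to bank (i + j) mod 64,
       so every element of a column falls in a distinct bank. */
    int bank_of(int i, int j) {
        return (i + j) % 64;
    }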

Slide 55: Conflict-Free Memory Access
Figure 26.2 Skewed storage of the elements of a 64 × 64 matrix for conflict-free memory access in a 64-way interleaved memory. Elements of column 0 are highlighted in both diagrams.

Slide 56: Vector Machine Instructions
Similar in spirit to MMX instructions, except that MMX vectors are short and fixed in length, while vector machine vectors are long and variable in length (set by a parameter)
Typical non-arithmetic instructions: vector load and store with implicit or explicitly specified stride

Slide 57: Continued
Vector load and store with memory addresses specified in an index vector
Vector comparisons on an element-by-element basis
Data copying among registers, including vector-length and mask registers (mask registers allow selectivity)

Slide 58: Overlapped Memory Access and Computation
Figure 26.3 Vector processing via segmented load/store of vectors in registers in a double-buffering scheme. Solid (dashed) lines show data flow in the current (next) segment.

Slide 59: Vector Processor Performance
Ideal performance is only theoretical
Start-up time (a few to many tens of cycles) becomes negligible only when vectors are very long
Half-performance vector length: the vector length at which performance is half that for infinitely long vectors; typically tens to a hundred elements

Slide 60: Continued
Resource idle time and contention: for S = V × W + X × Y with one multiply unit, the adder goes unused while waiting for products
Memory bank contention: severe when vector elements are accessed in irregular positions (index vectors)

Slide 61: Continued
Vector segmentation: the entire vector does not fit in a vector register, e.g., a vector of length 65!
Conditional computations: a mask determines selection

Slide 62: Conditional Computation
Example: if (A(i)) … else …
Say the mask is 110101, i.e., true for elements 0, 1, 3, and 5
Two stages: in the first, elements 0, 1, 3, 5 execute the "if" part; in the second, elements 2 and 4 execute the "else" part
Computation time roughly doubles
What about a switch statement?

Slide 63: Multilane Vector Processors
With 4 lanes, lane 0 handles elements 0, 4, 8, …
Four pipelined adders operate on the registers in parallel
Corresponds to multiple lanes on a highway

Slide 64: 26.3 Vector Processor Performance
Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining.

Slide 65: Performance as a Function of Vector Length
Figure 26.5 The per-element execution time in a vector processor as a function of the vector length.

Slide 66: 26.4 Shared-Control Systems
Figure 26.6 From completely shared control to totally separate controls.

Slide 67: Simple Shared-Control SIMD
Array processors: centralized control, with memory and processors distributed in space

Slide 68: Example Array Processor
Figure 26.7 Array processor with 2D torus interprocessor communication network.

Slide 69: Memory Bandwidth
Ceases to be a major bottleneck: a vector of length m can be distributed to the local memories of p processors, with each one receiving m/p elements
Designs range from 64-256 very fast processors to as many as 64K simple processors

Slide 70: Downside
Vector operations that require interprocessor communication:
Summation
Products such as X[i] × Y[m-1-i], where processor i needs information from processor m-1-i

Slide 71: Pattern of Connectivity
Linear array (ring): each processor is connected to two neighbors
Highly sophisticated networks
Square or rectangular 2D mesh; a torus additionally connects the endpoints of rows and columns

Slide 72: Array Processor Implementation
Assume a 2D torus interconnection
Switches allow data to flow around the row loops in the clockwise or counterclockwise direction
The control unit resembles a scalar processor: fetch, decode, broadcast control signals

Slide 73: Continued
Register file, ALU, data memory
New: branching based on the state of the processor array, using a single bit of information from each processor

Slide 74: Continued
Array state register; population count instruction
Processors within the array are bare-bones: no fetch, no decode
Processing elements (PEs): register file, ALU, memory

Slide 75: Continued
Some local autonomy: say we load using r2 as a pointer; different processors may have different values in r2

Slide 76: 26.5 Array Processor Implementation
Figure 26.8 Handling of interprocessor communication via a mechanism similar to data forwarding.

Slide 77: Configuration Switches
Figure 26.9 I/O switch states in the array processor of Figure 26.7.

Slide 78: Array Processor Performance
Signal travel time
Embarrassingly parallel applications: signal and image processing; database management and data mining; data fusion in command and control
Data distribution: which processor has what in its memory?

Slide 79: 26.6 Array Processor Performance
Array processors perform well for the same class of problems that are suitable for vector processors
For embarrassingly (pleasantly) parallel problems, array processors can be faster and more energy-efficient than vector processors
A criticism of array processing: for conditional computations, a significant part of the array remains idle while the "then" part is performed; subsequently, idle and busy processors reverse roles during the "else" part
However: considering array processors inefficient due to idle processors is like criticizing mass transportation because many seats are unoccupied most of the time
It's the total cost of computation that counts, not hardware utilization!

Slide 80: 27 Shared-Memory Multiprocessing
Multiple processors sharing a memory unit seems naïve
Didn't we conclude that memory is the bottleneck? How then does it make sense to share the memory?
Topics in This Chapter:
27.1 Centralized Shared Memory
27.2 Multiple Caches and Cache Coherence
27.3 Implementing Symmetric Multiprocessors
27.4 Distributed Shared Memory
27.5 Directories to Guide Data Access
27.6 Implementing Asymmetric Multiprocessors

Slide 81: Robert Tomasulo, Looking Forward (1998)
"Technology advances in both hardware and software have rendered most Instruction Set Architecture conflicts moot. … Computer design is focusing more on the memory bandwidth and multiprocessing interconnection problems."

Slide 82: Parallel Processing as a Topic of Study
Graduate course ECE 254B: Advanced Computer Architecture – Parallel Processing
An important area of study that allows us to overcome fundamental speed limits
Our treatment of the topic is quite brief (Chapters 26-27)

Slide 83: Memory
It is the speed of memory, not the processor, that limits performance
Why does it make sense to expect a single memory unit to keep up with several processors?
Provide a private cache for each processor; but then cache consistency becomes the issue!

Slide 84: 27.1 Centralized Shared Memory
Figure 27.1 Structure of a multiprocessor with centralized shared memory.

Slide 85: Two Extremes
A single shared bus
A permutation network that allows any processor to read or write any memory module
Most networks fall between these extremes

Slide 86: Processor-to-Memory Connection
The most critical component
Must support conflict-free, high-speed access to memory by several processors at once
Bandwidth is more important than latency, due to multithreading and other methods that allow processors to do useful work while waiting for data

Slide 87: Multistage Interconnection Network
Multiple paths; large switches arranged in layers; cost/performance trade-offs
One of the oldest and most versatile is the butterfly network, a self-routing network

Slide 88: Butterfly Network
2^q rows and q + 1 columns of switches
Each switch in row r, column c, has a straight connection to the switch in the same row and a cross connection to the switch in row r ± 2^c
The unique path between source i and destination j can be determined from i and j (in binary)

Slide 89: Example
Say we want to go from row 2 to row 4, i.e., from 010 to 100
XOR the two numbers, getting 110
Read that routing tag backwards; each bit tells a stage whether to go straight or cross
This is not a permutation network
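The routing-tag computation as a one-liner (a sketch; the order in which the stages consume the tag bits depends on how the network is drawn):

    /* Self-routing in a butterfly: the tag is the XOR of the source and
       destination row numbers; a 0 bit means go straight, a 1 bit means take
       the cross link. For 010 -> 100 the tag is 110. */
    unsigned butterfly_tag(unsigned src_row, unsigned dst_row) {
        return src_row ^ dst_row;
    }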

Slide 90: Processor-to-Memory Interconnection Network
Figure 27.2 Butterfly and the related Beneš network as examples of processor-to-memory interconnection networks in a multiprocessor.

Slide 91: Processor-to-Memory Interconnection Network
Figure 27.3 Interconnection of eight processors to 256 memory banks in Cray Y-MP, a supercomputer with multiple vector processors.

Slide 92: Shared-Memory Programming: Broadcasting
Copy B[0] into all B[i] so that multiple processors can read its value without memory access conflicts:
for k = 0 to ⌈log₂ p⌉ – 1
  processor j, 0 ≤ j < p, do
    B[j + 2^k] := B[j]
endfor
Recursive doubling

Slide 93: Shared-Memory Programming: Summation
Sum reduction of vector X:
processor j, 0 ≤ j < p, do Z[j] := X[j]
s := 1
while s < p
  processor j, 0 ≤ j < p – s, do
    Z[j + s] := Z[j] + Z[j + s]
  s := 2 × s
endwhile
Recursive doubling
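A serial C simulation of this recursive-doubling sum; the inner loop is one parallel step on the real machine, and iterating j downward reproduces the "all processors read old values" semantics in serial code:

    /* After ceil(log2 p) doubling rounds, Z[p-1] holds the sum of all p inputs. */
    void reduce_sum(double Z[], int p) {
        for (int s = 1; s < p; s *= 2)            /* stride doubles each round */
            for (int j = p - 1 - s; j >= 0; j--)  /* one parallel step, simulated serially */
                Z[j + s] += Z[j];
    }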

Slide 94: Drawbacks
Hot-spot contention: shared data that are continually being accessed by many processors
The interconnection network becomes a bottleneck

Slide 95: Solution?
Reduce the frequency of accesses to main memory by equipping each processor with a private cache
A hit rate of 95% reduces traffic through the processor-to-memory network to 5%

Slide 96: 27.2 Multiple Caches and Cache Coherence
Private processor caches reduce memory access traffic through the interconnection network but lead to challenging consistency problems.

Slide 97: Status of Data Copies
Figure 27.4 Various types of cached data blocks in a parallel processor with centralized main memory and private processor caches.

Slide 98: Cache Coherence Algorithms
Designate each cache line as exclusive or shared
A processor may only modify an exclusive cache line
To modify a shared line, first change its status to exclusive, which invalidates all copies in other caches

Slide 99: Continued
This involves broadcasting the intention to modify the line
This is called a write-invalidate algorithm

Slide 100: Snoopy Protocol
A particular implementation of write-invalidate
Requires a shared bus; snoop control logic monitors the bus
Each cache line is in one of three states: shared, exclusive, or invalid

Slide 101: Continued
Transitions:
CPU read hits are no problem
CPU write hit to an exclusive line: no change
CPU write hit to a shared line: change the state to exclusive and signal a write miss to the other caches, causing them to invalidate their copies

Slide 102: Continued
CPU write miss to a line held in the exclusive state: write the line back to memory and change its state to invalid
A request for access to an exclusive cache line forces the owner to make sure that stale data from main memory are not used

Slide 103: A Snoopy Cache Coherence Protocol
Figure 27.5 Finite-state control mechanism for a bus-based snoopy cache coherence protocol with write-back caches.

Slide 104: 27.3 Implementing Symmetric Multiprocessors
Figure 27.6 Structure of a generic bus-based symmetric multiprocessor.

Slide 105: Bus Bandwidth Limits Performance
Example 27.1: Consider a shared-memory multiprocessor built around a single bus with a data bandwidth of x GB/s. Instructions and data words are 4 B wide, and each instruction requires access to an average of 1.4 memory words (including the instruction itself). The combined hit rate for caches is 98%. Compute an upper bound on the multiprocessor performance in GIPS. Address lines are separate and do not affect the bus data bandwidth.
Solution: Executing an instruction implies a bus transfer of 1.4 × 0.02 × 4 = 0.112 B. Thus, an absolute upper bound on performance is x/0.112 = 8.93x GIPS. Assuming a bus width of 32 B, no bus cycle or data going to waste, and a bus clock rate of y GHz, the performance bound becomes 286y GIPS. This bound is highly optimistic: buses operate in the range 0.1 to 1 GHz, so a performance level approaching 1 TIPS (perhaps even ¼ TIPS) is beyond reach with this type of architecture.
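The arithmetic of Example 27.1, packaged as a small C helper (the parameter values are the example's assumptions):

    /* Upper bound on GIPS for a bus of the given bandwidth: each instruction
       moves 1.4 words x 2% miss rate x 4 B/word = 0.112 B over the bus. */
    double gips_bound(double bus_gb_per_s) {
        double bytes_per_instr = 1.4 * 0.02 * 4.0;   /* = 0.112 B */
        return bus_gb_per_s / bytes_per_instr;       /* e.g., x = 1 -> 8.93 GIPS */
    }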

Slide 106: Split-Transaction Protocol
The bus is released between the transmission of the memory address and the return of data
There may be several outstanding memory requests in progress at the same time

Slide 107: Snoop-Based Coherence
Assume one level of cache for each processor
An atomic bus imposes a total order on transactions

Slide 108: Implementing Snoopy Caches
Figure 27.7 Main structure for a snoop-based cache coherence algorithm.

Slide 109: Example
Two processors issue requests to write into the same shared cache line, i.e., to change its state from shared to exclusive
The bus arbiter will allow one to go first
The other processor must then invalidate the line and issue a different request (a write miss)

Slide 110: Continued
Include intermediate states in the state diagram, e.g., a shared-to-exclusive intermediate state
The outcome of the request takes the line either to exclusive or to invalid

Slide 111: Two Useful Mechanisms
Locking of shared resources: requires atomic instructions; otherwise two processors can both read 0 and then both write 1, each believing it holds the lock
Barrier synchronization: a counter (stop until it reaches p) suffers obvious contention; a logical AND of per-processor flags is an alternative
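Both mechanisms in miniature, using C11 atomics (a sketch: a test-and-set style spin lock and a single-use counter barrier):

    #include <stdatomic.h>

    atomic_int lock_word = 0;
    void acquire(void) {
        int expected = 0;
        /* Atomic read-modify-write closes the "both read 0, then both write 1" race. */
        while (!atomic_compare_exchange_weak(&lock_word, &expected, 1))
            expected = 0;   /* on failure, expected was overwritten; reset and retry */
    }
    void release(void) { atomic_store(&lock_word, 0); }

    atomic_int arrived = 0;
    void barrier(int p) {   /* counter barrier: wait until all p processors arrive */
        atomic_fetch_add(&arrived, 1);
        while (atomic_load(&arrived) < p)
            ;               /* spin; the shared counter is itself a contention hot spot */
    }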

Slide 112: Distributed Shared Memory
Centralized memory is a scalability problem! Does sharing a global address space require that the memory be centralized? No.
NUMA (nonuniform memory access): a reference to a non-local address initiates a remote access, e.g., via DMA controllers

Slide 113: 27.4 Distributed Shared Memory
Figure 27.8 Structure of a distributed shared-memory multiprocessor.

Slide 114: Sequential Consistency
The example in the previous slide is nondeterministic: everything depends on the order of execution
The only expectation is that y = -1 will complete before z = 1
Any processor that sees the new value of z should also see the new value of y

Slide 115: Continued
Ensuring sequential consistency is not easy and comes with a performance penalty
It is insufficient to ensure that y changes before z changes: a processor with different latencies to y and z may see these events in the wrong order (the astronomer example)

Slide 116: Directories
Problem: dynamic data distribution creates management overhead, i.e., keeping track of the whereabouts of the various data elements so they can be found when needed

Slide 117: Assumptions
Each data line has a statically assigned home location within the memory modules, but may have multiple cached copies at various locations
Associated with each data element is a directory entry that specifies which caches currently hold a copy of the data (the sharing set)

Slide 118: Continued
If the number of processors is small, a bit vector is quite efficient: with 32 processors, the sharing set fits in a single 32-bit word
Alternatively, provide a listing of the caches: efficient when sharing sets are small
Neither method is scalable
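A sketch of a directory entry using the bit-vector representation (assuming at most 32 processors, as above):

    #include <stdint.h>

    typedef struct {
        enum { UNCACHED, SHARED, EXCLUSIVE } state;
        uint32_t sharing_set;   /* bit c is set iff cache c holds a copy */
    } dir_entry;

    void add_sharer(dir_entry *d, int c)  { d->sharing_set |=  (1u << c); }
    void drop_sharer(dir_entry *d, int c) { d->sharing_set &= ~(1u << c); }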

Slide 119: A Scalable Method
Scalable Coherent Interface (SCI) standard
The directory entry specifies one of the caches holding a copy; each copy contains a pointer to the next cached copy
A distributed directory scheme

Slide 120: 27.5 Directories to Guide Data Access
Figure 27.9 Distributed shared-memory multiprocessor with a cache, directory, and memory module associated with each processor.

Slide 121: Triggering Events
Messages sent by cache memories to the directory holding the requested line
Cache c indicates a read miss:
If shared: data sent to c; c added to the sharing set (SS)
If exclusive: fetch from the owner; data to c; c added to SS
If uncached: data to c; sharing set becomes {c}
In each case the line ends up shared

Slide 122: Continued
Cache c indicates a write miss:
If shared: invalidation message sent to the sharing set; SS becomes {c}; data to c
If exclusive: fetch/invalidate to the owner; data to c; SS becomes {c}
If uncached: data to c; SS becomes {c}
In each case the line ends up exclusive

Slide 123: Continued
If a writeback message is sent from a cache (which needs to replace a dirty line), the state becomes uncached and the sharing set becomes empty

Slide 124: Cache-Only Memory Architecture (COMA)
Data lines have no fixed homes
The entire memory associated with each PE is organized as a cache, with each line having one or more temporary copies

Slide 125: Shared Virtual Memory
When a page fault occurs in a node, the page fault handler obtains the page
The CPU and OS get involved in the process
Requires little hardware support

Slide 126: Directory-Based Cache Coherence
Figure 27.10 States and transitions for a directory entry in a directory-based cache coherence protocol (c is the requesting cache).

Slide 127: 27.6 Implementing Asymmetric Multiprocessors
Figure 27.11 Structure of a ring-based distributed-memory multiprocessor.

Slide 128: Link
The key component; allows communication with all other nodes
Enforces the coherence protocol
Contains a cache, a local directory, and interface controllers for both sides

Slide 129: Continued
On a cache miss, a request is sent via the ring
Coherence: within the node, snooping ensures coherence among the processor caches, the link's cache, and main memory
Internode coherence: directory-based protocol (TBD)

Slide 130: Scalable Coherent Interface (SCI)
Figure 27.11 Structure of a ring-based distributed-memory multiprocessor.

Slide 131: 28 Distributed Multicomputing
Computer architects' dream: connect computers like toy blocks
Building multicomputers from loosely connected nodes; internode communication is done via message passing
Topics in This Chapter:
28.1 Communication by Message Passing
28.2 Interconnection Networks
28.3 Message Composition and Routing
28.4 Building and Using Multicomputers
28.5 Network-Based Distributed Computing
28.6 Grid Computing and Beyond

Slide 132: 28.1 Communication by Message Passing
Figure 28.1 Structure of a distributed multicomputer.

Slide 133: How Is It Different?
Compared with a shared-memory multiprocessor:
Data transmission does not occur as a result of memory access requests; message-passing commands are activated by the processor
There is usually more than one connection from each node to the interconnection network

Slide 134: Continued
Shared-medium networks are controlled by bus acquisition
Direct, or point-to-point, networks involve forwarding

Slide 135: Router Design
Figure 28.2 The structure of a generic router.

Slide 136: Routing Decisions
Routing tables within the router, or routing tags carried with messages
Packet routing: store and forward
Wormhole routing: a small portion (a flit, or flow control digit) is forwarded with the expectation that the rest will follow

Slide 137: Switches
May contain queues and routing logic that make them similar to routers
Circuit-switched interconnection network: a path is established and data are sent over that path

Slide 138: Crossbar Switches
Have p² crossover points connecting p inputs to p outputs

Slide 139: Building Networks from Switches
Figure 28.3 Example 2 × 2 switch with point-to-point and broadcast connection capabilities. (See also Figure 27.2, butterfly and Beneš networks.)

Slide 140: Interprocess Communication via Messages
Figure 28.4 Use of send and receive message-passing primitives to synchronize two processes.

Slide 141: Message-Passing Interface
The MPI library provides collective operations: broadcasting, scattering, gathering, complete exchange
It also provides persistent communication and barrier synchronization
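A minimal MPI program exercising two of the primitives named above, broadcast and barrier (compile with mpicc, run under mpirun):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) value = 42;                         /* root produces the data */
        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* broadcasting */
        MPI_Barrier(MPI_COMM_WORLD);                       /* barrier synchronization */
        printf("rank %d sees %d\n", rank, value);
        MPI_Finalize();
        return 0;
    }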

Slide 142: Interconnection Networks
Analogy of a power grid: some nodes are merely relays or voltage converters
Indirect network: some nodes are noncomputing nodes, i.e., switches and routers

Slide 143: 28.2 Interconnection Networks
Figure 28.5 Examples of direct and indirect interconnection networks.

Slide 144: Direct Interconnection Networks
Figure 28.6 A sampling of common direct interconnection networks. Only routers are shown; a computing node is implicit for each router.

Slide 145: Indirect Interconnection Networks
Figure 28.7 Two commonly used indirect interconnection networks.

Slide 146: Two Key Parameters
Network diameter: defined in terms of the number of hops, i.e., router-to-router transfers (4 for the example 4 × 4 torus)
A larger diameter means higher worst-case latency

Slide 147: Continued
Bisection width: the minimum number of channels that must be cut to divide the p-node network into two parts containing ⌊p/2⌋ and ⌈p/2⌉ nodes (8 for the example 4 × 4 torus)
The greater this width, the higher the available communication bandwidth
Important when communication is random
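Both parameters for a k × k torus with k even, matching the numbers quoted above for k = 4 (a sketch under those assumptions):

    /* Wraparound links cap the distance at k/2 hops per dimension. */
    int torus_diameter(int k)  { return 2 * (k / 2); }  /* = 4 for k = 4 */

    /* Halving the torus severs 2 links per row (one direct, one wraparound)
       across all k rows. */
    int torus_bisection(int k) { return 2 * k;       }  /* = 8 for k = 4 */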

Slide 148: 28.3 Message Composition and Routing
Figure 28.8 Messages and their parts for message passing.

Slide 149: Wormhole Switching
Figure 28.9 Concepts of wormhole switching.

Slide 150: Routing Algorithms
Nondeterministic: adaptive, or probabilistic/random (for fairness)
Oblivious: a unique path can be determined at the source

Slide 151: Continued
Row-first routing on 2D mesh or torus networks: the message travels along its row and then down the appropriate column
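Row-first (dimension-order) routing on a plain mesh, as a hop-counting sketch (no wraparound links assumed):

    int route_row_first(int r, int c, int dest_r, int dest_c) {
        int hops = 0;
        while (c != dest_c) { c += (dest_c > c) ? 1 : -1; hops++; } /* along the row */
        while (r != dest_r) { r += (dest_r > r) ? 1 : -1; hops++; } /* then the column */
        return hops;
    }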

Slide 152: 28.4 Building and Using Multicomputers
Figure 28.10 A task system and schedules on 1, 2, and 3 computers.

Slide 153: Optimal Schedule?
Extremely difficult to find, because:
Task running times are data-dependent
Not all tasks are known a priori
Communication time must be considered
Message passing does not occur at task boundaries
Can every node execute every task?

Slide 154: Two-Tiered Approach
Assign tasks, then monitor progress and communication behavior
Reassign tasks for overloaded nodes or excessive communication
Load balancing

Slide 155: Building Multicomputers from Commodity Nodes
Figure 28.11 Growing clusters using modular nodes.

Slide 156: Network-Based Distributed Computing
What if we drop the assumptions of uniformity and regularity (identical or similar computing nodes, regular interconnections)?
Mix general-purpose and special-purpose resources
Use idle time on PCs

Slide 157: Clusters
Connected by a LAN; limited span
Also referred to as networks of workstations

Slide 158: Metacomputer
Connected by a WAN, with the Internet as an extreme example
Nodes are complete computers (end systems or platforms)
Requires a coordination strategy: client-server or peer-to-peer

Slide 159: Peer-to-Peer
Middleware helps coordinate the actions of the nodes; e.g., a group communication facility may be used to notify all participating processes of changes in the environment
Sterling has argued that, per MFLOPS, a supercomputer costs 10 times as much as a PC

Slide 160: 28.5 Network-Based Distributed Computing
Figure 28.12 Network of workstations.

Slide 161: Extensions to the OS
Maintain status information for efficient use and recovery
Task partitioning and scheduling
Initial assignment of tasks to nodes and, possibly, load balancing
Distributed file system management

Slide 162: Continued
Compilation of system usage data for planning and accounting
Enforcement of intersystem privacy and security policies
Some or all of these functions will be incorporated into standard operating systems

Slide 163: 28.6 Grid Computing and Beyond
A computational grid is analogous to the power grid: it decouples the "production" and "consumption" of computational power
Homes don't have an electricity generator; why should they have a computer?
Advantages of a computational grid:
Near-continuous availability of computational and related resources
Resource requirements based on the sum of averages, rather than the sum of peaks
Paying for services based on actual usage rather than peak demand
Distributed data storage for higher reliability, availability, and security
Universal access to specialized and one-of-a-kind computing resources
Still to be worked out: how to charge for computation usage

Slide 164: What Will It Take?
Interaction requirements are quite different from the way systems interact today:
More emphasis on communication capabilities and interfaces
Hardware and OS support for cluster coordination and communication with the outside world

Slide 165: Continued
Improved communication performance through lighter-weight interaction, much as in clusters, is needed
Internets generally have no centralized control whatsoever
Wide geographic distribution leads to nonnegligible signal propagation time

Slide 166: Continued
There are security, interorganizational, and international issues in data exchange

