Multiprocessor Concluding Remarks, Vector Processors Ch 8.10, Appendix B.

Building Parallel Processors

Progress has been slow in building effective and efficient parallel processors
– Difficult software problems
– Long process of evolving multiprocessor architectures
  - Dealing with long remote-access latency
  - Coherency
  - Synchronization
But most are optimistic about the future of parallel processing
– Beginning to understand how to write parallel algorithms
  - Initial demand in scientific and engineering problems
  - Growing demand in large-scale database systems
– Becoming more difficult to squeeze performance out of uniprocessors
– Effective for today's multiprogrammed workloads (e.g. servers)

Future for MPP

MPP = Massively Parallel Processors
– The largest-scale multiprocessors
"Hennessy and Patterson should move MPPs to Chapter 11"
– Jim Gray (Microsoft Research), when asked about coverage of MPPs in the textbook (i.e., the bankruptcy chapter)
Supercomputing companies have not fared so well in the marketplace (e.g. Cray bought by SGI)
– Very small market for large machines (> 64 nodes), typically > $5 million
Inexpensive today to build small-scale snooping-bus schemes
– But this is not the same as MPP machines

Building MPPs (1)

The challenge is to scale up in a cost-effective manner. Four alternatives today:
– Natural scaling via proprietary interconnect and communications technology
  - Intel Paragon, Cray T3D
  - Not cost-effective at small scales, and the programming model is typically unique to the particular architecture
– Clusters of mid-range machines with proprietary and standard technologies
  - Convex Exemplar, SGI Challenge array
  - Try to use cost-optimized building blocks, adding proprietary parts where needed (typically in the interconnect)
  - Programming model is also typically unique to the architecture

Building MPPs (2)

Cost-effective approaches:
– Use COTS uniprocessor nodes and a custom interconnect
  - IBM SP-2
  - Each node becomes a workstation; use message passing to share data
  - Poor performance for communications-rich applications
– Use entirely COTS technology
  - Beowulf cluster: commodity processors, commodity network (e.g. ATM or high-speed Ethernet)
  - Standard programming interfaces also exist
  - System generally built as a COW (cluster of workstations)
  - Inexpensive, but also suffers if applications have heavy communication demands

Microprocessor Architecture Future

Prospects for more ILP are somewhat limited
Continued technology improvements are likely to improve the clock rate
– Starting to see tradeoffs, though, e.g. higher failure rate of transistors
– Larger effect of noise with smaller components
The authors predict a slowdown in performance growth, to slower than doubling every 18 months
Of course, all predictions are just SWAGs…

Some Prior Statements

"Everything that can be invented has been invented."
– US Commissioner of Patents, 1899
"I think there is a world market for about five computers."
– Thomas J. Watson Sr., IBM founder, 1943
"There is no reason for any individuals to have a computer in their home."
– Ken Olsen, CEO of Digital Equipment Corp., 1977
"The current rate of progress can't continue much longer. We're facing fundamental problems that we didn't have to deal with before."
– Various computer technologists

Trend by 2010?

Wire delay may become more prominent
– Latency for signals across the chip
– Unavoidable? …go optical, quantum, or parallel
Compilers getting better
– Can rely on them for more parallelism
2-4 CPU machines more common today
– Perhaps even multiple parallel processors on one chip (e.g. Stanford CMP: 16 simple but fast processors, each with its own L1, sharing an L2)
Fewer manufacturers
– Billions to construct fabs
– Huge cost for design, test, validation

Vector Processors (Appendix B)

Most well-known is perhaps the Cray-1
Essentially a SIMD machine
– Small-scale versions in place today on commodity processors with MMX, SSE, Velocity Engine
Programming is similar to that of a uniprocessor machine, but can take advantage of parallelism when we run into performance barriers from pipelining

What is a Vector Processor?

Provides high-level operations that work on vectors
– A vector is a linear array of numbers
  - The type of number can vary (IEEE 754, 2's complement)
  - The length of the array also varies depending on the hardware
    - Example vectors would be 64 or 128 elements in length
    - Small vectors (e.g. MMX/SSE) are about 4 elements in length
– Example usage: add two 64-element floating-point vectors to obtain a single 64-element vector result
  - Performed in parallel instead of sequentially
A vector instruction is equivalent to a loop (up to the vector length), with each iteration computing one of the results, updating the indices, and branching back (see the sketch below)
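
As a minimal illustration, here is the scalar loop that a single 64-element vector add replaces; the program is a sketch, with the array names X, Y, and Z and their initial values chosen only for illustration:

   PROGRAM VECADD
   REAL X(64), Y(64), Z(64)
   INTEGER I
   ! initialize the inputs (arbitrary values)
   DO I = 1, 64
      X(I) = REAL(I)
      Y(I) = 2.0
   END DO
   ! on a scalar machine this loop takes 64 iterations of
   ! load/add/store plus index update and branch; a vector
   ! processor performs it as a single vector add instruction
   DO I = 1, 64
      Z(I) = X(I) + Y(I)
   END DO
   PRINT *, Z(1), Z(64)
   END PROGRAM VECADD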

Vector Processor Properties

The computation of each result must be independent of previous results
– I.e. we need the absence of data hazards
A single vector instruction specifies a great deal of work
– Equivalent to executing an entire loop
Vector instructions must access memory in a known access pattern
– Need the vector elements to be located adjacently; they can then be fetched quickly from heavily interleaved memory banks
– Memory latency should be paid only once for the entire vector, not once per word of the vector
Many control hazards can be avoided, since the entire loop is replaced by a vector instruction

Basic Vector Architecture

A vector processor typically consists of
– An ordinary pipelined scalar unit
– Plus a vector unit that can handle FP or integer data
Generally use a vector-register processor
– All vector operations except load/store are among vector registers (a vector-memory architecture would go to memory instead)
– The advantages are the same as our reasons for a load/store uniprocessor

Primary Components of the Vector Processor

Vector registers
– Like a regular register, but holds an entire array of data (in DLXV there are 8 vector registers, each holding 64 elements)
Vector functional units
– Fully pipelined
– Operate like our old functional units; need to detect hazards and stall when necessary
Vector load/store unit
– Load/store instructions can transfer an entire array at once
  - Needs high-bandwidth memory
  - Will want pipelined writes
– Could also handle scalar loads/stores
Set of scalar registers
– Normal general-purpose registers; could be used to load vectors

Vectorization Concepts

Vectorization occurs for operations on arrays
Vectorization occurs in loops (explicit or implicit) of any type
Only innermost loops are vectorized
Data dependencies can inhibit vectorization; the results are then computed serially
Vector registers allow array values to be stored very close to the functional units
Once the vector registers are loaded, operands can be pumped into the functional units (and results generated) every clock period, thanks to pipelining
– Ideally one FU per vector element, but this may be unlikely
Vectorization increases sustained performance by increasing the bandwidth of data flow into the functional units

Vectorization Example

This loop will vectorize automatically (typically coded in FORTRAN!):

   DO I = 1, N
      A(I) = X(I) + Y(I)
      D(I) = E(I) * COS(F(I))
   END DO

Load the elements into vector registers, then pump the values in the registers through the functional units.

Vectorization Speedup

Real performance is determined by the number of results that can be calculated in the functional units per clock period (as in serial computation)
– Convoy: a set of vector instructions that could begin execution in the same clock cycle without hazards
– Chime: the unit of time to execute one convoy of a vector sequence
– m convoys execute in m chimes; for a vector of length n, approximately m*n clock cycles to complete
Vector registers help sustain high performance by increasing bandwidth to the functional units; serial computers have trouble keeping the functional units busy
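
A small worked example under these definitions, using the DAXPY sequence from the DLX/DLXV slide later in this deck (the convoy grouping assumes a single memory pipeline and no chaining): the first LV forms convoy 1; MULTSV plus the second LV form convoy 2; ADDV is convoy 3; SV is convoy 4. With m = 4 convoys and vector length n = 64, the sequence takes roughly m * n = 4 * 64 = 256 clock cycles, ignoring start-up overhead.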

Vectorization Speedup

Vectorized speedup is limited by vector loads and by operations that don't chain efficiently
– I.e. data hazards cause problems
Typically see about a 10x speedup over serial computation of the same loop
The most efficient vectors have a length that is a multiple N of the vector size V; the least efficient are of size N*V + 1, since the last vector load is not amortized (e.g. with V = 64, a 65-element vector needs a second pass that computes only one element)
– Similar idea to loop unrolling, but with hardware support

Vectorization Inhibitors

Pretty much our list of usual suspects that hurt ILP:
– Subroutine and function calls (but we can inline them and perhaps then vectorize; see the sketch below)
– I/O statements
– Arithmetic IF, GOTO
– Partial-word data (character) operations
– Unstructured branches
– Data dependencies
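
A minimal sketch of the first inhibitor; the function F and its body are hypothetical, chosen only for illustration:

   ! F is an external user function: the call forces serial execution
   REAL A(100), B(100), F
   EXTERNAL F
   DO I = 1, 100
      A(I) = F(B(I))          ! call inhibits vectorization
   END DO
   ! if the compiler can inline F, say F(x) = 2.0*x + 1.0, the body
   ! becomes straight-line arithmetic and the loop vectorizes
   DO I = 1, 100
      A(I) = 2.0*B(I) + 1.0
   END DO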

Dependence Example

This loop will not vectorize and must be computed serially:

   DO I = 2, N-1
      A(I) = B(I) + A(I-1)
   END DO

The compiler detects the backward reference on A(I-1).

This loop will vectorize:

   DO I = 2, N-1
      A(I) = B(I) + A(I+1)
   END DO

A(I+1) is a forward reference; the result is the same in serial or vector mode. The compiler uses the non-updated value.

Dependence Example

   DO I = N, 2, -1
      A(I) = B(I) + A(I+1)
   END DO

Will this loop vectorize? (In C terms: for (I = N; I >= 2; I--))

DLX/DLXV Example

DLX code:

         LD     F0, A
         ADDI   R4, Rx, #512    ; last address
   Loop: LD     F2, 0(Rx)
         MULTD  F2, F0, F2      ; A * X[I]
         LD     F4, 0(Ry)
         ADDD   F4, F2, F4      ; + Y[I]
         SD     0(Ry), F4
         ADDI   Rx, Rx, #8      ; increment index
         ADDI   Ry, Ry, #8
         SUB    R20, R4, Rx
         BNEZ   R20, Loop

DLXV code:

         LD     F0, A           ; load scalar A
         LV     V1, Rx          ; load vector X
         MULTSV V2, F0, V1      ; vector-scalar multiply
         LV     V3, Ry          ; load vector Y
         ADDV   V4, V2, V3      ; vector add
         SV     Ry, V4          ; store result

The DLX loop executes 64 times; 64 is the vector length in DLXV, so no loop is needed. Great reduction in instruction bandwidth! Stalls occur only once per vector operation, not once per element.
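
For reference, the source-level loop both sequences implement is the classic DAXPY kernel over 64 double-precision elements (matching the #512 byte offset: 512 bytes / 8 bytes per element = 64):

   ! Y = a*X + Y over 64 elements; a is the scalar held in F0 above
   DO I = 1, 64
      Y(I) = A * X(I) + Y(I)
   END DO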

Vector Load-Store and Memory

More complex than a normal memory access for a functional unit; can use some of the ideas we discussed for improving memory access
Start-up time
– Time to get the first word from memory into a register
– The vector unit could start execution on the first word as the rest of the vector is loaded
Most vector processors use multiple memory banks as opposed to simple interleaving
– Supports multiple simultaneous accesses
– Many vector processors support the ability to load or store data that is not sequential
May also use SRAM as main memory to avoid high memory start-up costs
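
A small worked example of the bank requirement (the numbers are assumed for illustration): if each bank is busy for 6 clock cycles per access, then sustaining one element per clock for a unit-stride vector requires at least 6 independent banks; with fewer banks, consecutive element accesses would hit a still-busy bank and stall.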

Vector Length

We would like loops to iterate the same number of times as we have elements in a vector
– But this is unlikely in a real program
– Also, the number of iterations might be unknown at compile time
Problem: n, the number of iterations, is greater than MVL (Maximum Vector Length)
– Solution: strip mining, just like we did with loop unrolling
– Create one loop that iterates a multiple of MVL times
– Create an initial piece that handles the remaining iterations, which must number less than MVL

Strip Mining Example

      low = 1
      VL = (n mod MVL)              ! find the odd-sized piece
      DO 1 j = 0, (n / MVL)         ! outer loop
         DO 10 i = low, low+VL-1    ! runs for length VL
            Y(i) = a*X(i) + Y(i)    ! main operation
10       CONTINUE
         low = low + VL
         VL = MVL
1     CONTINUE

Executes the loop in blocks of MVL; the inner loop can be vectorized.

Vector Stride

The positions of the elements we want may not be sequential in memory. Consider the following code:

   DO 10 I = 1, 100
      DO 10 J = 1, 100
         A(I,J) = 0.0
         DO 10 K = 1, 100
            A(I,J) = A(I,J) + B(I,K)*C(K,J)
10 CONTINUE

If the loop accesses data by column, the vector is loaded with non-sequential data.

[Figure: matrix data in memory, showing successive elements at addresses X, X+5, X+10, X+15, X+20]

Vector Stride

The distance separating elements to be gathered into a vector register is the stride
Vectors may be loaded with non-unit stride; the vector register behaves as if all the data were contiguous
This can provide a major advantage over a cache-based processor
– A cache inherently deals with unit-stride data
The vector processor must be able to compute the stride dynamically, since the matrix size may not be known at compile time
– The solution is to store the stride in a GPR
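
A minimal sketch of where non-unit stride arises (the 100x100 array shape is assumed for illustration): Fortran stores arrays column-major, so walking the first index is stride-1, while walking the second index strides by the column length.

   REAL A(100,100), S
   S = 0.0
   ! stride-1: consecutive elements of a column are adjacent in memory
   DO I = 1, 100
      S = S + A(I,1)
   END DO
   ! stride-100: consecutive elements of a row are 100 reals apart,
   ! so a vector load of this row needs a stride of 100 elements
   DO J = 1, 100
      S = S + A(1,J)
   END DO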

Improving Vector Performance

Better compiler techniques
– As with all other techniques, we may be able to rearrange code to increase the amount of vectorization
Techniques for accessing sparse matrices
– Hardware support to move between dense (no zeros) and normal (including zeros) representations
Chaining
– Same idea as forwarding in pipelining
– Consider:

      MULTV V1, V2, V3
      ADDV  V4, V1, V5

– ADDV must wait for MULTV to finish
– But we could implement forwarding: as each element from the MULTV finishes, send it off to the ADDV to start work

Chaining Example

Unchained: the ADDV cannot start until the MULTV completes
– 7 + 64 (MULTV) followed by 6 + 64 (ADDV): total = 141 cycles
Chained: each MULTV result is forwarded to the ADDV as it is produced
– 7 + 64 + 6: total = 77 cycles
The 6 and 7 cycles are the start-up times of the adder and multiplier
Every vector processor today performs chaining

Improving Performance

Conditionally executed statements
– Consider the following loop:

   DO 100 I = 1, 64
      IF (A(I) .NE. 0) THEN
         A(I) = A(I) - B(I)
      ENDIF
100 CONTINUE

– Not vectorizable due to the conditional statement
– But we could vectorize this if we could somehow include in the vector operation only those elements where A(I) != 0

Conditional Execution

Solution: create a vector mask of bits, one corresponding to each vector element
– 1 = apply the operation
– 0 = leave alone
As long as we set the mask properly first, we can now vectorize the previous loop with the conditional (see the sketch below)
Implemented on most vector processors today
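
As an aside, the same mask idea exists at the source level: Fortran 90's WHERE construct expresses exactly this masked update, and a vectorizing compiler can map it onto mask registers. A minimal sketch (A and B assumed initialized elsewhere):

   REAL A(64), B(64)
   ! elements with A(I) == 0 are left alone, just as a mask bit
   ! of 0 leaves them alone in the vector unit
   WHERE (A .NE. 0.0)
      A = A - B
   END WHERE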

Concluding Remarks

The first supercomputers were vector processors
– The gap has closed with the advent of fast, pipelined systems
– The idea of small-scale vector processing has re-surfaced in commodity processors
Most usage of vector processing today is in scientific computing
– Requires large memory bandwidth
– Compiler support is also important
– The days of vector processors may be numbered; more emphasis today on distributed processing, clusters, and massively parallel processors