EECC756 - Shaaban lec # 1 Spring 2003 (3-11-2003) Parallel Computer Architecture: a parallel computer is a collection of processing elements that cooperate to solve large computational problems fast.


EECC756 - Shaaban #1 lec # 1 Spring 2003 Parallel Computer Architecture
A parallel computer is a collection of processing elements that cooperate to solve large computational problems fast. Broad issues involved:
–The concurrency and communication characteristics of parallel algorithms for a given computational problem.
–Computing resources and computation allocation: the number of processing elements (PEs), the computing power of each element, and the amount of physical memory used; what portions of the computation and data are allocated to each PE.
–Data access, communication and synchronization: how the elements cooperate and communicate, how data is transmitted between processors, and the abstractions and primitives for cooperation.
–Performance and scalability: maximize the performance enhancement from parallelism (speedup) by minimizing parallelization overheads, and maintain scalability of performance to larger systems/problems.

EECC756 - Shaaban #2 lec # 1 Spring 2003 The Need And Feasibility of Parallel Computing
Application demands: more computing cycles are needed:
–Scientific computing: CFD, biology, chemistry, physics, ...
–General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming...
–Mainstream multithreaded programs are similar to parallel programs.
Technology trends:
–The number of transistors on a chip is growing rapidly, while clock rates are expected to rise only slowly.
Architecture trends:
–Instruction-level parallelism is valuable but limited.
–Coarser-level parallelism, as in multiprocessor systems, is the most viable approach to further improve performance.
Economics:
–The increased use of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost. Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs. Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.

EECC756 - Shaaban #3 lec # 1 Spring Scientific Computing Demands (Memory Requirement)

EECC756 - Shaaban #4 lec # 1 Spring Scientific Supercomputing Trends Proving ground and driver for innovative architecture and advanced computing techniques: –Market is much smaller relative to commercial segment –Dominated by vector machines starting in the 70s through the 80s –Meanwhile, microprocessors have made huge gains in floating-point performance High clock rates. Pipelined floating point units. Instruction-level parallelism. Effective use of caches. Large-scale multiprocessors and computer clusters are replacing vector supercomputers

EECC756 - Shaaban #5 lec # 1 Spring CPU Performance Trends The microprocessor is currently the most natural building block for multiprocessor systems in terms of cost and performance.

EECC756 - Shaaban #6 lec # 1 Spring General Technology Trends Microprocessor performance increases 50% - 100% per year Transistor count doubles every 3 years DRAM size quadruples every 3 years

EECC756 - Shaaban #7 lec # 1 Spring Clock Frequency Growth Rate Currently increasing 30% per year

EECC756 - Shaaban #8 lec # 1 Spring Transistor Count Growth Rate One billion transistors on chip by early 2004 Transistor count grows much faster than clock rate - Currently 40% per year

EECC756 - Shaaban #9 lec # 1 Spring Parallelism in Microprocessor VLSI Generations SMT: e.g. Intel’s Hyper-threading

EECC756 - Shaaban #10 lec # 1 Spring 2003 Uniprocessor Attributes to Performance
Performance benchmarking is program-mix dependent. Ideal performance requires a perfect machine/program match.
Performance measures:
–Cycles per instruction (CPI).
–Total CPU time:  T = C x τ = C / f = Ic x CPI x τ = Ic x (p + m x k) x τ
  where Ic = instruction count, τ = CPU cycle time, p = processor (decode/execute) cycles per instruction, m = memory references per instruction, k = ratio between memory and processor cycle times, C = total program clock cycles, f = clock rate.
–MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = f x Ic / (C x 10^6)
–Throughput rate: Wp = f / (Ic x CPI) = (MIPS) x 10^6 / Ic
Performance factors (Ic, p, m, k, τ) are influenced by: instruction-set architecture, compiler design, CPU implementation and control, cache and memory hierarchy, and the program's instruction mix and instruction dependencies.
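To make the relationships above concrete, here is a small worked sketch of the CPU-time and MIPS formulas; all parameter values are hypothetical and chosen only for illustration.

```c
/* Worked sketch of T = Ic x CPI x tau and MIPS = f / (CPI x 10^6).
   All numbers are hypothetical, for illustration only. */
#include <stdio.h>

int main(void) {
    double Ic  = 200e6;    /* instruction count (hypothetical) */
    double p   = 1.2;      /* processor cycles per instruction (hypothetical) */
    double m   = 0.3;      /* memory references per instruction (hypothetical) */
    double k   = 4.0;      /* memory/processor cycle-time ratio (hypothetical) */
    double f   = 500e6;    /* clock rate in Hz (hypothetical) */
    double tau = 1.0 / f;  /* CPU cycle time */

    double CPI  = p + m * k;        /* effective cycles per instruction = 2.4 */
    double T    = Ic * CPI * tau;   /* total CPU time in seconds */
    double MIPS = f / (CPI * 1e6);  /* millions of instructions per second */

    printf("CPI = %.2f, T = %.3f s, MIPS = %.1f\n", CPI, T, MIPS);
    return 0;
}
```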

EECC756 - Shaaban #11 lec # 1 Spring 2003 Raw Uniprocessor Performance: LINPACK (chart comparing vector processors and microprocessors).

EECC756 - Shaaban #12 lec # 1 Spring Raw Parallel Performance: LINPACK

EECC756 - Shaaban #13 lec # 1 Spring 2003 LINPACK Performance Trends (charts: uniprocessor performance and parallel system performance).

EECC756 - Shaaban #14 lec # 1 Spring 2003 Computer System Peak FLOP Rating History/Near Future (chart spanning teraflop to petaflop systems).

EECC756 - Shaaban #15 lec # 1 Spring 2003 The Goal of Parallel Processing
Goal of applications in using parallel machines: maximize speedup over single-processor performance:
  Speedup (p processors) = Performance (p processors) / Performance (1 processor)
For a fixed problem size (input data set), performance = 1/time, so:
  Speedup fixed problem (p processors) = Time (1 processor) / Time (p processors)
Ideal speedup = number of processors = p. Very hard to achieve.

EECC756 - Shaaban #16 lec # 1 Spring 2003 The Goal of Parallel Processing
The parallel processing goal is to maximize parallel speedup:
  Speedup = Time(1) / Time(p) ≤ Sequential work on one processor / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
where the synchronization wait time, communication cost, and extra work on the busiest processor are the parallelization overheads.
Ideal speedup = p = number of processors.
–Very hard to achieve: implies no parallelization overheads and perfect load balance among all processors.
Maximize parallel speedup by:
–Balancing computations on processors (every processor does the same amount of work).
–Minimizing communication cost and other overheads associated with each step of parallel program creation and execution.
Performance scalability: achieve a good speedup for the parallel application on the parallel architecture as problem size and machine size (number of processors) are increased.
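A minimal numeric sketch of the fixed-problem-size speedup and its overhead terms follows; the timings, processor count, and overhead values are hypothetical, chosen only to illustrate the formula.

```c
/* Speedup/efficiency sketch for a fixed problem size.
   All numbers are hypothetical, for illustration only. */
#include <stdio.h>

int main(void) {
    double time_1 = 120.0;   /* Time(1): seconds on one processor (hypothetical) */
    int    p      = 8;       /* number of processors (hypothetical) */

    /* busiest processor's time = useful work + parallelization overheads */
    double work       = time_1 / p;  /* assumes perfectly balanced useful work */
    double synch_wait = 1.5;         /* hypothetical overheads, in seconds */
    double comm_cost  = 2.0;
    double extra_work = 0.5;

    double time_p  = work + synch_wait + comm_cost + extra_work;  /* Time(p) */
    double speedup = time_1 / time_p;
    double effic   = speedup / p;

    printf("Time(p) = %.1f s, Speedup = %.2f (ideal %d), Efficiency = %.0f%%\n",
           time_p, speedup, p, effic * 100.0);
    return 0;
}
```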

EECC756 - Shaaban #17 lec # 1 Spring 2003 Elements of Parallel Computing (diagram): computing problems, algorithms and data structures, high-level languages, applications software, operating system, and hardware architecture, linked through mapping, programming, binding (compile, load), and performance evaluation.

EECC756 - Shaaban #18 lec # 1 Spring 2003 Elements of Parallel Computing
1 Computing Problems:
–Numerical computing: science and technology problems demand intensive integer and floating-point computations.
–Logical reasoning: artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches.
2 Algorithms and Data Structures:
–Special algorithms and data structures are needed to specify the computations and communication present in computing problems.
–Most numerical algorithms are deterministic and use regular data structures.
–Symbolic processing may use heuristics or non-deterministic searches.
–Parallel algorithm development requires interdisciplinary interaction.

EECC756 - Shaaban #19 lec # 1 Spring 2003 Elements of Parallel Computing
3 Hardware Resources:
–Processors, memory, and peripheral devices form the hardware core of a computer system.
–The processor instruction set, processor connectivity, and memory organization influence the system architecture.
4 Operating Systems:
–Manage the allocation of resources to running processes.
–Mapping to match algorithmic structures with hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
–Parallelism is exploited at algorithm design, program writing, compilation, and run time.

EECC756 - Shaaban #20 lec # 1 Spring 2003 Elements of Parallel Computing
5 System Software Support:
–Needed for the development of efficient programs in high-level languages (HLLs): assemblers, loaders, portable parallel programming languages, user interfaces and tools.
6 Compiler Support:
–Preprocessor compiler: a sequential compiler plus a low-level library of the target parallel computer.
–Precompiler: some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
–Parallelizing compiler: can automatically detect parallelism in source code and transform sequential code into parallel constructs.

EECC756 - Shaaban #21 lec # 1 Spring 2003 Approaches to Parallel Programming (diagram):
(a) Implicit parallelism: the programmer writes source code in sequential languages (C, C++, FORTRAN, LISP); a parallelizing compiler produces parallel object code, which is executed by the runtime system.
(b) Explicit parallelism: the programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP; a concurrency-preserving compiler produces concurrent object code, which is executed by the runtime system.

EECC756 - Shaaban #22 lec # 1 Spring 2003 Factors Affecting Parallel System Performance
Parallel algorithm related:
–Available concurrency and its profile, grain, uniformity, and patterns.
–Required communication/synchronization, its uniformity and patterns.
–Data size requirements.
–Communication-to-computation ratio.
Parallel program related:
–Programming model used.
–Resulting data/code memory requirements, locality, and working-set characteristics.
–Parallel task grain size.
–Assignment: dynamic or static.
–Cost of communication/synchronization.
Hardware/architecture related:
–Total CPU computational power available.
–Types of computation modes supported.
–Shared address space vs. message passing.
–Communication network characteristics (topology, bandwidth, latency).
–Memory hierarchy properties.

EECC756 - Shaaban #23 lec # 1 Spring 2003 Evolution of Computer Architecture (tree diagram): from scalar sequential processing through lookahead, I/E overlap, and functional parallelism (multiple functional units, pipelining) to implicit and explicit vector machines (memory-to-memory and register-to-register designs), and on to SIMD (processor arrays, associative processors) and MIMD (multiprocessors and multicomputers), leading to massively parallel processors (MPPs) and computer clusters.
I/E: instruction fetch and execute. SIMD: single instruction stream over multiple data streams. MIMD: multiple instruction streams over multiple data streams.

EECC756 - Shaaban #24 lec # 1 Spring 2003 Parallel Architectures History (diagram: application software and system software built on divergent architectures — systolic arrays, SIMD, dataflow, message passing, shared memory).
Historically, parallel architectures were tied to programming models: divergent architectures, with no predictable pattern of growth.

EECC756 - Shaaban #25 lec # 1 Spring 2003 Parallel Programming Models
A programming model is the programming methodology used in coding applications; it specifies communication and synchronization. Examples:
–Multiprogramming: no communication or synchronization at the program level; a number of independent programs.
–Shared memory address space: parallel program threads or tasks communicate using a shared memory address space.
–Message passing: explicit point-to-point communication is used between parallel program tasks.
–Data parallel: more regimented, global actions on data; can be implemented with a shared address space or message passing.

EECC756 - Shaaban #26 lec # 1 Spring Flynn’s 1972 Classification of Computer Architecture Single Instruction stream over a Single Data stream (SISD): Conventional sequential machines. Single Instruction stream over Multiple Data streams (SIMD): Vector computers, array of synchronized processing elements. Multiple Instruction streams and a Single Data stream (MISD): Systolic arrays for pipelined execution. Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers: Shared memory multiprocessors. Multicomputers: Unshared distributed memory, message-passing used instead.

EECC756 - Shaaban #27 lec # 1 Spring Flynn’s Classification of Computer Architecture Fig. 1.3 page 12 in Advanced Computer Architecture: Parallelism, Scalability, Programmability, Hwang, 1993.

EECC756 - Shaaban #28 lec # 1 Spring Current Trends In Parallel Architectures The extension of “computer architecture” to support communication and cooperation: –OLD: Instruction Set Architecture –NEW: Communication Architecture Defines: –Critical abstractions, boundaries, and primitives (interfaces) –Organizational structures that implement interfaces (hardware or software) Compilers, libraries and OS are important bridges today

EECC756 - Shaaban #29 lec # 1 Spring 2003 Modern Parallel Architecture Layered Framework (diagram, top to bottom): parallel applications (CAD, database, scientific modeling); programming models (multiprogramming, shared address, message passing, data parallel); communication abstraction (user/system boundary); compilation or library; operating systems support; communication hardware (hardware/software boundary); physical communication medium.

EECC756 - Shaaban #30 lec # 1 Spring Shared Address Space Parallel Architectures Any processor can directly reference any memory location –Communication occurs implicitly as result of loads and stores Convenient: –Location transparency –Similar programming model to time-sharing in uniprocessors Except processes run on different processors Good throughput on multiprogrammed workloads Naturally provided on a wide range of platforms –Wide range of scale: few to hundreds of processors Popularly known as shared memory machines or model –Ambiguous: Memory may be physically distributed among processors

EECC756 - Shaaban #31 lec # 1 Spring Shared Address Space (SAS) Parallel Programming Model Process: virtual address space plus one or more threads of control Portions of address spaces of processes are shared Writes to shared address visible to other threads (in other processes too) Natural extension of the uniprocessor model: Conventional memory operations used for communication Special atomic operations needed for synchronization OS uses shared memory to coordinate processes
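As an illustration of this model, the sketch below uses POSIX threads: threads in one process communicate through ordinary loads and stores to shared variables, with a mutex as the synchronization primitive. The array size, thread count, and variable names are illustrative, not taken from the slides.

```c
/* Shared address space sketch: threads cooperate via ordinary loads/stores
   on shared data; a mutex synchronizes the shared update.
   Sizes, thread count, and names are illustrative. Compile with: gcc -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];                 /* shared: visible to all threads */
static double global_sum = 0.0;        /* shared accumulator */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long id = (long)arg;
    long chunk = N / NTHREADS;
    long lo = id * chunk;
    long hi = (id == NTHREADS - 1) ? N : lo + chunk;

    double local = 0.0;                /* private partial sum */
    for (long i = lo; i < hi; i++)
        local += data[i];

    pthread_mutex_lock(&lock);         /* synchronize the shared write */
    global_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("sum = %.0f\n", global_sum); /* expect 1000000 */
    return 0;
}
```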

EECC756 - Shaaban #32 lec # 1 Spring 2003 Models of Shared-Memory Multiprocessors
The Uniform Memory Access (UMA) model:
–The physical memory is shared by all processors, and all processors have equal access time to all memory addresses.
–Also referred to as Symmetric Memory Processors (SMPs).
The distributed-memory or Nonuniform Memory Access (NUMA) model:
–Shared memory is physically distributed locally among processors; access latency to remote memory is higher.
The Cache-Only Memory Architecture (COMA) model:
–A special case of a NUMA machine in which all distributed main memory is converted to caches; there is no memory hierarchy at each processor.

EECC756 - Shaaban #33 lec # 1 Spring 2003 Models of Shared-Memory Multiprocessors (diagrams): the Uniform Memory Access (UMA) model, or Symmetric Memory Processors (SMPs) — processors with caches (P, $) sharing memory modules (M) over an interconnect (bus, crossbar, or multistage network); the distributed-memory or Nonuniform Memory Access (NUMA) model — processor/memory nodes connected by a network; and the Cache-Only Memory Architecture (COMA) — nodes of processor (P), cache (C), and cache directory (D) connected by a network.

EECC756 - Shaaban #34 lec # 1 Spring Uniform Memory Access Example: Intel Pentium Pro Quad All coherence and multiprocessing glue in processor module Highly integrated, targeted at high volume Low latency and bandwidth

EECC756 - Shaaban #35 lec # 1 Spring Uniform Memory Access Example: SUN Enterprise –16 cards of either type: processors + memory, or I/O –All memory accessed over bus, so symmetric –Higher bandwidth, higher latency bus

EECC756 - Shaaban #36 lec # 1 Spring Distributed Shared-Memory Multiprocessor System Example: Cray T3E Scale up to 1024 processors, 480MB/s links Memory controller generates communication requests for nonlocal references No hardware mechanism for coherence (SGI Origin etc. provide this)

EECC756 - Shaaban #37 lec # 1 Spring 2003 Message-Passing Multicomputers
Comprised of multiple autonomous computers (nodes) connected via a suitable network. Each node consists of one or more processors, local memory, attached storage, and I/O peripherals.
Local memory is only accessible by the local processors in a node. Inter-node communication is carried out by message passing through the connection network.
Process communication is achieved using a message-passing programming environment.
–The programming model is more removed from basic hardware operations.
Examples include:
–A number of commercial Massively Parallel Processor systems (MPPs).
–Computer clusters that utilize commodity off-the-shelf (COTS) components.

EECC756 - Shaaban #38 lec # 1 Spring 2003 Message-Passing Abstraction
Send specifies the buffer to be transmitted and the receiving process. Receive specifies the sending process and the application storage to receive into.
Memory-to-memory copy is possible, but processes must be named. An optional tag on the send and a matching rule on the receive; the user process names local data and entities in process/tag space too.
In its simplest form, the send/receive match achieves a pairwise synchronization event.
Many overheads: copying, buffer management, protection.
(Diagram: process P executes Send(X, Q, t) from its local address X; process Q executes Receive(Y, P, t) into its local address Y; the send and receive are matched on process and tag t across the two local process address spaces.)
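A minimal MPI rendering of this send/receive pairing follows; the buffer contents, tag value, and rank assignments are illustrative, and error handling is omitted. Build and run with an MPI toolchain, e.g. mpicc and mpirun -np 2.

```c
/* Message-passing sketch: rank 0 (process P) sends buffer X to rank 1
   (process Q), which receives it into its own local buffer Y. The pair is
   matched on source/destination rank and tag, as in the abstraction above.
   Values are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int TAG = 99;              /* illustrative tag */
    double X[4] = {1.0, 2.0, 3.0, 4.0};
    double Y[4] = {0.0};

    if (rank == 0) {
        /* process P: send local data X to process Q (rank 1) with tag TAG */
        MPI_Send(X, 4, MPI_DOUBLE, 1, TAG, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* process Q: receive from rank 0 with a matching tag into local Y */
        MPI_Recv(Y, 4, MPI_DOUBLE, 0, TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received: %.0f %.0f %.0f %.0f\n", Y[0], Y[1], Y[2], Y[3]);
    }

    MPI_Finalize();
    return 0;
}
```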

EECC756 - Shaaban #39 lec # 1 Spring Message-Passing Example: IBM SP-2 Made out of essentially complete RS6000 workstations Network interface integrated in I/O bus (bandwidth limited by I/O bus)

EECC756 - Shaaban #40 lec # 1 Spring Message-Passing Example: Intel Paragon

EECC756 - Shaaban #41 lec # 1 Spring 2003 Message-Passing Programming Tools
Message-passing programming environments include:
–Message Passing Interface (MPI): provides a standard for writing concurrent message-passing programs. MPI implementations include parallel libraries used by existing programming languages.
–Parallel Virtual Machine (PVM): enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource. PVM support software executes on each machine in a user-configurable pool and provides a computational environment for concurrent applications. User programs, written for example in C, Fortran, or Java, access PVM through calls to PVM library routines.

EECC756 - Shaaban #42 lec # 1 Spring 2003 Data Parallel Systems
SIMD in Flynn's taxonomy.
Programming model:
–Operations are performed in parallel on each element of a data structure.
–Logically a single thread of control performs sequential or parallel steps.
–Conceptually, a processor is associated with each data element.
Architectural model:
–An array of many simple, cheap processors, each with little memory; the processors do not sequence through instructions.
–Attached to a control processor that issues instructions.
–Specialized and general communication; cheap global synchronization.
Example machines:
–Thinking Machines CM-1, CM-2 (and CM-5).
–Maspar MP-1 and MP-2.
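The data-parallel style — one logical thread of control applying the same operation to every element — can be approximated on a conventional shared-memory machine; the sketch below uses an OpenMP parallel loop as a programming-model analogy. The array size and the element-wise operation are illustrative, and this is not a description of the SIMD machines listed above.

```c
/* Data-parallel sketch: the same element-wise operation is applied across
   the whole array under one logical thread of control. Compile with:
   gcc -fopenmp. Array size and operation are illustrative. */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* one global element-wise operation; iterations execute in parallel */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %.1f\n", c[N - 1]);  /* expect 3*(N-1) = 2999997.0 */
    return 0;
}
```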

EECC756 - Shaaban #43 lec # 1 Spring Dataflow Architectures Represent computation as a graph of essential dependences –Logical processor at each node, activated by availability of operands –Message (tokens) carrying tag of next instruction sent to next processor –Tag compared with others in matching store; match fires execution Research Dataflow machine prototypes include: The MIT Tagged Architecture The Manchester Dataflow Machine

EECC756 - Shaaban #44 lec # 1 Spring 2003 Systolic Architectures
Replace the single processor with an array of regular processing elements, and orchestrate data flow for high throughput with less memory access.
Different from pipelining:
–Nonlinear array structure, multidirectional data flow; each PE may have a (small) local instruction and data memory.
Different from SIMD: each PE may do something different.
Initial motivation: VLSI enables inexpensive special-purpose chips; algorithms are represented directly by chips connected in a regular pattern.
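The slides that follow trace a 3x3 systolic matrix multiplication cycle by cycle. As a companion, here is a minimal C sketch of that data movement — skewed operand feeding at the array boundary, rightward shifting of A values, downward shifting of B values, and a multiply-accumulate in every PE each cycle. The matrices and the 3x3 size are illustrative, and the code models the array in software rather than describing any particular hardware.

```c
/* Software sketch of a 3x3 systolic array computing C = A x B.
   Each cycle: shift A operands right, B operands down, feed skewed
   inputs at the boundary, then every PE multiply-accumulates. */
#include <stdio.h>

#define N 3

int main(void) {
    int A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};   /* illustrative data */
    int B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    int C[N][N] = {{0}};

    int a_reg[N][N] = {{0}};   /* A operand currently held by PE(i,j) */
    int b_reg[N][N] = {{0}};   /* B operand currently held by PE(i,j) */

    /* 3N - 2 cycles complete the product (here 7 cycles for N = 3) */
    for (int t = 0; t < 3 * N - 2; t++) {
        /* shift A operands one PE to the right, B operands one PE down
           (highest indices first so inputs are not overwritten) */
        for (int i = 0; i < N; i++)
            for (int j = N - 1; j > 0; j--)
                a_reg[i][j] = a_reg[i][j - 1];
        for (int j = 0; j < N; j++)
            for (int i = N - 1; i > 0; i--)
                b_reg[i][j] = b_reg[i - 1][j];

        /* skewed feeding: row i of A enters PE(i,0) delayed by i cycles,
           column j of B enters PE(0,j) delayed by j cycles */
        for (int i = 0; i < N; i++)
            a_reg[i][0] = (t >= i && t - i < N) ? A[i][t - i] : 0;
        for (int j = 0; j < N; j++)
            b_reg[0][j] = (t >= j && t - j < N) ? B[t - j][j] : 0;

        /* every PE accumulates one element of the product */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] += a_reg[i][j] * b_reg[i][j];
    }

    for (int i = 0; i < N; i++)
        printf("%4d %4d %4d\n", C[i][0], C[i][1], C[i][2]);
    return 0;
}
```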

EECC756 - Shaaban #45 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 b1,0 b0,1 b0,0 a0,2 a0,1 a0,0 a1,2 a1,1 a1,0 a2,2 a2,1 a2,0 Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product Rows of A Columns of B T = 0 Example source:

EECC756 - Shaaban #46 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 b1,0 b0,1 a0,2 a0,1 a1,2 a1,1 a1,0 a2,2 a2,1 a2,0 Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 1 b0,0 a0,0 a0,0*b0,0 Example source:

EECC756 - Shaaban #47 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication b2,2 b2,1 b1,2 b2,0 b1,1 b0,2 a0,2 a1,2 a1,1 a2,2 a2,1 a2,0 Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 2 b1,0 a0,1 a0,0*b0,0 + a0,1*b1,0 a1,0 a0,0 b0,1 b0,0 a0,0*b0,1 a1,0*b0,0 Example source:

EECC756 - Shaaban #48 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication b2,2 b2,1 b1,2 a1,2 a2,2 a2,1 Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 3 b2,0 a0,2 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a1,1 a0,1 b1,1 b1,0 a0,0*b0,1 + a0,1*b1,1 a1,0*b0,0 + a1,1*b1,0 a1,0 b0,1 a0,0 b0,0 b0,2 a2,0 a1,0*b0,1 a0,0*b0,2 a2,0*b0,0 Example source:

EECC756 - Shaaban #49 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication b2,2 a2,2 Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 4 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a1,2 a0,2 b2,1 b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,1 b1,1 a0,1 b1,0 b1,2 a2,1 a1,0*b0,1 +a1,1*b1,1 a0,0*b0,2 + a0,1*b1,2 a2,0*b0,0 + a2,1*b1,0 b0,1 a1,0 b0,2 a2,0 a2,0*b0,1 a1,0*b0,2 a2,2 Example source:

EECC756 - Shaaban #50 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 5 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,2 b2,1 a0,2 b2,0 b2,2 a2,2 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 b1,1 a1,1 b1,2 a2,1 a2,0*b0,1 + a2,1*b1,1 a1,0*b0,2 + a1,1*b1,2 b0,2 a2,0 a2,0*b0,2 Example source:

EECC756 - Shaaban #51 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 6 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 b2,1 a1,2 b2,2 a2,2 a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1 a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 b1,2 a2,1 a2,0*b0,2 + a2,1*b1,2 Example source:

EECC756 - Shaaban #52 lec # 1 Spring Systolic Array Example: 3x3 Systolic Array Matrix Multiplication Alignments in time Processors arranged in a 2-D grid Each processor accumulates one element of the product T = 7 a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 a0,0*b0,1 + a0,1*b1,1 + a0,2*b2,1 a1,0*b0,0 + a1,1*b1,0 + a1,2*a2,0 a1,0*b0,1 +a1,1*b1,1 + a1,2*b2,1 a0,0*b0,2 + a0,1*b1,2 + a0,2*b2,2 a2,0*b0,0 + a2,1*b1,0 + a2,2*b2,0 a2,0*b0,1 + a2,1*b1,1 + a2,2*b2,1 a1,0*b0,2 + a1,1*b1,2 + a1,2*b2,2 b2,2 a2,2 a2,0*b0,2 + a2,1*b1,2 + a2,2*b2,2 Done Example source: