SIMD and Associative Computational Models

Prepared 7/28/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.

SIMD and Associative Computational Models Parallel & Distributed Algorithms

SIMD and Associative Computational Models Part I: SIMD Model

Flynn’s Taxonomy The best-known classification scheme for parallel computers. Classifies computers by the parallelism they exhibit in their instruction streams and data streams. A sequence of instructions (the instruction stream) manipulates a sequence of operands (the data stream). The instruction stream (I) and the data stream (D) can each be either single (S) or multiple (M), giving four combinations: SISD, SIMD, MISD, MIMD.

Flynn’s Taxonomy (cont.) SISD Single Instruction Stream, Single Data Stream Its most important member is the sequential computer; some argue other models are included as well. SIMD Single Instruction Stream, Multiple Data Streams One of the two most important classes in Flynn’s taxonomy. MISD Multiple Instruction Streams, Single Data Stream Relatively unused terminology; some argue that this includes pipeline computing. MIMD Multiple Instruction Streams, Multiple Data Streams The other most important class in Flynn’s taxonomy.

The SIMD Computer & Model Consists of two types of processors: A front end or control unit Stores a copy of the program Has a program control unit to execute the program Broadcasts the parallel program instructions to the array of processors. An array of simple processors, each functionally more like an ALU Does not store a copy of the program nor have a program control unit Executes, in parallel, the commands sent by the front end.
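
To make the division of labor concrete, here is a minimal C sketch (not from the original slides; the names NUM_PES, pe_a, pe_b and the two opcodes are invented for illustration) that simulates a front end broadcasting one instruction to an array of PEs, each of which applies it to its own local data:

#include <stdio.h>

#define NUM_PES 8                      /* size of the processor array (made up) */

typedef enum { OP_ADD, OP_MUL } Opcode;   /* instructions the front end can broadcast */

static int pe_a[NUM_PES];              /* PE i owns pe_a[i] and pe_b[i]:        */
static int pe_b[NUM_PES];              /* its private local memory              */

/* The front end broadcasts ONE instruction; every PE applies it, in lockstep, */
/* to the same address in its own local memory.                                */
static void broadcast(Opcode op)
{
    for (int pe = 0; pe < NUM_PES; pe++) {     /* conceptually simultaneous    */
        if (op == OP_ADD) pe_a[pe] = pe_a[pe] + pe_b[pe];
        else              pe_a[pe] = pe_a[pe] * pe_b[pe];
    }
}

int main(void)
{
    for (int pe = 0; pe < NUM_PES; pe++) { pe_a[pe] = pe; pe_b[pe] = 10; }
    broadcast(OP_ADD);                         /* one instruction, many data    */
    for (int pe = 0; pe < NUM_PES; pe++)
        printf("PE %d holds %d\n", pe, pe_a[pe]);
    return 0;
}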

SIMD (cont.) On a memory access, all active processors must access the same location in their local memories. All active processors execute the same instruction synchronously, but on different data. The sequence of different data items is often referred to as a vector.

Alternate Names for SIMDs Recall that all active processors of a SIMD computer must simultaneously access the same memory location. The value in the i-th processor can be viewed as the i-th component of a vector. SIMD machines are sometimes called vector computers [Jordan, et al.] or processor arrays [Quinn 94, 04] based on their ability to execute vector and matrix operations efficiently.

Alternate Names (cont.) In particular, the Quinn textbook for this course calls a SIMD a processor array. Quinn and a few others also consider a pipelined vector processor to be a SIMD. This is a somewhat non-standard use of the term; an example is the Cray-1.

How to View a SIMD Machine Think of soldiers all in a unit. A commander selects certain soldiers as active, for example, every even-numbered row. The commander barks out an order, and all the active soldiers execute it synchronously.

SIMD Computers SIMD computers that focus on vector operations Support some vector and possibly matrix operations in hardware Usually limit or provide less support for non-vector operations involving the data in the “vector components”. General-purpose SIMD computers Support more traditional operations (i.e., other than those for vector/matrix data types) Usually also provide some vector and possibly matrix operations in hardware.

Possible Architecture for a Generic SIMD

Interconnection Networks for SIMDs No specific interconnection network is specified. The 2D mesh has been used more frequently than others. Even hybrid networks (e.g., cube-connected cycles) have been used.

Example of a 2-D Processor Interconnection Network in a SIMD Each VLSI chip has 16 processing elements. Each PE can simultaneously send a value to a specific neighbor (e.g., its left neighbor). PE = processing element

SIMD Execution Style The traditional (SIMD, vector, processor array) execution style ([Quinn 94, pg 62], [Quinn 2004, pgs 37-43]): The sequential processor that broadcasts the commands to the rest of the processors is called the front end or control unit. The front end is a general-purpose CPU that stores the program and the data that is not manipulated in parallel. The front end normally executes the sequential portions of the program. Each processing element has a local memory that cannot be directly accessed by the host or the other processing elements.

SIMD Execution Style Collectively, the individual memories of the processing elements (PEs) store the (vector) data that is processed in parallel. When the front end encounters an instruction whose operand is a vector, it issues a command to the PEs to perform the instruction in parallel. Although the PEs execute in parallel, some units can be allowed to skip any particular instruction.

Masking on Processor Arrays All the processors work in lockstep except those that are masked out (by setting a mask register). The parallel if-then-else is frequently used in SIMDs to set masks: Every active processor tests to see if its data meets the negation of the boolean condition. If it does, it sets its mask bit so that it will not participate in the operation initially. Next, the unmasked processors execute the THEN part. Afterwards, the mask bits (for the original set of active processors) are flipped, and the newly unmasked processors perform the ELSE part. Note: this differs from the sequential version of “if”.
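
A minimal C sketch of this masked if-then-else (not from the original slides; the array contents and the even/odd condition are invented for illustration):

#define N_PES 8

static int a[N_PES] = { 3, 8, 5, 12, 7, 6, 9, 4 };   /* one datum per PE (made up)    */
static int mask[N_PES];                              /* 1 = PE active, 0 = masked out */

void parallel_if_then_else(void)
{
    /* Each active PE tests the condition on its own datum and sets its mask bit.     */
    for (int pe = 0; pe < N_PES; pe++)
        mask[pe] = (a[pe] % 2 == 0);

    /* The THEN part is broadcast to all PEs, but only unmasked PEs store a result.   */
    for (int pe = 0; pe < N_PES; pe++)
        if (mask[pe]) a[pe] = a[pe] / 2;

    /* Flip every mask bit; the remaining PEs now execute the ELSE part.              */
    for (int pe = 0; pe < N_PES; pe++)
        mask[pe] = !mask[pe];
    for (int pe = 0; pe < N_PES; pe++)
        if (mask[pe]) a[pe] = 3 * a[pe] + 1;
    /* Unlike a sequential if, the array steps through BOTH branches in time.         */
}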

if (COND) then A else B (sequence of figures stepping through the masked execution)

Data Parallelism (A strength for SIMDs) All tasks (or processors) apply the same set of operations to different data. Example: for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor Accomplished on SIMDs by having all active processors execute the operations synchronously. MIMDs can also handle data-parallel execution, but must synchronize more frequently.
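
On today's SIMD extensions to commodity CPUs (see the later slide on MMX and the Velocity Engine), this loop might be written with x86 SSE intrinsics. The sketch below is illustrative only, assuming an SSE-capable x86 compiler; each intrinsic issues one instruction that operates on four floats at once:

#include <xmmintrin.h>     /* x86 SSE intrinsics: 128-bit registers holding 4 floats */

#define N 100
float a[N], b[N], c[N];

void vector_add(void)
{
    int i;
    /* Each iteration issues ONE add instruction that operates on FOUR data items.   */
    for (i = 0; i + 4 <= N; i += 4) {
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_loadu_ps(&c[i]);
        _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
    }
    for (; i < N; i++)      /* scalar clean-up for any leftover elements (none here) */
        a[i] = b[i] + c[i];
}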

Functional/Control/Job Parallelism (A Strictly-MIMD Paradigm) Independent tasks apply different operations to different data elements. Example: a ← 2 b ← 3 m ← (a + b) / 2 s ← (a² + b²) / 2 v ← s - m² The first and second statements execute concurrently; the third and fourth statements execute concurrently.
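
For contrast, a sketch of how the two independent statements above could be run concurrently on a MIMD machine using OpenMP sections (illustrative only; compile with an OpenMP-enabled compiler, e.g. with -fopenmp):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double a = 2.0, b = 3.0;              /* first and second statements              */
    double m = 0.0, s = 0.0, v;

    /* The third and fourth statements are independent, so separate threads           */
    /* (separate instruction streams) may execute them at the same time.              */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;
        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }

    v = s - m * m;                        /* depends on both results, so it runs last */
    printf("m = %g  s = %g  v = %g\n", m, s, v);
    return 0;
}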

SIMD Machines An early SIMD computer designed for vector and matrix processing was the Illiac IV, built at the University of Illinois (see Jordan et al., pg 7). The MPP, DAP, the Connection Machines CM-1 and CM-2, and the MasPar MP-1 and MP-2 are examples of SIMD computers (see Akl, pgs 8-12, and [Quinn, 94]).

SIMD Machines Quinn [1994, pgs 63-67] discusses the CM-2 Connection Machine and the smaller, updated CM-200. Professor Batcher was the chief architect for the STARAN and the MPP (Massively Parallel Processor) and an advisor for the ASPRO. The ASPRO is a small, second-generation STARAN used by the Navy in spy planes. Professor Batcher is best known architecturally for the MPP, which belongs to the Smithsonian Institution and is currently displayed at a D.C. airport.

Today’s SIMDs Many SIMDs are being embedded in SISD machines. Others are being built as parts of hybrid architectures. Still others are being built as special-purpose machines, although some of them could qualify as general purpose. Much of the recent work with SIMD architectures is proprietary.

A Company Building Inexpensive SIMDs WorldScape is producing a COTS (commodity off the shelf) SIMD. It is not a traditional SIMD, as the PEs are full-fledged CPUs and the hardware doesn’t synchronize every step; however, the hardware design supports efficient synchronization, and their machine is programmed like a SIMD. The U.S. Navy has observed that their machines process radar data an order of magnitude faster than other machines. There is quite a bit of information about their work at http://www.wscape.com

An Example of a Hybrid SIMD Embedded Massively Parallel Accelerators Systola 1024: PC add-on board with 1024 processors Fuzion 150: 1536 processors on a single chip Other accelerators: Decypher, Biocellerator, GeneMatcher2, Kestrel, SAMBA, P-NAC, Splash-2, BioScan (This and the next three slides are due to Prabhakar R. Gudla (U. of Maryland), from a CMSC 838T presentation, 4/23/2003.)

Hybrid Architecture A high-speed Myrinet switch connecting Systola 1024 boards combines the SIMD and MIMD paradigms within one parallel architecture: a hybrid computer.

Architecture of Systola 1024 Instruction Systolic Array: a 32 × 32 mesh of processing elements with wavefront instruction execution. (Figure: interface processors, ISA, RAM NORTH, RAM WEST, controller, program memory, and the host computer bus.)

SIMDs Embedded in SISDs Intel’s Pentium 4 includes what Intel calls MMX technology to gain a significant performance boost. IBM and Motorola incorporated the same idea into their G4 PowerPC chip in what they call the Velocity Engine. Both MMX technology and the Velocity Engine are the chip manufacturers’ names for their proprietary SIMD processors and the parallel extensions to their instruction sets. This same approach is used by NVIDIA and Evans & Sutherland to dramatically accelerate graphics rendering.

Special-Purpose SIMDs in the Bioinformatics Arena Paracel Acquired by Celera Genomics in 2000. Products include the sequence supercomputer GeneMatcher, which has a high-throughput sequence-analysis capability and supported over a million processors early on. GeneMatcher was used by Celera in its race with the U.S. government to complete the sequencing of the human genome. TimeLogic, Inc. Offers DeCypher, a reconfigurable SIMD.

Advantages of SIMDs Reference: [Roosta, pg 10] Less hardware than MIMDs, as they have only one control unit, and control units are complex. Less memory needed than MIMDs: only one copy of the instructions needs to be stored, which allows more data to be stored in memory. Less startup time in communicating between PEs.

Advantages of SIMDs The single instruction stream and implicit synchronization of the PEs make SIMD applications easier to program, understand, and debug; programming is similar to sequential programming. Control-flow operations and scalar operations can be executed on the control unit while the PEs are executing other instructions. MIMD architectures require explicit synchronization primitives, which create a substantial amount of additional overhead.

Advantages of SIMDs During a communication operation between PEs, the PEs send data to a neighboring PE in parallel and in lockstep. No need to create a header with routing information, as the “routing” is determined by the program steps. The entire communication operation is executed synchronously, so a tight (worst-case) upper bound on the time for this operation can be computed. Less complex hardware in a SIMD, since no message decoder is needed in the PEs; MIMDs need a message decoder in each PE.
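
A toy C sketch of such a lockstep transfer, modeling every PE sending its value to its left neighbor in a single step (the ring wrap-around and the name shift_left are my own choices, not from the slides):

#define N_PES 8

/* Every PE sends its value one position to the left in the SAME step. No      */
/* routing header is needed: the direction is fixed by the program, so the     */
/* whole operation takes one communication step (a tight worst-case bound).    */
void shift_left(int val[N_PES])
{
    int incoming[N_PES];                          /* value arriving at each PE */
    for (int pe = 0; pe < N_PES; pe++)            /* conceptually simultaneous */
        incoming[(pe + N_PES - 1) % N_PES] = val[pe];   /* ring wrap-around    */
    for (int pe = 0; pe < N_PES; pe++)
        val[pe] = incoming[pe];
}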

SIMD Shortcomings (with some rebuttals) The claims are from our textbook by Quinn. Similar statements are found in one of our primary reference books, by Grama, et al. [13]. Claim 1: Not all problems are data-parallel. While true, most problems seem to have data-parallel solutions. In [Fox, et al.], the observation was made in a study of large parallel applications that most were data-parallel by nature, but often had points where significant branching occurred.

SIMD Shortcomings (with some rebuttals) Claim 2: Speed drops for conditionally executed branches. Processors in both MIMDs and SIMDs normally have to do a significant amount of “condition” testing. MIMD processors can execute multiple branches concurrently; with SIMDs, only one of these branches can be executed at a time. For an if-then-else statement with execution times for the “then” and “else” parts roughly equal, about ½ of the SIMD processors are idle during its execution. With additional branching, the average number of inactive processors can become even higher. This reason justifies the study of multiple SIMDs (or MSIMDs).

SIMD Shortcomings (with some rebuttals) Claim 2 (cont.): Speed drops for conditionally executed code. In [Fox, et al.], the observation was made that for the real applications surveyed, the MAXIMUM number of active branches at any point in time was about 8. The cost of the extremely simple processors used in a SIMD is extremely low. Programmers used to worry about “full utilization of memory” but stopped after memory cost became insignificant overall.

SIMD Shortcomings (with some rebuttals) Claim 3: SIMDs don’t adapt to multiple users well. This is true to some degree for all parallel computers. If usage of a parallel processor is dedicated to an important problem, it is probably best not to risk compromising its performance by “sharing”. This reason also justifies the study of multiple SIMDs (or MSIMDs). The SIMD architecture has not received the attention that MIMD has received and could greatly benefit from further research.

SIMD Shortcomings (with some rebuttals) Claim 4: SIMDs do not scale down well to “starter” systems that are affordable. This point is arguable, and its “truth” is likely to vary rapidly over time. WorldScape/ClearSpeed currently sells a very economical SIMD board that plugs into a PC.

SIMD Shortcomings (with some rebuttals) Claim 5: SIMDs require customized VLSI for their processors, while the expense of MIMD control units has dropped. Reliance on COTS (commodity, off-the-shelf) parts has dropped the price of MIMDs; the expense of PCs (with control units) has dropped significantly. However, reliance on COTS has fueled the success of the “low-level parallelism” provided by clusters and has restricted new, innovative parallel-architecture research for well over a decade.

SIMD Shortcomings (with some rebuttals) Claim 5 (cont.): There is strong evidence that the period of continual, dramatic increases in the speed of PCs and clusters is ending. Continued rapid increases in parallel performance will be necessary in the future in order to solve important problems that are beyond our current capabilities. Additionally, with the appearance of very economical COTS SIMDs, this claim no longer appears to be relevant.