CSE 160 – Lecture 2

Today’s Topics Flynn’s Taxonomy Bit-Serial, Vector, Pipelined Processors Interconnection Networks –Topologies –Routing –Embedding Network Bisection

Taxonomy Flynn (1966) classified machines by their instruction and data streams: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), Multiple Instruction Multiple Data (MIMD)

SIMD –All processors execute the same program in lockstep –Data that each processor sees is different –Single control processor –Individual processors can be turned on/off at each cycle –Illiac IV, CM-2, MasPar are some examples –Silicon Graphics Reality Graphics engine

MIMD All processors execute their own set of instructions Processors operate on separate data streams No centralized clock is implied Examples: SP-2, T3E, clusters, Crays, etc.

SPMD/MPMD Single/Multiple Program Multiple Data SPMD: processors run the same program, but the processors are not necessarily run in lockstep Very popular and scalable programming style MPMD is similar except that different processors run different programs –The PVM distribution has some simple examples

Processor Types Four types –Bit serial –Vector –Cache-based, pipelined –Custom (e.g., Tera MTA or KSR-1)

Bit Serial Only seen in SIMD machines like CM-2 or MasPar Each clock cycle, one bit of the data is loaded/written –Simplifies memory system and memory trace count Popular for very dense (64K) processor arrays

Cache-based, Pipelined Garden-variety microprocessor –Sparc, Intel x86, MC68xxx, MIPS, … –Register-based ALUs and FPUs –Registers are of scalar type Pipelined execution to improve performance of individual chips –Splits components of a basic operation like addition into stages –More stages allow a higher clock rate, but cause more problems with branching and data/control hazards Per-processor caches make it challenging to build SMPs (coherency issues) Now dominates the high-end market
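
The stage-count trade-off above can be seen with the classic pipeline timing model (a textbook approximation, not something stated on this slide): n independent operations through a k-stage pipeline take roughly k + n - 1 stage times instead of n·k, so the ideal speedup is n·k / (k + n - 1). A minimal C sketch:

```c
#include <stdio.h>

/* Classic pipeline timing model: n independent operations through a
 * k-stage pipeline finish in (k + n - 1) stage-times, versus n * k
 * stage-times unpipelined. The ideal speedup approaches k for large n. */
double pipeline_speedup(int k, long n)
{
    return (double)(n * k) / (double)(k + n - 1);
}

int main(void)
{
    printf("5-stage,  n=1000: speedup = %.2f\n", pipeline_speedup(5, 1000));
    printf("20-stage, n=1000: speedup = %.2f\n", pipeline_speedup(20, 1000));
    /* Deeper pipelines promise more speedup, but in practice stalls from
     * branches and data/control hazards eat into it. */
    return 0;
}
```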

Vector Processors Very specialized (e.g., $$$$$) machines Registers are true vectors with power-of-2 lengths Designed to efficiently perform matrix-style operations –Ax = b (b(I) = Σ_J A(I,J)*x(J)) –Vector registers V1, V2, V3: V1 = A(I,*), V2 = x(*), MULV V3, V1, V2 “Chaining” to efficiently handle larger vectors than the size of vector registers Cray, Hitachi, SGI (now Cray SV-1) are examples
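
As a rough illustration of the b(I) = Σ_J A(I,J)*x(J) operation, here is a minimal C sketch that processes the inner loop in fixed-length chunks, the way a vector unit works on register-sized pieces of a long vector; VLEN is an illustrative stand-in for the machine's vector register length, not a parameter of any real machine:

```c
#include <stddef.h>

#define VLEN 64  /* illustrative stand-in for the hardware vector register length */

/* b = A * x, where A is n x n in row-major order.
 * The inner loop is processed VLEN elements at a time, the way a vector
 * unit would load pieces of A(I,*) and x(*) into vector registers,
 * multiply them elementwise, and accumulate the partial sums. */
void matvec(size_t n, const double *A, const double *x, double *b)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j0 = 0; j0 < n; j0 += VLEN) {
            size_t len = (n - j0 < VLEN) ? (n - j0) : VLEN;  /* last partial chunk */
            for (size_t j = 0; j < len; j++)
                sum += A[i * n + j0 + j] * x[j0 + j];
        }
        b[i] = sum;
    }
}
```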

Some Custom Processors Denelcor HEP/Tera MTA –Multiple register sets Stack Pointer, Instruction Pointer, Frame Pointer, etc. Facilitates hardware threads Switch each clock cycle to different register set –Why? Stalls to memory subsystem in one thread can be hidden by concurrency KSR-1 –Cache-only memory processor –Basically 2 generations behind standard micros

Going Parallel Late 70’s, even vector “monsters” started to go parallel For parallel processing to work, individual processors must synchronize –SIMD – synchronize every clock cycle –MIMD – explicit synchronization Message passing Semaphores, monitors, fetch-and-increment –Focus on interconnection networks for the rest of the lecture

Characterizing Networks Bandwidth Device/switch latency Switching types –Circuit switched (e.g., the telephone network) –Packet switched (e.g., the Internet) Store and forward Virtual cut-through Wormhole routed Topology –Number of connections –Diameter (how many hops through switches)

Latency Latency is the time from when a command is issued until its effect is first seen –Push on the gas pedal before the car goes forward –Time from when you enter a line until the cashier starts on your job –First bit leaves computer A, first bit arrives at computer B OR –(Message latency) First bit leaves computer A, last bit arrives at computer B Startup latency is the amount of time to send a zero-length message

Bandwidth Bits/second that can travel through a connection A really simple model for calculating the time to send a message of N bytes –Time = latency + N/bandwidth (with N and the bandwidth in consistent units) Bisection is the minimum number of wires that must be cut to divide a network of machines into two equal halves. Bisection bandwidth is the total bandwidth through the bisection
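
A minimal C sketch of that cost model; the latency and bandwidth values in main are made-up illustrative numbers, not measurements of any particular network:

```c
#include <stdio.h>

/* Simple message-cost model from the slide:
 * time = startup latency + message size / bandwidth.
 * Units must be consistent; here seconds, bytes, and bytes/second. */
double message_time(double latency_s, double bytes, double bandwidth_Bps)
{
    return latency_s + bytes / bandwidth_Bps;
}

int main(void)
{
    /* Illustrative values only: 10 us startup latency, 100 MB/s link. */
    double lat = 10e-6, bw = 100e6;
    printf("1 KB message: %.2f us\n", 1e6 * message_time(lat, 1024, bw));
    printf("1 MB message: %.2f us\n", 1e6 * message_time(lat, 1 << 20, bw));
    return 0;
}
```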

Interconnection Topologies Completely connected –Every node has a direct wire connection to every other node: (N x (N-1))/2 wires, clearly impractical

Line/Ring Simple interconnection First topology where routing is an issue Routing is needed when no direct connection exists between nodes: to go from node 2 to node 4, a message has to pass through node 3 What happens if node 2 wants to communicate with node 3 at the same time node 1 wants to communicate with node 4? What is the bisection of a line/ring? If the links have bandwidth B, what is the bisection bandwidth? What is the aggregate bandwidth of the network?

Mesh/Torus Generalization of line/ring to multiple dimensions More routes between nodes What is the bisection of this network?

Hop Count Networks are measured by diameter –This is the minimum number of hops a message must traverse between the two nodes that are furthest apart –Line: Diameter = N-1 –2D (NxM) Mesh: Diameter = N+M-2
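
The diameter formulas above as a tiny C sketch; the ring case is added for comparison and is not on the slide:

```c
#include <stdio.h>

/* Diameter (worst-case hop count) formulas from the slide; the ring case
 * is added here for comparison. */
int line_diameter(int n)          { return n - 1; }
int ring_diameter(int n)          { return n / 2; }       /* floor(N/2) */
int mesh2d_diameter(int n, int m) { return n + m - 2; }   /* N x M mesh */

int main(void)
{
    printf("line of 16: %d hops\n", line_diameter(16));     /* 15 */
    printf("ring of 16: %d hops\n", ring_diameter(16));     /* 8  */
    printf("4 x 4 mesh: %d hops\n", mesh2d_diameter(4, 4)); /* 6  */
    return 0;
}
```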

Tree-based Networks Nodes organized in a tree fashion (important for some global algorithms) Diameter of this network? Bisection, Bisection Bandwidth?

Hypercubes (figures: 1D, 2D, 3D, and 4D hypercubes)

Hypercubes 2 A dimension-N hypercube is constructed by connecting the corresponding “corners” of two dimension-(N-1) hypercubes Relatively low wire count to build large networks Multiple routes between any pair of nodes Exercise for the reader: what is the diameter of a K-dimensional hypercube?

Labeling/Routing in a Hypercube Nodes are labeled in Gray code –Connected neighbors have binary node numbers that differ by exactly one bit (figure: 3D cube)
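
A minimal C sketch of that labeling property: two nodes are directly connected exactly when their labels differ in one bit, so a node's neighbors are obtained by flipping each bit of its label in turn (the popcount helper is written by hand to keep the sketch self-contained):

```c
#include <stdio.h>

/* Count set bits (Hamming weight). */
static int popcount(unsigned x)
{
    int c = 0;
    for (; x != 0; x &= x - 1)  /* clear the lowest set bit each iteration */
        c++;
    return c;
}

/* Two hypercube nodes are neighbors iff their labels differ in exactly one bit. */
static int is_neighbor(unsigned a, unsigned b)
{
    return popcount(a ^ b) == 1;
}

int main(void)
{
    int dim = 3;  /* 3D cube, nodes 0..7 */
    for (unsigned node = 0; node < (1u << dim); node++) {
        printf("node %u neighbors:", node);
        for (int bit = 0; bit < dim; bit++)
            printf(" %u", node ^ (1u << bit));  /* flip one bit per link */
        printf("\n");
    }
    printf("is_neighbor(5, 7) = %d\n", is_neighbor(5, 7));  /* 101 vs 111 -> 1 */
    return 0;
}
```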

The e-cube routing algorithm Source address S = S0 S1 S2 … Sn Destination address D = D0 D1 D2 … Dn Let R = R0 R1 R2 … Rn = S ⊕ D The number of one bits in R indicates the distance between S and D Starting at S, go to the neighbor obtained by flipping the first bit position j where Rj = 1 (if Sj = 0, go to the neighbor where that bit is 1, and vice versa) Continue routing from this intermediate node: find the next bit position k > j where Rk = 1 and go to that neighbor, until D is reached

E-cube routing example 8-dimensional hypercube (256 nodes) S = 134 = 0x86 = 1000 0110 D = 215 = 0xD7 = 1101 0111 S ⊕ D = 0x51 = 0101 0001 –Distance = 3 Route: S (134) → 198 → 214 → 215
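
A minimal C sketch of e-cube routing as described on the previous slides, correcting the differing bits in a fixed order (most significant first here, which reproduces the 134 → 198 → 214 → 215 route above); this illustrates the algorithm, not the code of any particular router:

```c
#include <stdio.h>

/* E-cube (dimension-ordered) routing in a 'dim'-dimensional hypercube:
 * XOR the current address with the destination and correct the differing
 * bits in a fixed order (most significant first here). Deterministic, and
 * deadlock-free because dimensions are always crossed in the same order. */
static void ecube_route(unsigned src, unsigned dst, int dim)
{
    unsigned cur = src;
    printf("%u", cur);
    for (int bit = dim - 1; bit >= 0; bit--) {
        if ((cur ^ dst) & (1u << bit)) {
            cur ^= 1u << bit;        /* hop across dimension 'bit' */
            printf(" -> %u", cur);
        }
    }
    printf("\n");
}

int main(void)
{
    ecube_route(134, 215, 8);  /* prints 134 -> 198 -> 214 -> 215 */
    return 0;
}
```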

Embedding A network is embeddable in a target network if its nodes and links can be mapped to the target network A mesh is embeddable in a hypercube –There is a mapping of mesh nodes and links to hypercube nodes and links The dilation of an embedding is how many links of the target network are needed to represent one link of the embedded network –Perfect embeddings have dilation 1 Embedding a tree into a mesh has a dilation of 2 (see example in book)
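
One common way to get a dilation-1 embedding of a mesh into a hypercube is the Gray-code construction sketched below (this is the standard textbook construction, not necessarily the exact one in the course text): map node (x, y) of a 2^r x 2^c mesh to the (r+c)-bit hypercube label built from gray(x) and gray(y).

```c
#include <stdio.h>

/* Binary-reflected Gray code: consecutive integers map to labels that
 * differ in exactly one bit. */
static unsigned gray(unsigned i) { return i ^ (i >> 1); }

/* Dilation-1 embedding of a (2^r) x (2^c) mesh into an (r+c)-dimensional
 * hypercube: node (x, y) maps to gray(x) in the high r bits and gray(y)
 * in the low c bits. Mesh neighbors differ by 1 in one coordinate, so
 * their images differ in exactly one bit, i.e. they are hypercube neighbors. */
static unsigned mesh_to_cube(unsigned x, unsigned y, int c)
{
    return (gray(x) << c) | gray(y);
}

int main(void)
{
    int r = 2, c = 2;  /* embed a 4 x 4 mesh into a 4-cube */
    for (unsigned x = 0; x < (1u << r); x++) {
        for (unsigned y = 0; y < (1u << c); y++)
            printf("%2u ", mesh_to_cube(x, y, c));
        printf("\n");
    }
    return 0;
}
```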

Modern Parallel Machines are Packet Switched Break the message into smaller blocks (packets) and send these pieces through the network Network intermediate points (routers) can be store-and-forward or virtual cut-through –Store-and-forward buffers the entire packet at each switch before forwarding it, and the packet may also wait behind packets ahead of it on the outgoing port (congestion) –Virtual cut-through eliminates the mandatory per-hop buffering of store-and-forward by “cutting through” the switch when the output port is free; the packet is buffered only when the output is busy

Wormhole Routing Wormhole routing is a variation of virtual cut through –Small headers (flow control digits == Flits) pass through the network. –When a flit is allowed to cut through a switch, the original sender is guaranteed a clear path through that switch. –A tail flit closes the “connection” Wormhole was defined by Seitz and is used in Myrinet, a very popular cluster interconnect.

Latency of Circuit Switched and Virtual Cut Through Circuit switch latency –(Lc/B)·l + (L/B), where Lc = length of the control packet, B = bandwidth, l = number of links, L = length of the packet Virtual cut-through latency –(Lh/B)·l + (L/B), where Lh = length of the header packet

Store-and-Forward and Wormhole Routing Latency –Wormhole routing latency: (Lf/B)·l + (L/B), where Lf = length of a flit –Store-and-forward latency: (L/B)·l –Store-and-forward latency can be much worse over many hops –Virtual cut-through, wormhole, and circuit switching all approach (L/B) as the message length increases
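
The four latency models side by side as a minimal C sketch; the message size, header/flit size, bandwidth, and hop count in main are made-up illustrative numbers:

```c
#include <stdio.h>

/* Latency models from the slides. B = bandwidth, l = number of links,
 * L = message length, Lc/Lh/Lf = control packet / header / flit length. */
double circuit_switched(double Lc, double L, double B, int l) { return (Lc / B) * l + L / B; }
double cut_through(double Lh, double L, double B, int l)      { return (Lh / B) * l + L / B; }
double wormhole(double Lf, double L, double B, int l)         { return (Lf / B) * l + L / B; }
double store_and_forward(double L, double B, int l)           { return (L / B) * l; }

int main(void)
{
    /* Illustrative numbers only: 1 KB message, 8-byte header/flit/control
     * packet, 100 MB/s links, 5 hops. */
    double L = 1024, Lh = 8, B = 100e6;
    int l = 5;
    printf("store-and-forward: %.2f us\n", 1e6 * store_and_forward(L, B, l));
    printf("cut-through:       %.2f us\n", 1e6 * cut_through(Lh, L, B, l));
    printf("wormhole:          %.2f us\n", 1e6 * wormhole(Lh, L, B, l));
    printf("circuit switched:  %.2f us\n", 1e6 * circuit_switched(Lh, L, B, l));
    return 0;
}
```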

Deadlock/Livelock Livelock/deadlock is a potential problem in any network design Livelock occurs in adaptive routing algorithms when a packet keeps being routed but never reaches its destination Deadlock occurs when packets cannot be forwarded because they are waiting for other packets to move out of the way, and each blocking packet is itself waiting on a blocked packet

Next Time … All about clusters Introduction to PVM (and MPI)