Scalability CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.


3/3/99, CS258 S99, Slide 2: Recap: Gigaplane Bus Timing

Slide 3: Enterprise Processor and Memory System
2 procs per board, external L2 caches, 2 mem banks with x-bar
Data lines buffered through UDB to drive internal 1.3 GB/s UPA bus
Wide path to memory so full 64-byte line in 1 mem cycle (2 bus cycles)
Addr controller adapts proc and bus protocols, does cache coherence
  – its tags keep a subset of states needed by bus (e.g. no M/E distinction)

Slide 4: Enterprise I/O System
I/O board has same bus interface ASICs as processor boards
But internal bus half as wide, and no memory path
Only cache-block-sized transactions, like processor boards
  – Uniformity simplifies design
  – ASICs implement single-block cache, follow coherence protocol
Two independent 64-bit, 25 MHz SBuses
  – One for two dedicated FiberChannel modules connected to disk
  – One for Ethernet and fast wide SCSI
  – Can also support three SBus interface cards for arbitrary peripherals
Performance and cost of I/O scale with no. of I/O boards

Slide 5: Limited Scaling of a Bus
Bus: each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components

Characteristic              Bus
Physical Length             ~ 1 ft
Number of Connections       fixed
Maximum Bandwidth           fixed
Interface to Comm. medium   memory interface
Global Order                arbitration
Protection                  Virt -> physical
Trust                       total
OS                          single
Comm. abstraction           HW

Slide 6: Workstations in a LAN?
No clear limit to physical scaling, little trust, no global order, consensus difficult to achieve.
Independent failure and restart

Characteristic              Bus                LAN
Physical Length             ~ 1 ft             km
Number of Connections       fixed              many
Maximum Bandwidth           fixed              ???
Interface to Comm. medium   memory interface   peripheral
Global Order                arbitration        ???
Protection                  Virt -> physical   OS
Trust                       total              none
OS                          single             independent
Comm. abstraction           HW                 SW

Slide 7: Scalable Machines
What are the design trade-offs for the spectrum of machines in between?
  – specialized or commodity nodes?
  – capability of node-to-network interface
  – supporting programming models?
What does scalability mean?
  – avoids inherent design limits on resources
  – bandwidth increases with P
  – latency does not
  – cost increases slowly with P

Slide 8: Bandwidth Scalability
What fundamentally limits bandwidth?
  – single set of wires
Must have many independent wires
Connect modules through switches
Bus vs Network Switch?

Slide 9: Dancehall MP Organization
Network bandwidth?
Bandwidth demand?
  – independent processes?
  – communicating processes?
Latency?

Slide 10: Generic Distributed Memory Org.
Network bandwidth?
Bandwidth demand?
  – independent processes?
  – communicating processes?
Latency?

Slide 11: Key Property
Large number of independent communication paths between nodes
=> allow a large number of concurrent transactions using different wires
initiated independently
no global arbitration
effect of a transaction only visible to the nodes involved
  – effects propagated through additional transactions

Slide 12: Latency Scaling
T(n) = Overhead + Channel Time + Routing Delay
Overhead?
Channel Time(n) = n/B, where B is the bandwidth at the bottleneck
RoutingDelay(h,n)

Slide 13: Typical example
max distance: log n
number of switches: proportional to n log n
overhead = 1 us, BW = 64 MB/s, 200 ns per hop
Pipelined
  T_64(128) = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop = 4.2 us
  T_1024(128) = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop = 5.0 us
Store and Forward
  T_64^sf(128) = 1.0 us + 6 hops * (2.0 + 0.2) us/hop = 14.2 us
  T_1024^sf(128) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop = 23 us
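The totals on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are made up for this note, and it assumes a 128-byte message, a channel time of n/B (128 B at 64 MB/s is 2.0 us), and a maximum distance of log2(nodes) hops.

```python
import math

OVERHEAD_US = 1.0      # fixed send/receive overhead from the slide
BW_MB_PER_S = 64.0     # bottleneck bandwidth; 1 MB/s moves 1 byte per microsecond
HOP_DELAY_US = 0.2     # 200 ns routing delay per hop


def max_hops(nodes: int) -> int:
    """Maximum distance, assumed to be log2 of the node count."""
    return round(math.log2(nodes))


def channel_time_us(n_bytes: int) -> float:
    """Channel Time(n) = n / B."""
    return n_bytes / BW_MB_PER_S


def pipelined_us(n_bytes: int, nodes: int) -> float:
    """Pipelined routing: the message occupies the channel once."""
    return OVERHEAD_US + channel_time_us(n_bytes) + max_hops(nodes) * HOP_DELAY_US


def store_and_forward_us(n_bytes: int, nodes: int) -> float:
    """Store and forward: the full channel time is paid again at every hop."""
    return OVERHEAD_US + max_hops(nodes) * (channel_time_us(n_bytes) + HOP_DELAY_US)


if __name__ == "__main__":
    for nodes in (64, 1024):
        print(f"n={nodes}: pipelined {pipelined_us(128, nodes):.1f} us, "
              f"store-and-forward {store_and_forward_us(128, nodes):.1f} us")
```

Going from 64 to 1024 nodes adds only four hops, so the pipelined latency grows by 0.8 us while the store-and-forward latency grows by 8.8 us.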

Slide 14: Cost Scaling
cost(p,m) = fixed cost + incremental cost(p,m)
Bus Based SMP?
Ratio of processors : memory : network : I/O ?
Parallel efficiency(p) = Speedup(p) / p
Costup(p) = Cost(p) / Cost(1)
Cost-effective: Speedup(p) > Costup(p)
Is super-linear speedup required?

Slide 15: Cost Effective?
2048 processors: 475-fold speedup at 206x cost
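As a rough check of the cost-effectiveness criterion, the sketch below plugs the 2048-processor figures into the definitions from the cost scaling slide. The function names are illustrative, not from the lecture.

```python
def parallel_efficiency(speedup: float, p: int) -> float:
    """Parallel efficiency(p) = Speedup(p) / p."""
    return speedup / p


def is_cost_effective(speedup: float, costup: float) -> bool:
    """Cost-effective when Speedup(p) > Costup(p)."""
    return speedup > costup


if __name__ == "__main__":
    p, speedup, costup = 2048, 475.0, 206.0   # figures quoted on the slide
    print(f"efficiency: {parallel_efficiency(speedup, p):.1%}")
    print(f"cost-effective: {is_cost_effective(speedup, costup)}")
```

Efficiency is only about 23%, yet the machine is cost-effective because the speedup exceeds the costup. Under this criterion linear speedup is not needed.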

Slide 16: Physical Scaling
Chip-level integration
Board-level
System level

Slide 17: nCUBE/2 Machine Organization
Entire machine synchronous at 40 MHz
[Figure: single-chip node (DRAM interface, DMA channels, router, MMU, I-fetch & decode, operand $, execution unit with 64-bit integer and IEEE floating point); basic module; hypercube network configuration, 1024 nodes]

Slide 18: CM-5 Machine Organization

Slide 19: System Level Integration

Slide 20: Realizing Programming Models
[Figure: layers from top to bottom]
Parallel applications: CAD, Database, Scientific modeling
Programming models: Multiprogramming, Shared address, Message passing, Data parallel
Communication abstraction (user/system boundary)
Compilation or library
Operating systems support
Communication hardware (hardware/software boundary)
Physical communication medium

Slide 21: Network Transaction Primitive
one-way transfer of information from a source output buffer to a dest. input buffer
  – causes some action at the destination
  – occurrence is not directly visible at source
deposit data, state change, reply
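The primitive can be modelled in a few lines. This is a toy sketch under stated assumptions: the `Node` class, the `deliver` function, and the "deposit"/"ack" message kinds are all invented for illustration, and the network is reduced to moving messages between queues.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    output_buffer: deque = field(default_factory=deque)
    input_buffer: deque = field(default_factory=deque)
    memory: dict = field(default_factory=dict)
    acked: set = field(default_factory=set)

    def send(self, dest_id, kind, payload):
        """Queue a one-way transaction; the source learns nothing about its fate."""
        self.output_buffer.append((dest_id, self.node_id, kind, payload))

    def handle_one(self):
        """Perform the action that one incoming transaction causes at this node."""
        _, src, kind, payload = self.input_buffer.popleft()
        if kind == "deposit":
            addr, value = payload
            self.memory[addr] = value        # deposit data
            self.send(src, "ack", addr)      # a reply is itself a new transaction
        elif kind == "ack":
            self.acked.add(payload)          # state change at the original source


def deliver(nodes):
    """Hypothetical network: move each queued message to its destination's input buffer."""
    for node in nodes.values():
        while node.output_buffer:
            message = node.output_buffer.popleft()
            nodes[message[0]].input_buffer.append(message)


if __name__ == "__main__":
    nodes = {0: Node(0), 1: Node(1)}
    nodes[0].send(1, "deposit", ("x", 42))
    deliver(nodes)
    nodes[1].handle_one()     # action at the destination
    deliver(nodes)
    nodes[0].handle_one()     # only now does the source observe completion
    print(nodes[1].memory, nodes[0].acked)
```

The source observes the deposit only through the second, reverse transaction, which is the point of "occurrence is not directly visible at source".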

Slide 22: Bus Transactions vs Net Transactions

Issue                             Bus          Network
protection check                  V->P         ??
format                            wires        flexible
output buffering                  reg, FIFO    ??
media arbitration                 global       local
destination naming and routing
input buffering                   limited      many source
action
completion detection

Slide 23: Shared Address Space Abstraction
fixed format, request/response, simple action

Slide 24: Consistency is challenging

Slide 25: Synchronous Message Passing

Slide 26: Asynch. Message Passing: Optimistic
Storage???

Slide 27: Asynch. Msg Passing: Conservative

Slide 28: Active Messages
User-level analog of network transaction
Action is small user function
Request/Reply
May also perform memory-to-memory transfer
[Figure labels: Request, handler, Reply]
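A minimal sketch of the request/reply pattern follows. It is not any real active-message library's API: `AMNode`, `am_request`, `am_reply`, and the two handlers are hypothetical names, and the "network" is just a direct append to the destination's inbox.

```python
from collections import deque


class AMNode:
    """Toy node: each message names the small user function to run on arrival."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network      # shared dict: node_id -> AMNode (assumed transport)
        self.inbox = deque()
        self.memory = {}
        self.results = []

    def am_request(self, dest, handler, *args):
        self.network[dest].inbox.append((handler, self.node_id, args))

    def am_reply(self, dest, handler, *args):
        # Same mechanism as a request; a reply handler must not send again.
        self.network[dest].inbox.append((handler, self.node_id, args))

    def poll(self):
        """Run the handler named by each pending message."""
        while self.inbox:
            handler, src, args = self.inbox.popleft()
            handler(self, src, *args)


def read_request_handler(node, src, addr):
    """Request handler: look up a word and reply to the requester."""
    node.am_reply(src, read_reply_handler, addr, node.memory.get(addr))


def read_reply_handler(node, src, addr, value):
    """Reply handler: record the returned value at the original requester."""
    node.results.append((addr, value))


if __name__ == "__main__":
    network = {}
    a, b = AMNode(0, network), AMNode(1, network)
    network.update({0: a, 1: b})
    b.memory["x"] = 7
    a.am_request(1, read_request_handler, "x")
    b.poll()    # runs the request handler on node 1
    a.poll()    # runs the reply handler on node 0
    print(a.results)
```

Keeping the handlers short and having replies never generate further messages is what makes the request/reply pairing simple to reason about in this sketch.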

Slide 29: Common Challenges
Input buffer overflow
  – N-1 queue over-commitment => must slow sources
  – reserve space per source (credit)
    » when available for reuse? Ack or higher level
  – Refuse input when full
    » backpressure in reliable network
    » tree saturation
    » deadlock free
    » what happens to traffic not bound for congested dest?
  – Reserve ack back channel
  – drop packets
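The "reserve space per source (credit)" option can be sketched as follows. The class and method names are illustrative assumptions; in this sketch the credit is returned when the receiver consumes the message, which is one possible answer to "when available for reuse?".

```python
from collections import deque


class CreditReceiver:
    """Input queue sized so that every source's credits fit at once."""

    def __init__(self, source_ids, credits_per_source):
        self.capacity = len(source_ids) * credits_per_source
        self.queue = deque()

    def accept(self, source_id, message):
        # With correct credit accounting this can never overflow.
        assert len(self.queue) < self.capacity, "credit accounting violated"
        self.queue.append((source_id, message))

    def consume(self):
        """Remove one message; the caller returns the credit to its source."""
        return self.queue.popleft()


class CreditSender:
    def __init__(self, source_id, credits):
        self.source_id = source_id
        self.credits = credits

    def try_send(self, receiver, message):
        """Send only while holding a credit; otherwise the source must wait."""
        if self.credits == 0:
            return False
        self.credits -= 1
        receiver.accept(self.source_id, message)
        return True

    def credit_returned(self):
        self.credits += 1


if __name__ == "__main__":
    receiver = CreditReceiver(source_ids=[0, 1], credits_per_source=2)
    sender = CreditSender(0, credits=2)
    print([sender.try_send(receiver, m) for m in ("a", "b", "c")])
    receiver.consume()
    sender.credit_returned()
    print(sender.try_send(receiver, "c"))
```

The third send is refused at the source instead of overflowing the destination, so the slowdown happens before the message enters the network.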

Slide 30: Challenges (cont)
Fetch Deadlock
  – For network to remain deadlock free, nodes must continue accepting messages, even when they cannot source them
  – what if incoming transaction is a request?
    » Each may generate a response, which cannot be sent!
    » What happens when internal buffering is full?
logically independent request/reply networks
  – physical networks
  – virtual channels with separate input/output queues
bound requests and reserve input buffer space
  – K(P-1) requests + K responses per node
  – service discipline to avoid fetch deadlock?
NACK on input buffer full
  – NACK delivery?
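Two of the ideas on this slide can be sketched together: the reserved buffer bound K(P-1) + K, and separate request and reply queues with a simple service discipline. Everything here is an assumption made for illustration (the `FetchSafeNode` class, the "replies first" rule, and the single output queue for replies); the slide itself leaves the service discipline as an open question.

```python
from collections import deque


def input_buffer_entries(p: int, k: int) -> int:
    """Reserved input space per node: K(P-1) requests plus K responses."""
    return k * (p - 1) + k


class FetchSafeNode:
    """Separate request and reply input queues, as with two logical networks."""

    def __init__(self, reply_out_capacity: int):
        self.requests_in = deque()
        self.replies_in = deque()
        self.replies_out = deque()
        self.reply_out_capacity = reply_out_capacity
        self.completed = 0

    def service_once(self) -> str:
        # Replies are always accepted: they complete work and send nothing.
        if self.replies_in:
            self.replies_in.popleft()
            self.completed += 1
            return "reply"
        # A request is serviced only when its response has somewhere to go.
        if self.requests_in and len(self.replies_out) < self.reply_out_capacity:
            request = self.requests_in.popleft()
            self.replies_out.append(("reply-to", request))
            return "request"
        return "idle"


if __name__ == "__main__":
    print(input_buffer_entries(p=64, k=4))
    node = FetchSafeNode(reply_out_capacity=1)
    node.requests_in.extend(["r1", "r2"])
    node.replies_in.append("p1")
    print([node.service_once() for _ in range(3)])
```

Because incoming replies never wait behind blocked requests, the node keeps draining the reply network even when it cannot source new responses.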

Slide 31: Summary
Scalability
  – physical, bandwidth, latency and cost
  – level of integration
Realizing Programming Models
  – network transactions
  – protocols
  – safety
    » N-1
    » fetch deadlock
Next: Communication Architecture Design Space
  – how much hardware interpretation of the network transaction?