High-Performance Clusters, Part 1: Performance. David E. Culler, Computer Science Division, U.C. Berkeley. PODC/SPAA Tutorial, Sunday, June 28, 1998.

Slide 2: Clusters have Arrived … the SPAA/PODC testbed going forward.

Slide 3: Berkeley NOW

Slide 4: NOW's Commercial Version: 240 processors, Active Messages, Myrinet, ...

Slide 5: Berkeley Massive Storage Cluster serving Fine Art on the web.

Slide 6: Commercial Scene

Slide 7: What's a Cluster? A collection of independent computer systems working together as if they were a single system, coupled through a scalable, high-bandwidth, low-latency interconnect.

Slide 8: Outline for Part 1. Why Clusters NOW? What is the key challenge? How is it overcome? How much performance? Where is it going?

Slide 9: Why Clusters? Capacity, availability, scalability, cost-effectiveness.

Slide 10: Traditional Availability Clusters: VAX Clusters => IBM Sysplex => Wolfpack. (Diagram: clients reach Server A and Server B over an interconnect; the two servers share disk array A and disk array B.)

Slide 11: Why HP Clusters NOW? Time to market => performance; technology; internet services. (Chart: engineering lag time and node performance in a large system.)

Slide 12: Technology Breakthrough. Killer micro => killer switch: a single-chip building block for scalable networks; high bandwidth, low latency, very reliable.

Slide 13: Opportunity: Rethink System Design. Remote memory and processors are closer than local disks! Networking stacks? Virtual memory? File system design? It all looks like parallel programming. There is huge demand for scalable, available, dedicated internet servers: big I/O, big compute.

Slide 14: Example: Traditional File System. Clients, each with a small local private file cache, reach a single server over a fast channel (HIPPI); the server holds a large global shared file cache in front of RAID disk storage. The result is expensive, complex, non-scalable, a single point of failure, and a bottleneck: server resources are at a premium while client resources are poorly utilized.

Slide 15: Truly Distributed File System. Every node contributes a processor and a file cache over a scalable low-latency communication network: VM pages to remote memory, files are striped across the network as a RAID, and caching happens both locally and cooperatively across the cluster. G = node communication BW / disk BW.
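Spelled out, the G on this slide is simply the ratio of a node's network bandwidth to a single disk's bandwidth, which suggests roughly how many remote disks one node can stream from before its own link, rather than any single disk, becomes the limit:

```latex
% G: how many disks' worth of bandwidth one node's network link can carry
G = \frac{B_{\mathrm{node\,comm}}}{B_{\mathrm{disk}}}
```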

Slide 16: Fast Communication Challenge. We have fast processors and fast networks; the time is spent crossing between them. (Diagram: each killer platform reaches the killer switch through network interface hardware and communication software, with time scales spanning ns to µs to ms.)

Slide 17: Opening: Intelligent Network Interfaces. Dedicated processing power and storage embedded in the network interface: an I/O card today, on chip tomorrow? (Diagram: a Sun Ultra 170 host, with processor, cache, and memory, attached over a 50 MB/s I/O bus (S-Bus) to a Myricom NIC containing its own processor and memory, reaching the 160 MB/s Myricom network.)

Slide 18: Our Attack: Active Messages. Small request/reply active messages (RPC-style) plus bulk transfer (store & get), in a highly optimized communication layer on a range of hardware. (Diagram: a request invokes a handler at the destination, which sends back the reply.)
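The slide does not spell out the Active Messages API, so the sketch below uses hypothetical names (am_request, am_reply, a loopback stub for the network) purely to illustrate the request/reply half of the model: every message carries the handler that runs on arrival.

```c
/* Sketch of the Active Messages request/reply model: every message names a
 * handler that runs when it arrives at the destination.  The "network" here
 * is a loopback stub so the file compiles and runs stand-alone; am_request,
 * am_reply, and the handler names are illustrative, not the real GAM/NOW API. */
#include <stdint.h>
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, uint32_t arg);

/* --- loopback stand-ins for the communication layer --- */
static void am_request(int dst, am_handler_t h, uint32_t arg) { (void)dst; h(0, arg); }
static void am_reply(int dst, am_handler_t h, uint32_t arg)   { h(dst, arg); }

/* --- application-level handlers --- */
static uint32_t table[16] = { [3] = 42 };      /* data owned by the "remote" node */
static volatile uint32_t result;

static void get_reply_h(int src, uint32_t value) { (void)src; result = value; }
static void get_request_h(int src, uint32_t idx) { am_reply(src, get_reply_h, table[idx]); }

int main(void) {
    /* one request/reply round trip implements a small remote read (RPC style) */
    am_request(/*dst=*/1, get_request_h, /*arg=*/3);
    printf("remote_get -> %u\n", (unsigned)result);   /* prints 42 */
    return 0;
}
```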

Slide 19: NOW System Architecture. UNIX workstations, each with network interface hardware and communication software, connected by a fast commercial switch (Myrinet). A global-layer UNIX provides resource management, network RAM, distributed files, and process migration. On top run large sequential applications and parallel applications over Sockets, Split-C, MPI, HPF, and vSM.

Slide 20: Cluster Communication Performance

Slide 21: LogP. P processor/memory modules communicate through an interconnection network of limited volume (at most L/g messages in flight per processor). L (latency): time to send a small message between modules. o (overhead): time the processor is busy sending or receiving a message. g (gap): minimum time between successive sends or receives at a processor (1/rate). Round-trip time: 2 x (2o + L).
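Written out, the slide's round-trip figure and the rate and capacity limits that come with the gap parameter are:

```latex
% Round trip for a small request/reply: each direction pays send overhead,
% network latency, and receive overhead.
T_{\mathrm{RTT}} = 2\,(2o + L)

% The gap bounds the per-processor message rate, and at most
% \lceil L/g \rceil messages from one processor are in flight at once.
\text{rate} \le \frac{1}{g}, \qquad
\text{messages in flight per processor} \le \left\lceil \frac{L}{g} \right\rceil
```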

Slide 22: LogP Comparison. Direct, user-level network access: Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), … (Chart: latency and 1/BW for each layer.)

Slide 23: MPI over AM: ping-pong bandwidth

Slide 24: MPI over AM: start-up
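Slides 23 and 24 are charts from a ping-pong measurement of MPI layered over Active Messages. A generic version of that microbenchmark, using only standard MPI calls (the message sizes and repetition count here are arbitrary choices, not the ones behind the charts), looks like this:

```c
/* Generic MPI ping-pong microbenchmark: rank 0 and rank 1 bounce a message
 * back and forth; half the round-trip time gives one-way latency (start-up
 * cost for small messages), and bytes / one-way time gives bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) { MPI_Finalize(); return 1; }   /* needs two ranks */

    const int reps = 1000;
    for (int bytes = 1; bytes <= (1 << 20); bytes *= 2) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way_us = (MPI_Wtime() - t0) / (2.0 * reps) * 1e6;
        if (rank == 0)
            printf("%8d bytes: %8.2f us one-way, %8.2f MB/s\n",
                   bytes, one_way_us, bytes / one_way_us);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}
```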

Slide 25: Cluster Application Performance: NAS Parallel Benchmarks

Slide 26: NPB2: NOW vs. SP2

Slide 27: NPB2: NOW vs. SGI Origin

Slide 28: Where the Time Goes: LU

Slide 29: Where the Time Goes: SP

Slide 30: LU Working Set. On 4 processors: the traditional miss-rate curve for small caches, with a sharp knee above 256 KB per processor (1 MB total).

Slide 31: LU Working Set (CPS scaling). The knee appears once the global cache exceeds 1 MB: the machine experiences the drop in miss rate at a specific aggregate cache size.

Slide 32: Application Sensitivity to Communication Performance

Slide 33: Adjusting L, o, and g (and G) in situ (Martin et al., ISCA '97). Δo: stall the Ultra on message write and on message read. Δg: delay the LANai after message injection (after each fragment for bulk transfers). ΔL: defer marking the message as valid until receive time + ΔL. (Diagram: the AM library on each host workstation and the LANai on the Myrinet interface are the points where the delays are inserted.)
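The actual knobs live in the AM library and LANai firmware; as a host-side analogy, adding Δo amounts to spinning for a calibrated interval around each message operation, roughly as in this sketch (raw_send and delta_o_us are placeholders, not the instrumented NOW code):

```c
/* Host-side analogy for the Delta-o knob: burn a calibrated amount of CPU
 * time around every message operation.  raw_send() stands in for the real
 * injection path; delta_o_us is whatever extra overhead is being modeled. */
#include <stdio.h>
#include <time.h>

static void spin_us(double us) {
    struct timespec t0, t;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    do {
        clock_gettime(CLOCK_MONOTONIC, &t);
    } while ((t.tv_sec - t0.tv_sec) * 1e6 + (t.tv_nsec - t0.tv_nsec) / 1e3 < us);
}

static void raw_send(const void *buf, int len) {   /* placeholder send path */
    (void)buf; (void)len;
}

static double delta_o_us = 5.0;    /* added per-message overhead being studied */

void send_with_added_overhead(const void *buf, int len) {
    raw_send(buf, len);
    spin_us(delta_o_us);           /* processor stays busy, so o grows but L and g do not */
}

int main(void) {
    char msg[64] = {0};
    for (int i = 0; i < 10; i++)
        send_with_added_overhead(msg, sizeof msg);
    printf("sent 10 messages with ~%.1f us extra overhead each\n", delta_o_us);
    return 0;
}
```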

Slide 34: Calibration

Slide 35: Split-C Applications.

Program       Description              Input                        P=16  P=32  Msg Type
Radix         Integer radix sort       16M 32-bit keys                           msg
EM3D (write)  Electro-magnetic         80K nodes, 40% remote                     write
EM3D (read)   Electro-magnetic         80K nodes, 40% remote                     read
Sample        Integer sample sort      32M 32-bit keys                           msg
Barnes        Hierarchical N-body      1 million bodies                          cached read
P-Ray         Ray tracer               1 million pixel image                     cached read
MurPHI        Protocol verification    SCI protocol, 2 proc                      Bulk
Connect       Connected components     4M nodes, 2-D mesh, 30%                   BSP
NOW-sort      Disk-to-disk sort        32M 100-byte records                      I/O
Radb          Bulk version of Radix    16M 32-bit keys                           Bulk

(The P=16 and P=32 columns give the per-message interval in µs.)

Slide 36: Sensitivity to Overhead

Slide 37: Comparative Impact

Slide 38: Sensitivity to bulk BW (1/G)

Slide 39: Cluster Communication Performance. Overhead, overhead, overhead: applications are hypersensitive to it because added overhead increases serialization. Sensitivity to gap reflects bursty communication. The applications are surprisingly latency tolerant. There is plenty of room for overhead improvement. How sensitive are distributed systems?

Slide 40: Extrapolating to Low Overhead

Slide 41: Direct Memory Messaging. A send region and a receive region at each end of a communication channel; stores written through the send region appear in the remote receive region.
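On interconnects built this way the send region is just mapped memory whose stores propagate to the other node; the sketch below mimics that programming model with a local POSIX shared-memory mapping standing in for the remote side (the names and the sequence-number handshake are illustrative, not any vendor's API):

```c
/* Programming-model sketch of direct memory messaging: the sender stores
 * into its mapped send region, the receiver polls its receive region; no
 * system call or copy on the critical path.  A local shm mapping stands in
 * here for the reflective-memory hardware that would link two nodes. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct channel {
    volatile uint32_t seq;        /* written last; the receiver polls it */
    char payload[60];
};

int main(void) {
    int fd = shm_open("/dmm_demo", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(struct channel)) != 0) return 1;
    struct channel *send_region = mmap(NULL, sizeof *send_region,
                                       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    struct channel *recv_region = send_region;   /* same memory: the "remote" end */

    /* sender: plain stores into the send region, sequence number last */
    strcpy(send_region->payload, "hello through the send region");
    send_region->seq = 1;

    /* receiver: just watches memory change */
    while (recv_region->seq == 0)
        ;                                        /* poll */
    printf("received: %s\n", recv_region->payload);

    munmap(send_region, sizeof *send_region);
    close(fd);
    shm_unlink("/dmm_demo");
    return 0;
}
```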

Slide 42: Direct Memory Interconnects. DEC Memory Channel: 3 µs end-to-end, ~1 µs o and L. SCI. SGI. SHRIMP (Princeton), 100 MB/s.

Slide 43: Scalability, Availability, and Performance (Inktomi). Scale disk, memory, and processors independently. A random node serves each query, but all nodes search. On a hardware or software failure, you lose random columns of the index; on overload, you lose random rows. (Diagram: a row of processors on Myrinet, with FE links, holding a 100-million-document index.)

Slide 44: Summary. Performance => generality (see Part 2). From technology "shift" to technology "trend". Cluster communication is becoming cheap (gigabit Ethernet). System area networks are becoming commodity (Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun). Interconnect bandwidth keeps improving (a gigabyte per second and beyond). Bus connections are improving (PCI, ePCI, Pentium II cluster slot, …). The operating system is getting out of the way (VIA).

Slide 45: Advice. Clusters are cheap, easy to build, flexible, powerful, general purpose, and fun. Everybody doing SPAA or PODC should have one to try out their ideas. You can use the Berkeley NOW through NPACI.