High Performance Cluster Computing Architectures and Systems


High Performance Cluster Computing Architectures and Systems Book Editor: Rajkumar Buyya Slides Prepared by: Hai Jin Internet and Cluster Computing Center

Introduction
- Need for more computing power
- Improving the operating speed of processors and other components is constrained by the speed of light, thermodynamic laws, and the high financial cost of processor fabrication
- Alternative: connect multiple processors together and coordinate their computational efforts
- Parallel computers allow a computational task to be shared among multiple processors

Era of Computing
- Rapid technical advances: recent advances in VLSI technology; software technology (OS, programming languages, development methodologies, and tools)
- Grand challenge applications have become the main driving force
- Parallel computing is one of the best ways to overcome the speed bottleneck of a single processor
- A small cluster-based parallel computer offers a good price/performance ratio

Need for More Computing Power: Grand Challenge Applications
Solving technology problems using computer modeling, simulation, and analysis:
- Geographic Information Systems
- Aerospace
- Life Sciences
- CAD/CAM
- Digital Biology
- Military Applications

Parallel Computer Architectures
Taxonomy based on how processors, memory, and interconnect are laid out:
- Massively Parallel Processors (MPP)
- Symmetric Multiprocessors (SMP)
- Cache-Coherent Nonuniform Memory Access (CC-NUMA)
- Distributed Systems
- Clusters
- Grids

Parallel Computer Architectures: MPP and SMP
MPP
- A large parallel processing system with a shared-nothing architecture
- Consists of several hundred nodes connected by a high-speed interconnection network/switch
- Each node has its own main memory and one or more processors, and runs a separate copy of the OS
SMP
- 2-64 processors today
- Shared-everything architecture: all processors share all the global resources available
- A single copy of the OS runs on the system

Parallel Computer Architectures: CC-NUMA, Distributed Systems, Clusters
CC-NUMA
- A scalable multiprocessor system with a cache-coherent nonuniform memory access architecture
- Every processor has a global view of all of the memory
Distributed systems
- Conventional networks of independent computers
- Have multiple system images, as each node runs its own OS
- The individual machines can be combinations of MPPs, SMPs, clusters, and individual computers
Clusters
- A collection of workstations or PCs interconnected by a high-speed network
- Work as an integrated collection of resources
- Have a single system image spanning all nodes

Towards Low Cost Parallel Computing
- Parallel processing: linking together two or more computers to jointly solve a computational problem
- Since the early 1990s there has been an increasing trend away from expensive, specialized proprietary parallel supercomputers towards cheaper, general-purpose systems of loosely coupled components, built from single- or multi-processor PCs or workstations
- Driven by the rapid improvement in the availability of commodity high-performance components for workstations and networks
- Low-cost commodity supercomputing requires standardization of many of the tools and utilities used by parallel applications, e.g., MPI and HPF
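The message-passing style that standards such as MPI codify can be sketched with nothing but the Python standard library. The sketch below is illustrative only (it uses `multiprocessing` pipes, not MPI itself, and all names are made up): each "node" process computes a partial result on its own slice of the data and sends it back, a hand-rolled version of an MPI-style reduce.

```python
from multiprocessing import Process, Pipe

def worker(conn, data):
    # Each "node" computes a partial sum of its slice and sends it back,
    # much like an MPI reduce implemented by hand over point-to-point messages.
    conn.send(sum(data))
    conn.close()

def parallel_sum(values, nworkers=4):
    chunk = (len(values) + nworkers - 1) // nworkers
    procs, conns = [], []
    for rank in range(nworkers):
        parent, child = Pipe()
        p = Process(target=worker,
                    args=(child, values[rank * chunk:(rank + 1) * chunk]))
        p.start()
        procs.append(p)
        conns.append(parent)
    total = sum(c.recv() for c in conns)  # gather the partial results
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(parallel_sum(list(range(100))))  # 4950
```

The point of a standard like MPI is that this send/receive/gather pattern gets a portable, well-defined API instead of an ad hoc one.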

Windows of Opportunities
- Parallel processing: use multiple processors to build MPP/DSM-like systems for parallel computing
- Network RAM: use the memory associated with each workstation as an aggregate DRAM cache
- Software RAID (redundant array of inexpensive disks): use arrays of workstation disks to provide cheap, highly available, and scalable file storage, making it possible to offer parallel I/O support to applications
- Multipath communication: use multiple networks for parallel data transfer between nodes
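The availability property of software RAID rests on a simple parity idea, sketched here (RAID-5 style, with made-up block contents): the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors.

```python
def parity(blocks):
    # XOR all blocks together, byte by byte (blocks must be equal length)
    p = bytes(len(blocks[0]))
    for b in blocks:
        p = bytes(x ^ y for x, y in zip(p, b))
    return p

def reconstruct(surviving_blocks, parity_block):
    # XOR of the parity with all surviving blocks yields the missing block
    return parity(surviving_blocks + [parity_block])

data = [b"node", b"disk", b"data"]
p = parity(data)
lost = data.pop(1)              # simulate losing one disk's block
assert reconstruct(data, p) == lost
```

Real software RAID layers striping and rotation of the parity block on top of this, but the recovery math is exactly this XOR.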

Cluster Computer and its Architecture
- A cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
- A node is a single- or multi-processor system with memory, I/O facilities, and an OS
- Generally two or more computers (nodes) connected together, either in a single cabinet or physically separated and connected via a LAN
- Appears as a single system to users and applications
- Provides a cost-effective way to gain features and benefits

Cluster Computer Architecture

Prominent Components of Cluster Computers (I)
Multiple High Performance Computers
- PCs
- Workstations
- SMPs (CLUMPS)
- Distributed HPC systems, leading to metacomputing

Prominent Components of Cluster Computers (III)
High Performance Networks/Switches
- Ethernet (10 Mbps), Fast Ethernet (100 Mbps), Gigabit Ethernet (1 Gbps)
- SCI (Dolphin; ~12 microsecond MPI latency)
- ATM
- Myrinet (1.2 Gbps)
- Digital Memory Channel
- FDDI
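The reason both latency and bandwidth appear in the list above is the standard back-of-the-envelope cost model for a message on an interconnect: time = startup latency + message size / bandwidth. The figures below are illustrative class-of-network numbers, not vendor specifications.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    # Simple linear cost model: fixed startup latency plus serialization time
    return latency_s + size_bytes / bandwidth_bytes_per_s

# 1 MB message over a Fast-Ethernet-class link (~100 us latency, 12.5 MB/s)
t_fast = transfer_time(1_000_000, 100e-6, 12.5e6)
# Same message over a Myrinet-class link (~10 us latency, 150 MB/s)
t_myri = transfer_time(1_000_000, 10e-6, 150e6)
print(round(t_fast, 4), round(t_myri, 4))
```

For large messages bandwidth dominates; for the small messages typical of tightly coupled parallel programs, the startup latency dominates, which is why low-latency networks like SCI and Myrinet matter for clusters.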

Prominent Components of Cluster Computers (V)
Fast Communication Protocols and Services
- Active Messages (Berkeley)
- Fast Messages (Illinois)
- U-Net (Cornell)
- XTP (Virginia)

Prominent Components of Cluster Computers (VII)
Parallel Programming Environments and Tools
- Threads (PCs, SMPs, NOW, ...): POSIX Threads, Java Threads
- MPI: Linux, NT, and many supercomputers
- PVM
- Software DSMs (e.g., TreadMarks)
- Compilers: C/C++/Java; parallel programming with C++ (MIT Press book)
- Debuggers
- Performance analysis tools
- Visualization tools
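The threads entry above is the classic shared-memory pattern, sketched here with Python's standard `threading` module in place of POSIX threads: worker threads sum disjoint slices of an array and a lock protects the shared accumulator.

```python
import threading

total = 0
lock = threading.Lock()

def partial_sum(data):
    global total
    s = sum(data)        # compute locally, outside the critical section
    with lock:           # serialize only the update of the shared variable
        total += s

values = list(range(1, 101))
# Four threads, each taking every 4th element (a disjoint partition)
threads = [threading.Thread(target=partial_sum, args=(values[i::4],))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # 5050
```

The design point carries over directly to POSIX threads in C: do as much work as possible in private data, and hold the mutex only for the shared update.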

Key Operational Benefits of Clustering
- High performance
- Expandability and scalability
- High throughput
- High availability

Clusters Classification (III)
Node Hardware
- Clusters of PCs (CoPs) / Piles of PCs (PoPs)
- Clusters of Workstations (COWs)
- Clusters of SMPs (CLUMPs)

Clusters Classification (V)
Node Configuration
- Homogeneous clusters: all nodes have similar architectures and run the same OS
- Heterogeneous clusters: nodes have different architectures and run different OSs

Clusters Classification (VI)
Levels of Clustering
- Group clusters (#nodes: 2-99): nodes connected by a SAN such as Myrinet
- Departmental clusters (#nodes: 10s to 100s)
- Organizational clusters (#nodes: many 100s)
- National metacomputers (WAN/Internet-based)
- International metacomputers (Internet-based; #nodes: 1000s to many millions)
- Metacomputing, web-based computing, agent-based computing
- Java plays a major role in web- and agent-based computing

Commodity Components for Clusters (III)
Disk and I/O
- Overall improvement in disk access time has been less than 10% per year
- Amdahl's law: the speed-up obtained from faster processors is limited by the slowest system component
- Parallel I/O: carry out I/O operations in parallel, supported by a parallel file system based on hardware or software RAID
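Amdahl's law makes the slide's point quantitative: the speedup from adding processors is capped by the fraction of the workload that remains serial (for example, I/O-bound).

```python
def amdahl_speedup(parallel_fraction, n_processors):
    # S = 1 / ((1 - p) + p / n), where p is the parallelizable fraction
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# Even with 1024 processors, a 10% serial fraction caps speedup near 10x,
# which is why slow disks and I/O limit the whole cluster.
print(round(amdahl_speedup(0.9, 1024), 2))  # 9.91
```

In the limit of infinitely many processors the speedup tends to 1 / (1 - p), so improving the serial component (here, disk and I/O, via parallel I/O and RAID) is as important as adding nodes.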

What is Single System Image (SSI)? A single system image is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource. SSI makes the cluster appear like a single machine to the user, to applications, and to the network. A cluster without an SSI is not a cluster.

Cluster Middleware & SSI
- SSI is supported by a middleware layer that resides between the OS and the user-level environment
- Middleware consists essentially of two sublayers of software infrastructure:
- SSI infrastructure: glues together the OSs on all nodes to offer unified access to system resources
- System availability infrastructure: enables cluster services such as checkpointing, recovery from failure, and fault-tolerant support among all nodes of the cluster

Single System Image Benefits
- Provides a simple, straightforward view of all system resources and activities from any node of the cluster
- Frees the end user from having to know where an application will run
- Frees the operator from having to know where a resource is located
- Lets the user work with familiar interfaces and commands, and allows administrators to manage the entire cluster as a single entity
- Reduces the risk of operator errors, so end users see improved reliability and higher availability of the system

Single System Image Benefits (Cont'd)
- Allows centralized or decentralized system management and control, reducing the need for skilled administrators
- Presents multiple, cooperating components of an application to the administrator as a single application
- Greatly simplifies system management
- Provides location-independent message communication
- Tracks the locations of all resources, so system operators no longer need to be concerned with their physical location
- Provides transparent process migration and load balancing across nodes, improving system response time and performance

Resource Management and Scheduling (RMS)
- RMS is the act of distributing applications among computers to maximize their throughput
- Enables effective and efficient utilization of the available resources
Software components
- Resource manager: locating and allocating computational resources, authentication, process creation and migration
- Resource scheduler: queueing applications, resource location and assignment
Reasons for using RMS
- Provides increased, reliable throughput of user applications on the systems
- Load balancing
- Utilizing spare CPU cycles
- Providing fault-tolerant systems
- Managing access to powerful systems, etc.
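The two components above can be sketched in a few lines. This is a hypothetical toy, not any real batch system: the "resource manager" is a min-heap of node loads, and the "scheduler" drains a FIFO job queue onto the least-loaded node (a simple load-balancing policy). All class, node, and job names are invented for illustration.

```python
from collections import deque
import heapq

class MiniRMS:
    def __init__(self, nodes):
        # (current_load, node_name) min-heap: the resource manager's view
        self.nodes = [(0, n) for n in nodes]
        heapq.heapify(self.nodes)
        self.queue = deque()          # the scheduler's FIFO job queue

    def submit(self, job_name, cost):
        self.queue.append((job_name, cost))

    def schedule(self):
        placements = {}
        while self.queue:
            job, cost = self.queue.popleft()
            load, node = heapq.heappop(self.nodes)   # least-loaded node
            placements[job] = node
            heapq.heappush(self.nodes, (load + cost, node))
        return placements

rms = MiniRMS(["node1", "node2"])
for j, c in [("sim", 4), ("render", 2), ("index", 1)]:
    rms.submit(j, c)
print(rms.schedule())  # {'sim': 'node1', 'render': 'node2', 'index': 'node2'}
```

Real RMS software adds authentication, priorities, fault tolerance, and migration on top of this core queue-plus-placement loop.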

Services Provided by RMS
- Process migration: when a computational resource has become too heavily loaded, or out of fault-tolerance concerns
- Checkpointing
- Scavenging idle cycles: most workstations are idle 70% to 90% of the time
- Fault tolerance
- Minimization of impact on users
- Load balancing
- Multiple application queues
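The checkpointing service listed above can be sketched as follows (a minimal illustration, not a real RMS facility; the file name and interval are arbitrary): a long-running computation periodically serializes its state so that after a node failure it restarts from the last checkpoint rather than from scratch.

```python
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo_sum.ckpt")
if os.path.exists(CKPT):
    os.remove(CKPT)                    # start the demo from a clean slate

def run(upto, checkpoint_every=1000):
    # Resume from the last checkpoint if one exists
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            i, acc = pickle.load(f)
    else:
        i, acc = 0, 0
    while i < upto:
        acc += i
        i += 1
        if i % checkpoint_every == 0:
            with open(CKPT, "wb") as f:
                pickle.dump((i, acc), f)   # atomic rename omitted for brevity
    return acc

print(run(10_000))  # 49995000, resumable if interrupted mid-run
```

Production checkpointing must also capture open files and in-flight messages, and write checkpoints atomically, but the save/restore loop is the essence.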

Computing Platforms Evolution: Breaking Administrative Barriers
[Figure: performance grows as administrative barriers are crossed, from desktop (single processor) through SMPs or supercomputers, local cluster, enterprise cluster/grid, and global cluster/grid to inter-planet cluster/grid; the administrative scope widens from individual to group, department, campus, state, national, globe, inter-planet, and universe.]

Why Do We Need Metacomputing?
- Our computational needs are infinite, whereas our financial resources are finite
- Users will always want more and more powerful computers
- Try to utilize the potentially hundreds of thousands of computers that are interconnected in some unified way
- Need seamless access to remote resources

Towards Grid Computing….

What is a Grid? An infrastructure that couples:
- Computers (PCs, workstations, clusters, traditional supercomputers, and even laptops, notebooks, mobile computers, PDAs, and so on)
- Software (e.g., renting expensive special-purpose applications on demand)
- Databases (e.g., transparent access to the human genome database)
- Special instruments (e.g., radio telescopes: SETI@Home searching for life in the galaxy, Astrophysics@Swinburne for pulsars)
- People (maybe even animals, who knows?)
across the Internet, and presents them as a unified, integrated (single) resource. http://www.csse.monash.edu.au/~rajkumar/ecogrid/

Conceptual view of the Grid Leading to Portal (Super)Computing

Grid Application-Drivers
Old and new applications are being enabled by the coupling of computers, databases, instruments, people, etc.:
- (Distributed) supercomputing
- Collaborative engineering
- High-throughput computing: large-scale simulation and parameter studies
- Remote software access / renting software
- Data-intensive computing
- On-demand computing

The Grid Impact
"The global computational grid is expected to drive the economy of the 21st century, just as the electric power grid drove the economy of the 20th century."

Metacomputer Design Objectives and Issues (II)
Underlying Hardware and Software Infrastructure
- A metacomputing environment must be able to operate on top of the whole spectrum of current and emerging hardware and software technology
- An ideal environment provides access to the available resources in a seamless manner, so that physical discontinuities such as differences between platforms, network protocols, and administrative boundaries become completely transparent

Metacomputer Design Objectives and Issues (III)
Middleware: The Metacomputing Environment
- Communication services: need to support the protocols used for bulk-data transport, streaming data, group communications, and those used by distributed objects
- Directory/registration services: provide the mechanism for registering and obtaining information about the metacomputer's structure, resources, services, and status
- Processes, threads, and concurrency control: share data and maintain consistency when multiple processes or threads have concurrent access to it

Metacomputer Design Objectives and Issues (V)
Middleware: The Metacomputing Environment
- Security and authorization
  - confidentiality: prevent disclosure of data
  - integrity: prevent tampering with data
  - authentication: verify identity
  - accountability: knowing whom to blame
- System status and fault tolerance
- Resource management and scheduling: efficiently and effectively schedule the applications that need to utilize the available resources in the metacomputing environment

Metacomputer Design Objectives and Issues (VI)
Middleware: The Metacomputing Environment
- Programming tools and paradigms: include interfaces, APIs, and conversion tools so as to provide a rich development environment; support a range of programming paradigms; a suite of numerical and other commonly used libraries should be available
- User and administrative GUI: an intuitive, easy-to-use interface to the available services and resources
- Availability: easily ported to a range of commonly used platforms, or built on technologies that make it platform neutral

Metacomputing Projects
- Globus (from Argonne National Laboratory): provides a toolkit of existing components for building metacomputing environments
- Legion (from the University of Virginia): provides a high-level unified object model out of new and existing components to build a metasystem
- WebFlow (from Syracuse University): provides a Web-based metacomputing environment

Globus (I)
- A computational grid: a hardware and software infrastructure that provides dependable, consistent, and pervasive access to high-end computational capabilities, despite the geographical distribution of both resources and users
- A layered architecture: high-level global services are built upon essential low-level core local services
- Globus Toolkit (GT): the central element of the Globus system; defines the basic services and capabilities required to construct a computational grid; consists of a set of components that implement basic services; provides a "bag of services", which is only possible when the services are distinct and have well-defined interfaces (APIs)

Globus (II)
- Globus Alliance: http://www.globus.org
- GT 3.0: Resource Management (GRAM), Information Services (MDS), Data Management (GridFTP), Security (GSI)
- GT 4.0 (2005): Execution Management, Information Services, Data Management, Security, Common Runtime (Web services based)

The Impact of Metacomputing
- Metacomputing is an infrastructure that can bond and unify globally remote and diverse resources
- At some stage in the future, our computing needs will be satisfied in the same pervasive and ubiquitous manner in which we use the electric power grid