Cluster Computing with Linux Prabhaker Mateti Wright State University

Abstract: Cluster computing distributes the computational load to collections of similar machines. This talk describes what cluster computing is, the typical Linux packages used, and examples of large clusters in use today. It also reviews cluster-computing modifications of the Linux kernel.

What Kind of Computing, did you say? Sequential, Concurrent, Parallel, Distributed, Networked, Migratory, Cluster, Grid, Pervasive, Quantum, Optical, Molecular.

Fundamentals Overview

Fundamentals Overview: Granularity of Parallelism; Synchronization; Message Passing; Shared Memory.

Granularity of Parallelism: Fine-Grained Parallelism; Medium-Grained Parallelism; Coarse-Grained Parallelism; NOWs (Networks of Workstations).

Fine-Grained Machines: tens of thousands of processor elements; the processor elements are slow (bit serial) with small, fast private RAM; shared memory; interconnection networks; message passing; Single Instruction Multiple Data (SIMD).

Medium-Grained Machines: typical configurations have thousands of processors; the processors have power between coarse- and fine-grained; either shared or distributed memory; traditionally research machines; Single Code Multiple Data (SCMD).

Coarse-Grained Machines: typical configurations have hundreds or thousands of processors; the processors are powerful (fast CPUs) and large (cache, vectors, multiple fast buses); memory is shared or distributed-shared; Multiple Instruction Multiple Data (MIMD).

Networks of Workstations: exploit inexpensive workstations/PCs and a commodity network; the NOW becomes a "distributed memory multiprocessor"; workstations send and receive messages; C and Fortran programs use PVM, MPI, etc. libraries; programs developed on NOWs are portable to supercomputers for production runs.

Definition of "Parallel": S1 begins at time b1 and ends at e1; S2 begins at time b2 and ends at e2. S1 || S2 begins at min(b1, b2) and ends at max(e1, e2), and is commutative (equivalent to S2 || S1).

Data Dependency: compare "x := a + b; y := c + d", "x := a + b || y := c + d", and "y := c + d; x := a + b". All three are equivalent because x depends only on a and b, y depends only on c and d, and a, b, c, d are assumed independent.

Types of Parallelism: Result: the data structure can be split into parts of the same structure. Specialist: each node specializes; pipelines. Agenda: there is a list of things to do, and each node can generalize.

Result Parallelism: also called embarrassingly parallel or perfect parallelism; computations that can be subdivided into sets of independent tasks that require little or no communication; e.g., Monte Carlo simulations, evaluating F(x, y, z).

Specialist Parallelism: different operations performed simultaneously on different processors. E.g., simulating a chemical plant: one processor simulates the preprocessing of chemicals, one simulates the reactions in the first batch, another simulates refining the products, etc.

Agenda Parallelism: MW Model. Manager: initiates the computation, tracks progress, handles workers' requests, interfaces with the user. Workers: spawned and terminated by the manager, make requests to the manager, send results to the manager.

Embarrassingly Parallel: result parallelism is obvious. Ex1: compute the square root of each of the million numbers given. Ex2: search for a given set of words among a billion web pages.

Reduction: combine several sub-results into one. Reducing r1 r2 … rn with op becomes r1 op r2 op … op rn. Hadoop is based on this idea.
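As a concrete sketch of this pattern (assuming an MPI installation is available; the program and variable names are invented for illustration), MPI_Reduce combines one sub-result per process with a chosen operator, which is exactly the r1 op r2 op … op rn idea:

  /* reduce.c: each rank contributes its rank number; rank 0 receives the sum */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, nprocs, local, sum;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
      local = rank;                       /* r_i: this process's sub-result */
      MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("sum of 0..%d = %d\n", nprocs - 1, sum);
      MPI_Finalize();
      return 0;
  }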

Shared Memory: process A writes to a memory location; process B reads from that memory location. Synchronization is crucial. Excellent speed. Semantics … ?

Shared Memory: needs hardware support (multi-ported memory) and atomic operations (Test-and-Set, semaphores).

Shared Memory Semantics: Assumptions. Global time is available, in discrete increments. The shared variable s = vi at ti, i = 0, …. Process A: s := v1 at time t1; assume no other assignment occurred after t1. Process B reads s at time t and gets value v.

Shared Memory: Semantics. Value of the shared variable: v = v1 if t > t1; v = v0 if t < t1; v = ?? if t = t1 (t = t1 +- a discrete quantum). The next update of the shared variable occurs at t2; t2 = t1 + ?

Distributed Shared Memory: "simultaneous" read/write access by spatially distributed processors; an abstraction layer over an implementation built from message-passing primitives; semantics not so clean.

Semaphores: Semaphore s; V(s) ::= < s := s + 1 >; P(s) ::= < when s > 0 do s := s - 1 >. A deeply studied theory.
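In practice the P and V operations map onto sem_wait and sem_post of the POSIX semaphore API; a minimal sketch (the shared counter and loop bound are invented for illustration):

  /* sem.c: P/V with a POSIX semaphore protecting a shared counter */
  #include <semaphore.h>
  #include <pthread.h>
  #include <stdio.h>

  static sem_t s;            /* used here as a binary semaphore (mutex) */
  static int counter = 0;    /* shared variable */

  static void *worker(void *arg) {
      for (int i = 0; i < 100000; i++) {
          sem_wait(&s);      /* P(s): wait until s > 0, then decrement */
          counter++;
          sem_post(&s);      /* V(s): increment s */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      sem_init(&s, 0, 1);    /* initial value 1 */
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      printf("counter = %d\n", counter);
      sem_destroy(&s);
      return 0;
  }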

Condition Variables: Condition C; C.wait(); C.signal().
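These map onto POSIX condition variables (pthread_cond_wait and pthread_cond_signal), which are always used together with a mutex and a predicate; a minimal sketch, with the 'ready' flag invented for illustration:

  /* cond.c: one thread waits on a condition, the main thread signals it */
  #include <pthread.h>
  #include <stdio.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
  static int ready = 0;

  static void *waiter(void *arg) {
      pthread_mutex_lock(&m);
      while (!ready)                  /* guard against spurious wakeups */
          pthread_cond_wait(&c, &m);  /* C.wait(): releases m while blocked */
      printf("condition signalled\n");
      pthread_mutex_unlock(&m);
      return NULL;
  }

  int main(void) {
      pthread_t t;
      pthread_create(&t, NULL, waiter, NULL);
      pthread_mutex_lock(&m);
      ready = 1;
      pthread_cond_signal(&c);        /* C.signal() */
      pthread_mutex_unlock(&m);
      pthread_join(t, NULL);
      return 0;
  }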

Distributed Shared Memory: a common address space that all the computers in the cluster share. Difficult to describe the semantics.

Distributed Shared Memory: Issues. Distributed: spatially, over a LAN or a WAN. No global time is available.

Distributed Computing: no shared memory; communication among processes by sending and receiving messages, asynchronously or synchronously; synergy among processes.

Messages: messages are sequences of bytes moving between processes. The sender and receiver must agree on the type structure of the values in the message. "Marshalling": data layout so that there is no ambiguity, such as "four chars" vs. "one integer".

Message Passing: process A sends a data buffer as a message to process B. Process B waits for a message from A and, when it arrives, copies it into its own local memory. No memory is shared between A and B.

Message Passing: obviously, messages cannot be received before they are sent, and a receiver waits until there is a message. Asynchronous: the sender never blocks, even if infinitely many messages are waiting to be received. Semi-asynchronous is a practical version of the above, with a large but finite amount of buffering.

Message Passing: Point to Point. Q: send(m, P): send message m to process P. P: recv(x, Q): receive a message from process Q and place it in variable x. The message data: the type of x must match that of m; the effect is as if x := m.
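The send(m, P) / recv(x, Q) pair corresponds closely to MPI_Send and MPI_Recv; a minimal sketch assuming the program is started as two MPI processes (the message value and tag are arbitrary):

  /* p2p.c: rank 0 sends an integer to rank 1 */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, m = 42, x = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* Q: send(m, P) */
      } else if (rank == 1) {
          MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                   MPI_STATUS_IGNORE);                      /* P: recv(x, Q) */
          printf("received %d\n", x);                       /* as if x := m  */
      }
      MPI_Finalize();
      return 0;
  }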

Broadcast: one sender Q, multiple receivers P; not all receivers may receive at the same time. Q: broadcast(m): send message m to all processes. P: recv(x, Q): receive a message from process Q and place it in variable x.
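In MPI the same pattern is a single collective call, MPI_Bcast; a minimal sketch assuming several MPI processes (the root rank and the broadcast value are arbitrary):

  /* bcast.c: rank 0 broadcasts a value to every process */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int rank, x = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0)
          x = 7;                                       /* the message m */
      MPI_Bcast(&x, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* every rank now has x = 7 */
      printf("rank %d has %d\n", rank, x);
      MPI_Finalize();
      return 0;
  }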

Synchronous Message Passing: the sender blocks until the receiver is ready to receive; cannot send messages to self; no buffering.

Asynchronous Message Passing: the sender never blocks; the receiver receives when ready; can send messages to self; infinite buffering.

Message Passing: speed is not so good, because the sender copies the message into system buffers, the message travels the network, and the receiver copies the message from system buffers into local memory; special virtual-memory techniques help. Programming quality: less error-prone compared with shared memory.

Computer Architectures

Architectures of the Top 500 Systems (chart)

Architectures of the Top 500 Systems (chart, continued)


"Parallel" Computers: traditional supercomputers (SIMD, MIMD, pipelines; tightly coupled shared memory; bus-level connections; expensive to buy and to maintain) and cooperating networks of computers.

Traditional Supercomputers: very high starting cost (expensive hardware, expensive software); high maintenance; expensive to upgrade.

Computational Grids: "Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations."

Computational Grids: individual nodes can be supercomputers or a NOW; high availability; accommodate peak usage; LAN : Internet :: NOW : Grid.

Buildings-Full of Workstations: 1. Distributed OSs have not taken a foothold. 2. Powerful personal computers are ubiquitous. 3. They are mostly idle: more than 90% of the up-time? 4. LANs running at Mb/s speeds are common. 5. Windows and Linux are the top two OSs in terms of installed base.

Networks of Workstations (NOW): workstation; network; operating system; cooperation; distributed+parallel programs.

What is a Workstation? PC? Mac? Sun …? "Workstation OS"?

"Workstation OS": authenticated users; protection of resources; multiple processes; preemptive scheduling; virtual memory; hierarchical file systems; network centric.

Clusters of Workstations: an inexpensive alternative to traditional supercomputers; high availability; lower down time; easier access; a development platform, with production runs on traditional supercomputers.

Clusters of Workstations: dedicated nodes; come-and-go nodes.

Clusters with Part-Time Nodes: cycle stealing: running jobs that do not belong to the workstation's owner. Definition of idleness: e.g., no keyboard and no mouse activity. Tools/libraries: Condor, PVM, MPI.

Cooperation: workstations are "personal"; use by others slows you down; … willing to share; willing to trust.

Cluster Characteristics: commodity off-the-shelf hardware; networked; common home directories; open-source software and OS; support for message-passing programming; batch scheduling of jobs; process migration.

Beowulf Cluster: dedicated nodes; single system view; commodity off-the-shelf hardware; internal high-speed network; open-source software and OS; support for parallel programming such as MPI and PVM; full trust in each other (login from one node into another without authentication; shared file-system subtree).

Example Clusters: July 1999; 1000 nodes; used for genetic algorithm research by John Koza, Stanford University.

Typical Big Beowulf: a 1000-node Beowulf cluster system, used for genetic algorithm research by John Koza, Stanford University.

Largest Cluster System: IBM BlueGene, 2007; DOE/NNSA/LLNL; Memory: GB; OS: CNK/SLES 9; Interconnect: proprietary; PowerPC 440; 106,496 nodes; TeraFLOPS on LINPACK.

World's Fastest: Roadrunner. Operating system: Linux; interconnect: Infiniband; cores: PowerXCell 8i, 3200 MHz; 1105 TFlops; at DOE/NNSA/LANL.

Cluster Computers for Rent: transfer executable files, source code, or data to your secure personal account on TTI servers (1); do this securely using winscp for Windows or "secure copy" (scp) for Linux. To execute your program, simply submit a job (2) to the scheduler using the "menusub" command, or do it manually using "qsub" (we use the popular PBS batch system); there are working examples of how to submit your executable. Your executable is securely placed on one of our in-house clusters for execution (3). Your results and data are written to your personal account in real time; download your results (4).

Turnkey Cluster Vendors: fully integrated Beowulf clusters with commercially supported Beowulf software systems are available from HP, IBM, Northrop Grumman, Accelerated Servers, and Penguin Computing.

Why are Linux Clusters Good? Low initial implementation cost: inexpensive PCs, standard components and networks, free software (Linux, GNU, MPI, PVM). Scalability: the cluster can grow and shrink. Familiar technology: easy for users to adopt the approach, and to use and maintain the system.

OS Share of the Top 500 (Nov.): table of OS, count, share, Rmax (GF), Rpeak (GF), and processor counts for Linux, Windows, Unix, BSD, Mixed, and MacOS (numeric values not preserved in this transcript).

OS Family Share of the Top 500: table of OS family, count, share %, Rmax (GF), Rpeak (GF), and processor counts. Surviving entries: Windows 5 (1.00%), BSD-based 1 (0.20%), Mac OS 1 (0.20%), totals 500 (100%).

Many Books on Linux Clusters: search google.com and amazon.com. Example book: William Gropp, Ewing Lusk, and Thomas Sterling (eds.), Beowulf Cluster Computing with Linux, MIT Press, 2003.

Why Is Beowulf Good? Low initial implementation cost: inexpensive PCs, standard components and networks, free software (Linux, GNU, MPI, PVM). Scalability: the cluster can grow and shrink. Familiar technology: easy for users to adopt the approach, and to use and maintain the system.

Single System Image: a common filesystem view from any node; common accounts on all nodes; a single software installation point; easy to install and maintain the system; easy to use for end users.

Closed Cluster Configuration (diagram): compute nodes connected by a high-speed network and a service network, a file-server node and a front-end, and a gateway node to the external network.

Open Cluster Configuration (diagram): compute nodes connected by a high-speed network, with a file-server node and a front-end, all reachable from the external network.

DIY Interconnection Network: most popular: Fast Ethernet. Network topologies: mesh, torus. Switch vs. hub.

Software Components: operating system (Linux, FreeBSD, …); "parallel" programs (PVM, MPI, …); utilities; open source.

Cluster Computing: ordinary programs running as-is on a cluster is not cluster computing. Cluster computing takes advantage of result parallelism, agenda parallelism, reduction operations, and process-grain parallelism.

Google Linux Clusters: GFS, the Google File System; thousands of terabytes of storage across thousands of disks on over a thousand machines; 150 million queries per day; average response time of 0.25 sec; near-100% uptime.

Cluster Computing Applications: Mathematical: fftw (fast Fourier transform), pblas (parallel basic linear algebra software), atlas (a collection of mathematical libraries), sprng (scalable parallel random number generator), MPITB (an MPI toolbox for MATLAB). Quantum chemistry software: Gaussian, qchem. Molecular dynamics solvers: NAMD, gromacs, gamess. Weather modeling: MM5.

Development of Cluster Programs: new algorithms + code; old programs re-done (reverse-engineer the design and re-code, using new languages that have distributed and parallel primitives, or with new libraries); parallelize legacy code (mechanical conversion by software tools).

Distributed Programs: spatially distributed programs (a part here, a part there, …; parallel; synergy); temporally distributed programs (compute half today, half tomorrow; combine the results at the end); migratory programs (have computation, will travel).

Technological Bases of Distributed+Parallel Programs: spatially distributed programs: message passing; temporally distributed programs: shared memory; migratory programs: serialization of data and programs.

Technological Bases for Migratory Programs: same CPU architecture (x86, PowerPC, MIPS, SPARC, …, JVM); same OS + environment; the ability to "checkpoint": suspend, and then resume computation, without loss of progress.

Parallel Programming Languages: shared-memory languages; distributed-memory languages; object-oriented languages; functional programming languages; concurrent logic languages; data flow languages.

Linda: Tuple Spaces, Shared Memory. Atomic primitives: in(t), read(t), out(t), eval(t). Host language: e.g., C/Linda, JavaSpaces.
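A sketch of the agenda (manager/worker) style in Linda-like pseudocode; the tuple tags "task" and "result", the task and worker counts, and the function names are all invented, and real C-Linda syntax may differ in details:

  /* Linda-like pseudocode: a manager fills tuple space with tasks,
     workers repeatedly withdraw a task and put back a result        */
  real_main() {
      for (int i = 0; i < 100; i++)
          out("task", i);               /* out(t): add a task tuple        */
      for (int w = 0; w < 4; w++)
          eval("worker", worker());     /* eval(t): spawn an active tuple  */
      for (int i = 0; i < 100; i++) {
          int id, r;
          in("result", ?id, ?r);        /* in(t): withdraw results, any order */
      }
  }

  worker() {
      int id;
      while (1) {
          in("task", ?id);              /* atomically remove one task */
          out("result", id, id * id);   /* publish its result         */
      }
  }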

Data Parallel Languages: data is distributed over the processors as arrays; entire arrays are manipulated, e.g., A(1:100) = B(1:100) + C(1:100); the compiler generates parallel code; Fortran 90; High Performance Fortran (HPF).

Parallel Functional Languages: Erlang; SISAL; PCN (Argonne); Haskell-Eden (marburg.de/~eden); Objective Caml with BSP; SAC, a functional array language.

Message Passing Libraries: the programmer is responsible for the initial data distribution, synchronization, and sending and receiving information; Parallel Virtual Machine (PVM); Message Passing Interface (MPI); Bulk Synchronous Parallel model (BSP).

BSP: Bulk Synchronous Parallel Model: divides computation into supersteps; in each superstep a processor can work on local data and send messages; at the end of the superstep, a barrier synchronization takes place and all processors receive the messages that were sent in the previous superstep.
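A sketch of one communication superstep in the style of the BSPlib interface (assuming a BSPlib implementation such as the Oxford BSP toolset or BSPonMPI is installed; the variable names are invented):

  #include <bsp.h>
  #include <stdio.h>

  int main(void) {
      bsp_begin(bsp_nprocs());
      int p = bsp_pid(), n = bsp_nprocs();
      int incoming = -1;
      bsp_push_reg(&incoming, sizeof(int));   /* make 'incoming' remotely writable */
      bsp_sync();                             /* superstep boundary                */

      int msg = p;                            /* local work ...                    */
      bsp_put((p + 1) % n, &msg, &incoming, 0, sizeof(int));   /* ... and sends    */
      bsp_sync();                             /* barrier: puts are now delivered   */

      printf("process %d received %d\n", p, incoming);
      bsp_pop_reg(&incoming);
      bsp_end();
      return 0;
  }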

BSP: Bulk Synchronous Parallel Model. Book: Rob H. Bisseling, "Parallel Scientific Computation: A Structured Approach using BSP and MPI," Oxford University Press, 2004, 324 pages.

BSP Library: a small number of subroutines to implement process creation, remote data access, and bulk synchronization; linked to C, Fortran, … programs.

Portable Batch System (PBS): Prepare a .cmd file naming the program and its arguments, the properties of the job, and the needed resources. Submit the .cmd file to the PBS Job Server with the qsub command. Routing and scheduling: the Job Server examines the .cmd details to route the job to an execution queue, allocates one or more cluster nodes to the job, and communicates with the Execution Servers (MOMs) on the cluster to determine the current state of the nodes; when all of the needed nodes are allocated, it passes the .cmd on to the Execution Server on the first node allocated (the "mother superior"). The Execution Server logs in on the first node as the submitting user and runs the .cmd file in the user's home directory, runs an installation-defined prologue script, gathers the job's output to standard output and standard error, executes an installation-defined epilogue script, and delivers stdout and stderr to the user.
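A small example of such a .cmd file and its submission; this is a sketch only, since queue names, resource limits, and paths are installation-specific and the program name is invented:

  #!/bin/sh
  #PBS -N myjob              # job name
  #PBS -l nodes=4:ppn=2      # request 4 nodes, 2 processors per node
  #PBS -l walltime=01:00:00  # one-hour limit
  #PBS -j oe                 # merge stderr into stdout
  cd $PBS_O_WORKDIR          # directory from which the job was submitted
  mpirun -np 8 ./myprog      # the actual program

  # submit, then check status:
  #   qsub myjob.cmd
  #   qstat -u $USER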

TORQUE, an Open-Source PBS: the Tera-scale Open-source Resource and QUEue manager (TORQUE) enhances OpenPBS. Fault tolerance: additional failure conditions checked/handled; node health-check script support. Scheduling interface. Scalability: a significantly improved server-to-MOM communication model; the ability to handle larger clusters (over 15 TF / 2,500 processors), larger jobs (over 2,000 processors), and larger server messages. Logging.

PVM and MPI: message-passing primitives; can be embedded in many existing programming languages; architecturally portable; open-source implementations.

Parallel Virtual Machine (PVM): PVM enables a heterogeneous collection of networked computers to be used as a single large parallel computer; older than MPI; large scientific/engineering user community.
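To give the flavor of the library, a sketch of a PVM master that spawns one worker and receives an integer back (assuming a working PVM installation; the task name "worker", the message tag, and the printed text are invented):

  /* master.c: spawn one worker task and receive an integer from it */
  #include <pvm3.h>
  #include <stdio.h>

  int main(void) {
      int tid, result;
      pvm_mytid();                                     /* enroll this process in PVM  */
      pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &tid);
      pvm_recv(tid, 1);                                /* block for a tag-1 message   */
      pvm_upkint(&result, 1, 1);                       /* unpack one integer          */
      printf("worker t%x sent %d\n", tid, result);
      pvm_exit();                                      /* leave PVM before exiting    */
      return 0;
  }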

Message Passing Interface (MPI): MPICH, by Argonne National Laboratory and Mississippi State University (www.mcs.anl.gov/mpi/mpich/); LAM.
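Both implementations provide a compiler wrapper and a launcher; typical usage looks roughly like the following (exact commands and flags vary with the installation, and "myprog" is a placeholder):

  mpicc -O2 -o myprog myprog.c    # compile with the MPI compiler wrapper
  mpirun -np 8 ./myprog           # launch 8 processes across the cluster nodes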

OpenMP for Shared Memory: a distributed shared memory API; the user gives hints as directives to the compiler.
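For example, loops annotated with OpenMP directives (a minimal sketch; the array size and names are arbitrary):

  /* omp.c: compile with, e.g., gcc -fopenmp omp.c */
  #include <omp.h>
  #include <stdio.h>

  int main(void) {
      double a[1000], sum = 0.0;
      #pragma omp parallel for              /* hint: iterations are independent */
      for (int i = 0; i < 1000; i++)
          a[i] = i * 0.5;
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < 1000; i++)
          sum += a[i];
      printf("sum = %f\n", sum);
      return 0;
  }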

SPMD: single program, multiple data (contrast with SIMD); the same program runs on multiple nodes; it may or may not run in lock-step; nodes may be of different speeds; barrier synchronization.

Condor: cooperating workstations that come and go; migratory programs; checkpointing; remote I/O; resource matching.

Migration of Jobs: policies: immediate eviction, or pause-and-migrate. Technical issues: checkpointing (preserving the state of the process so it can be resumed) and migrating from one architecture to another.

Kernels Etc. Mods for Clusters: dynamic load balancing; transparent process migration; kernel mods: the CLuster Membership Subsystem ("CLMS") and the Internode Communication Subsystem; GlusterFS: clustered file storage of petabytes; GlusterHPC: high-performance compute clusters; open-source software for volunteer computing and grid computing.

OpenMosix Distro: Quantian Linux: boot from DVD-ROM; compressed file system on DVD; several GB of cluster software. Live CD/DVD or single-floppy bootables; can be installed on HDD.

What is openMOSIX? An open-source enhancement to the Linux kernel; a cluster with come-and-go nodes; system image model: a virtual machine with lots of memory and CPU; granularity: the process; improves the overall (cluster-wide) performance; a multi-user, time-sharing environment for the execution of both sequential and parallel applications; applications run unmodified (no need to link with a special library).

What is openMOSIX? Execution environment: a farm of diskless x86-based nodes, UP (uniprocessor) or SMP (symmetric multiprocessor), connected by a standard LAN (e.g., Fast Ethernet). Adaptive resource management to dynamic load characteristics: CPU, RAM, I/O, etc. Linear scalability.

Users' View of the Cluster: users can start from any node in the cluster, or the sysadmin sets up a few nodes as login nodes; round-robin DNS: "hpc.clusters" with many IPs assigned to the same name; each process has a home node; migrated processes always appear to run at the home node, e.g., "ps" shows all your processes, even if they run elsewhere.

MOSIX Architecture: network transparency; preemptive process migration; dynamic load balancing; memory sharing; efficient kernel communication; probabilistic information-dissemination algorithms; decentralized control and autonomy.

A Two-Tier Technology: 1. Information gathering and dissemination: supports scalable configurations through probabilistic dissemination algorithms; the same overhead for 16 nodes or 2056 nodes. 2. Pre-emptive process migration that can migrate any process, anywhere, anytime, transparently: supervised by adaptive algorithms that respond to global resource availability; transparent to applications, with no change to the user interface.

Tier 1: Information Gathering and Dissemination: in each unit of time (e.g., 1 second) each node gathers information about CPU(s) speed, load, and utilization; free memory; and free proc-table/file-table slots. The info is sent to a randomly selected node. Scalable: more nodes give better scattering.

Tier 2: Process Migration: load balancing: reduce the variance between pairs of nodes to improve overall performance. Memory ushering: migrate processes from a node that has nearly exhausted its free memory, to prevent paging. Parallel file I/O: bring the process to the file server; direct file I/O from migrated processes.

Network Transparency: the user and applications are provided a virtual machine that looks like a single machine. Example: disk access from diskless nodes to the fileserver is completely transparent to programs.

Preemptive Process Migration: any user's process, transparently and at any time, can migrate to any other node. The migrating process is divided into a system context (deputy) that may not be migrated from the home workstation (UHN), and a user context (remote) that can be migrated to a diskless node.

Splitting the Linux Process: system context (environment): site dependent, "home" confined; connected by an exclusive link for both synchronous (system calls) and asynchronous (signals, MOSIX events) interaction. Process context (code, stack, data): site independent, may migrate. (Diagram: the deputy remains in the kernel on the local home node; the remote user context runs on a diskless node; the two are connected by the openMOSIX link.)

Dynamic Load Balancing: initiates process migrations in order to balance the load of the farm; responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds; makes continuous attempts to reduce the load differences among nodes. The policy is symmetrical and decentralized: all of the nodes execute the same algorithm, and the reduction of load differences is performed independently by any pair of nodes.

Memory Sharing: places the maximal number of processes in the farm's main memory, even if that implies an uneven load distribution among the nodes; delays swapping out of pages as much as possible; the decision of which process to migrate, and where to migrate it, is based on knowledge of the amount of free memory in other nodes.

Efficient Kernel Communication: reduces the overhead of internal kernel communications (e.g., between the process and its home site when it is executing at a remote site); a fast and reliable protocol with low startup latency and high throughput.

Probabilistic Information Dissemination Algorithms: each node has sufficient knowledge about the available resources in other nodes, without polling; each node measures the amount of its available resources and sends resource indices at regular intervals to a randomly chosen subset of nodes; the use of a randomly chosen subset of nodes facilitates dynamic configuration and overcomes node failures.

Decentralized Control and Autonomy: each node makes its own control decisions independently; no master-slave relationships; each node is capable of operating as an independent system; nodes may join or leave the farm with minimal disruption.

File System Access: MOSIX is particularly efficient for distributing and executing CPU-bound processes; however, processes with significant file operations are inefficient, because I/O accesses through the home node incur high overhead. "Direct FSA" is for better handling of I/O: it reduces the overhead of executing I/O-oriented system calls of a migrated process; a migrated process performs I/O operations locally, in the current node, not via the home node; processes migrate more freely.

DFSA Requirements: DFSA can work with any file system that satisfies certain properties. Unique mount point: the FS is identically mounted on all nodes. File consistency: when an operation is completed on one node, any subsequent operation on any other node sees the results of that operation (required because an openMOSIX process may perform consecutive syscalls from different nodes). Time-stamp consistency: if file A is modified after B, A must have a timestamp greater than B's timestamp.

DFSA-Conforming File Systems: Global File System (GFS); openMOSIX File System (MFS); Lustre global file system; General Parallel File System (GPFS); Parallel Virtual File System (PVFS). Available operations: all common file-system and I/O system calls.

Global File System (GFS): provides local caching and cache consistency over the cluster using a unique locking mechanism; provides direct access from any node to any storage entity; GFS plus process migration combines the advantages of load balancing with direct disk access from any node, for parallel file operations; non-GNU license (SPL).

The MOSIX File System (MFS): provides a unified view of all files and all mounted FSs on all the nodes of a MOSIX cluster, as if they were within a single file system; makes all directories and regular files throughout an openMOSIX cluster available from all the nodes; provides cache consistency; allows parallel file access by proper distribution of files (a process migrates to the node with the needed files).

MFS Namespace (diagram): a node's root directory (/etc, /usr, /var, /bin) plus an /mfs directory, under which the other nodes' file systems (/etc, /usr, /var, /bin, …) appear.

Lustre: A Scalable File System: scalable data serving through parallel data striping; scalable metadata; separation of file metadata and storage-allocation metadata to further increase scalability; object technology, allowing stackable, value-add functionality; distributed operation.

Parallel Virtual File System (PVFS): user-controlled striping of files across nodes; commodity network and storage hardware; MPI-IO support through ROMIO; traditional Linux file-system access through the pvfs-kernel package; the native PVFS library interface.

General Parallel File System (GPFS): "GPFS for Linux provides world class performance, scalability, and availability for file systems. It offers compliance to most UNIX file standards for end user applications and administrative extensions for ongoing management and tuning. It scales with the size of the Linux cluster and provides NFS Export capabilities outside the cluster."

Mosix Ancillary Tools: kernel debugger; kernel profiler; parallel make (every exec() becomes mexec()); openMosix PVM; openMosix MM5; openMosix HMMER; openMosix Mathematica.

Cluster Administration: LTSP; ClumpOs; mps; mtop; mosctl.

Mosix Commands & Files: setpe: starts and stops Mosix on the current node. tune: calibrates the node speed parameters. mtune: calibrates the node MFS parameters. migrate: forces a process to migrate. mosctl: a comprehensive Mosix administration tool. mosrun, nomig, runhome, runon, cpujob, iojob, nodecay, fastdecay, slowdecay: various ways to start a program in a specific way. mon & mosixview: CLI and graphical interfaces to monitor the cluster status. /etc/mosix.map: contains the IP numbers of the cluster nodes. /etc/mosgates: contains the number of gateway nodes present in the cluster. /etc/overheads: contains the output of the 'tune' command, to be loaded at startup. /etc/mfscosts: contains the output of the 'mtune' command, to be loaded at startup. /proc/mosix/admin/*: various files, sometimes binary, to check and control Mosix.

Monitoring: cluster monitor: 'mosmon' (or 'qtop'); displays load, speed, utilization, and memory information across the cluster; uses the /proc/hpc/info interface for retrieving the information. Applet/CGI-based monitoring tools display cluster properties; access via the Internet; multiple resources. openMosixview with an X GUI.

openMosixview, by Mathias Rechemburg (screenshot).

Qlusters OS: based in part on openMosix technology; migrating sockets; network RAM already implemented; cluster installer, configurator, monitor, queue manager, launcher, scheduler; partnership with IBM, Compaq, Red Hat, and Intel.

QlusterOS Monitor (screenshot).

More Information on Clusters: the IEEE Task Force on Cluster Computing (now the Technical Committee on Scalable Computing, TCSC). lcic.org: "a central repository of links and information regarding Linux clustering, in all its forms"; resources for clusters built on commodity hardware deploying Linux OS and open-source software. linuxclusters.com: "Authoritative resource for information on Linux Compute Clusters and Linux High Availability Clusters." "To provide education and advanced technical training for the deployment and use of Linux-based computing clusters to the high-performance computing community worldwide."

Levels of Parallelism (diagram): code granularity vs. code item. Large grain (task level): programs, e.g., Task i-1, Task i, Task i+1, handled with PVM/MPI. Medium grain (control level): functions (threads), e.g., func1(), func2(), func3(), handled with threads. Fine grain (data level): loops, e.g., a(0)=.., b(0)=.., a(1)=.., b(1)=.., handled by compilers. Very fine grain (multiple issue): handled with hardware, at the CPU level.