1 Cluster Computing with Linux. Prabhaker Mateti, Wright State University

2 Abstract: Cluster computing distributes the computational load to collections of similar machines. This talk describes what cluster computing is, the typical Linux packages used, and examples of large clusters in use today. It also reviews cluster computing modifications of the Linux kernel.

3 What Kind of Computing, did you say? Sequential, concurrent, parallel, distributed, networked, migratory, cluster, grid, pervasive, quantum, optical, molecular.

4 Fundamentals Overview

5 Fundamentals Overview: granularity of parallelism, synchronization, message passing, shared memory.

6 Granularity of Parallelism: fine-grained, medium-grained, and coarse-grained parallelism; NOWs (networks of workstations).

7 Fine-Grained Machines: tens of thousands of processor elements; the processor elements are slow (bit-serial) with small, fast, private RAM; shared memory; interconnection networks; message passing; Single Instruction Multiple Data (SIMD).

8 Medium-Grained Machines: typical configurations have thousands of processors; processor power is between coarse- and fine-grained; either shared or distributed memory; traditionally research machines; Single Code Multiple Data (SCMD).

9 Coarse-Grained Machines: typical configurations have hundreds to thousands of processors; the processors are powerful (fast CPUs) and large (cache, vectors, multiple fast buses); memory is shared or distributed-shared; Multiple Instruction Multiple Data (MIMD).

10 Networks of Workstations: exploit inexpensive workstations/PCs and a commodity network; the NOW becomes a "distributed memory multiprocessor"; workstations send and receive messages; C and Fortran programs use PVM, MPI, etc. libraries; programs developed on NOWs are portable to supercomputers for production runs.

11 Definition of "Parallel": S1 begins at time b1 and ends at e1; S2 begins at b2 and ends at e2. S1 || S2 begins at min(b1, b2) and ends at max(e1, e2), and is commutative (equivalent to S2 || S1).

12 Data Dependency: the sequential program "x := a + b; y := c + d;" may be rewritten as "x := a + b || y := c + d;" or with the statements reversed, "y := c + d; x := a + b;". All three are equivalent because x depends only on a and b, and y only on c and d; we assumed a, b, c, d are independent.
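To make this concrete, here is a minimal C sketch (added for illustration; POSIX threads are an assumption, not something the slide names) that runs the two independent assignments in separate threads; because they touch disjoint variables, any interleaving yields the same x and y:

    #include <pthread.h>
    #include <stdio.h>

    int a = 1, b = 2, c = 3, d = 4;   /* assumed independent inputs */
    int x, y;

    void *compute_x(void *arg) { x = a + b; return NULL; }   /* x := a + b */
    void *compute_y(void *arg) { y = c + d; return NULL; }   /* y := c + d */

    int main(void) {
        pthread_t t1, t2;
        /* the two statements touch disjoint variables, so running
           them in parallel is equivalent to either sequential order */
        pthread_create(&t1, NULL, compute_x, NULL);
        pthread_create(&t2, NULL, compute_y, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("x = %d, y = %d\n", x, y);   /* always 3 and 7 */
        return 0;
    }

Compile with cc -pthread.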

13 Types of Parallelism: result parallelism (the data structure can be split into parts of the same structure); specialist parallelism (each node specializes; pipelines); agenda parallelism (there is a list of things to do; each node can generalize).

14 Result Parallelism: also called embarrassingly parallel or perfect parallelism; computations that can be subdivided into sets of independent tasks that require little or no communication; Monte Carlo simulations; F(x, y, z).

15 Specialist Parallelism: different operations performed simultaneously on different processors. E.g., simulating a chemical plant: one processor simulates the preprocessing of chemicals, one simulates reactions in the first batch, another simulates refining the products, etc.

16 Agenda Parallelism: the MW (manager/worker) model. The manager initiates computation, tracks progress, handles workers' requests, and interfaces with the user. Workers are spawned and terminated by the manager, make requests to the manager, and send results to the manager.

17 Embarrassingly Parallel: result parallelism is obvious. Ex1: compute the square root of each of the million numbers given. Ex2: search for a given set of words among a billion web pages.

18 Reduction: combine several sub-results into one. Reducing r1 r2 ... rn with op becomes r1 op r2 op ... op rn. Hadoop is based on this idea.
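MPI expresses this directly; the following is an added sketch (assuming an MPI installation) in which each process contributes one sub-result and MPI_Reduce combines them with op = MPI_SUM onto rank 0:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int r = rank + 1;      /* this process's sub-result r_i */
        int total = 0;
        /* combine all sub-results with op = + onto rank 0 */
        MPI_Reduce(&r, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 1..%d = %d\n", nprocs, total);
        MPI_Finalize();
        return 0;
    }

Compile with mpicc and run with, e.g., mpirun -np 4.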

19 Shared Memory: process A writes to a memory location; process B reads from that memory location. Synchronization is crucial. Excellent speed. Semantics ... ?

20 Shared Memory: needs hardware support (multi-ported memory) and atomic operations such as test-and-set and semaphores.
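To make atomic test-and-set concrete, here is an added C sketch (using the GCC __sync builtins as a stand-in for the hardware primitive) of a spinlock protecting a shared counter:

    #include <pthread.h>
    #include <stdio.h>

    static volatile int lock = 0;     /* 0 = free, 1 = held */
    static int counter = 0;

    static void acquire(void) {
        /* atomic test-and-set: write 1, return the old value;
           loop until the old value was 0 (the lock was free) */
        while (__sync_lock_test_and_set(&lock, 1))
            ;
    }

    static void release(void) {
        __sync_lock_release(&lock);   /* atomic store of 0 */
    }

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            acquire();
            counter++;                /* critical section */
            release();
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", counter);   /* 200000, never less */
        return 0;
    }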

21 Shared Memory Semantics: Assumptions. Global time is available, in discrete increments. A shared variable s = vi at ti, i = 0, .... Process A performs s := v1 at time t1; assume no other assignment occurs after t1. Process B reads s at time t and gets value v.

22 Shared Memory: Semantics. Value of the shared variable: v = v1 if t > t1; v = v0 if t < t1; v = ?? if t = t1 (t = t1 ± a discrete quantum). The next update of the shared variable occurs at t2 = t1 + ?

23 Distributed Shared Memory: "simultaneous" read/write access by spatially distributed processors; an abstraction layer over an implementation built from message-passing primitives; the semantics are not so clean.

24 Semaphores: Semaphore s; V(s) ::= ⟨ s := s + 1 ⟩; P(s) ::= ⟨ when s > 0 do s := s − 1 ⟩, where ⟨...⟩ marks an atomic action. A deeply studied theory.
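A POSIX rendering of P and V in C (an added sketch; sem_wait is P, sem_post is V, and an initial value of 1 gives mutual exclusion):

    #include <semaphore.h>
    #include <pthread.h>
    #include <stdio.h>

    sem_t s;
    int shared = 0;

    void *worker(void *arg) {
        sem_wait(&s);     /* P(s): wait until s > 0, then s := s - 1 */
        shared++;         /* critical section */
        sem_post(&s);     /* V(s): s := s + 1 */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        sem_init(&s, 0, 1);               /* s starts at 1 */
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);  /* always 2 */
        sem_destroy(&s);
        return 0;
    }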

25 Condition Variables: Condition C; C.wait(); C.signal().
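In C with POSIX threads (an added sketch; pthread_cond_wait and pthread_cond_signal play the roles of C.wait() and C.signal(), with a mutex and a guard flag to handle spurious wakeups):

    #include <pthread.h>
    #include <stdio.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    int ready = 0;

    void *waiter(void *arg) {
        pthread_mutex_lock(&m);
        while (!ready)                     /* re-test the condition */
            pthread_cond_wait(&c, &m);     /* C.wait(): release m and sleep */
        printf("condition signalled\n");
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, waiter, NULL);
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&c);           /* C.signal(): wake one waiter */
        pthread_mutex_unlock(&m);
        pthread_join(t, NULL);
        return 0;
    }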

26 Distributed Shared Memory: a common address space that all the computers in the cluster share; its semantics are difficult to describe.

27 Distributed Shared Memory: Issues. The processors are spatially distributed, over a LAN or a WAN, and no global time is available.

28 Distributed Computing: no shared memory; communication among processes by sending and receiving messages, asynchronously or synchronously; synergy among processes.

29 Messages: messages are sequences of bytes moving between processes. The sender and receiver must agree on the type structure of values in the message. "Marshalling": laying out the data so that there is no ambiguity such as "four chars" vs. "one integer".

30 Message Passing: process A sends a data buffer as a message to process B. Process B waits for a message from A, and when it arrives copies it into its own local memory. No memory is shared between A and B.

31 Message Passing: obviously, messages cannot be received before they are sent, and a receiver waits until there is a message. Asynchronous: the sender never blocks, even if infinitely many messages are waiting to be received. Semi-asynchronous is a practical version of the above, with a large but finite amount of buffering.

32 Message Passing: Point to Point. Q: send(m, P) sends message m to process P. P: recv(x, Q) receives a message from process Q and places it in variable x. For the message data, the type of x must match that of m; the effect is as if x := m.
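In MPI terms (an added sketch, assuming an MPI installation), send(m, P) and recv(x, Q) map onto MPI_Send and MPI_Recv; run with at least two processes, e.g., mpirun -np 2:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, m = 7, x = 0;
        MPI_Status st;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {          /* Q: send(m, P), where P is rank 1 */
            MPI_Send(&m, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {   /* P: recv(x, Q), where Q is rank 0 */
            MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &st);
            printf("P received x = %d, as if x := m\n", x);
        }
        MPI_Finalize();
        return 0;
    }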

33 Broadcast: one sender Q, multiple receivers P; not all receivers may receive at the same time. Q: broadcast(m) sends message m to all processes. P: recv(x, Q) receives the message from process Q and places it in variable x.
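The corresponding MPI sketch (again an added illustration, not from the slide): every process calls MPI_Bcast; the root supplies m, and the rest receive it into their own copy:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, m = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            m = 42;               /* Q's message */
        /* rank 0 sends, all others receive; receivers need not
           all complete at the same instant */
        MPI_Bcast(&m, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("process %d has m = %d\n", rank, m);
        MPI_Finalize();
        return 0;
    }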

34 Synchronous Message Passing: the sender blocks until the receiver is ready to receive; cannot send messages to self; no buffering.

35 Asynchronous Message Passing: the sender never blocks; the receiver receives when ready; can send messages to self; infinite buffering.

36 Message Passing: speed is not so good; the sender copies the message into system buffers, the message travels the network, and the receiver copies the message from system buffers into local memory. Special virtual memory techniques help. Programming quality: less error-prone compared to shared memory.

37 Computer Architectures

38 Architectures of Top 500 Systems (chart)

39 Architectures of Top 500 Systems (chart)

41 "Parallel" Computers: traditional supercomputers (SIMD, MIMD, pipelines; tightly coupled shared memory; bus-level connections; expensive to buy and to maintain) and cooperating networks of computers.

42 Traditional Supercomputers: very high starting cost (expensive hardware and expensive software), high maintenance, expensive to upgrade.

43 Computational Grids: "Grids are persistent environments that enable software applications to integrate instruments, displays, computational and information resources that are managed by diverse organizations in widespread locations."

44 Computational Grids: individual nodes can be supercomputers or NOWs; high availability; accommodate peak usage; LAN : Internet :: NOW : Grid.

45 Buildings-Full of Workstations: 1. Distributed OSes have not taken a foothold. 2. Powerful personal computers are ubiquitous. 3. Mostly idle: more than 90% of the up-time? 4. 100 Mb/s LANs are common. 5. Windows and Linux are the top two OSes in terms of installed base.

46 Networks of Workstations (NOW): workstation, network, operating system, cooperation, distributed+parallel programs.

47 What is a Workstation? PC? Mac? Sun ...? A "workstation OS"?

48 "Workstation OS": authenticated users, protection of resources, multiple processes, preemptive scheduling, virtual memory, hierarchical file systems, network centric.

49 Clusters of Workstations: an inexpensive alternative to traditional supercomputers; high availability; lower downtime; easier access; a development platform, with production runs on traditional supercomputers.

50 Clusters of Workstations: dedicated nodes and come-and-go nodes.

51 Clusters with Part-Time Nodes: cycle stealing is running jobs on a workstation that do not belong to its owner. Definition of idleness: e.g., no keyboard and no mouse activity. Tools/libraries: Condor, PVM, MPI.

52 Cooperation: workstations are "personal"; others' use slows you down; ... willing to share, willing to trust.

53 Cluster Characteristics: commodity off-the-shelf hardware; networked; common home directories; open-source software and OS; support for message-passing programming; batch scheduling of jobs; process migration.

54 Beowulf Cluster: dedicated nodes; single system view; commodity off-the-shelf hardware; internal high-speed network; open-source software and OS; support for parallel programming such as MPI and PVM; full trust in each other (login from one node into another without authentication; shared file system subtree).

55 Example Clusters: July 1999; 1000 nodes; used for genetic algorithm research by John Koza, Stanford University; www.genetic-programming.com/

56 Typical Big Beowulf: a 1000-node Beowulf cluster system, used for genetic algorithm research by John Koza, Stanford University; http://www.genetic-programming.com/

57 Largest Cluster System: IBM BlueGene, 2007, at DOE/NNSA/LLNL. Memory: 73728 GB; OS: CNK/SLES 9; interconnect: proprietary; PowerPC 440; 106,496 nodes; 478.2 teraflops on LINPACK.

58 2008 World's Fastest: Roadrunner. Operating system: Linux; interconnect: InfiniBand; 129600 cores: PowerXCell 8i, 3200 MHz; 1105 TFLOPS; at DOE/NNSA/LANL.

59 Cluster Computers for Rent: transfer executable files, source code, or data to your secure personal account on TTI servers (1); do this securely using WinSCP for Windows or "secure copy" scp for Linux. To execute your program, simply submit a job (2) to the scheduler using the "menusub" command, or do it manually using "qsub" (we use the popular PBS batch system); there are working examples of how to submit your executable. Your executable is securely placed on one of our in-house clusters for execution (3). Your results and data are written to your personal account in real time; download your results (4).

60 Turnkey Cluster Vendors: fully integrated Beowulf clusters with commercially supported Beowulf software systems are available from HP (www.hp.com/solutions/enterprise/highavailability/), IBM (www.ibm.com/servers/eserver/clusters/), Northrop Grumman, Accelerated Servers, Penguin Computing, www.aspsys.com/clusters, and www.pssclabs.com.

61 Why Are Linux Clusters Good? Low initial implementation cost: inexpensive PCs, standard components and networks, free software (Linux, GNU, MPI, PVM). Scalability: can grow and shrink. Familiar technology: easy for users to adopt the approach, and to use and maintain the system.

62 2007 OS Share of Top 500 (Nov 2007, http://www.top500.org/stats/list/30/osfam)

    OS        Count   Share    Rmax (GF)   Rpeak (GF)   Processors
    Linux     426     85.20%   4897046     7956758      970790
    Windows   6       1.20%    47495       86797        12112
    Unix      30      6.00%    408378      519178       73532
    BSD       2       0.40%    44783       50176        5696
    Mixed     34      6.80%    1540037     1900361      580693
    MacOS     2       0.40%    28430       44816        5272
    Totals    500     100%     6966169     10558086     1648095

63 2008 OS Share of Top 500

    OS Family   Count   Share    Rmax (GF)   Rpeak (GF)   Procs
    Linux       439     87.80%   13309834    20775171     2099535
    Windows     5       1.00%    328114      429555       54144
    Unix        23      4.60%    881289      1198012      85376
    BSD Based   1       0.20%    35860       40960        5120
    Mixed       31      6.20%    2356048     2933610      869676
    Mac OS      1       0.20%    16180       24576        3072
    Totals      500     100%     16927325    25401883     3116923

64 Many Books on Linux Clusters: search google.com and amazon.com. Example book: William Gropp, Ewing Lusk, and Thomas Sterling, Beowulf Cluster Computing with Linux, MIT Press, 2003, ISBN 0-262-69292-9.

65 Why Is Beowulf Good? Low initial implementation cost: inexpensive PCs, standard components and networks, free software (Linux, GNU, MPI, PVM). Scalability: can grow and shrink. Familiar technology: easy for users to adopt the approach, and to use and maintain the system.

66 Single System Image: a common filesystem view from any node; common accounts on all nodes; a single software installation point; easy to install and maintain the system; easy to use for end users.

67 Closed Cluster Configuration (diagram): compute nodes sit on an internal high-speed network and a service network; a gateway node connects to the external network; a file server node and a front-end complete the cluster.

68 Open Cluster Configuration (diagram): the compute nodes are directly reachable from the external network; the file server node and front-end share the high-speed network with them.

69 DIY Interconnection Network: most popular is Fast Ethernet; network topologies include mesh and torus; switch vs. hub.

70 Software Components: operating system (Linux, FreeBSD, ...); "parallel" programs (PVM, MPI, ...); utilities; open source.

71 Cluster Computing: running ordinary programs as-is on a cluster is not cluster computing. Cluster computing takes advantage of result parallelism, agenda parallelism, reduction operations, and process-grain parallelism.

72 Google Linux Clusters: GFS, the Google File System; thousands of terabytes of storage across thousands of disks on over a thousand machines; 150 million queries per day; average response time of 0.25 sec; near-100% uptime.

73 Cluster Computing Applications: Mathematical: fftw (fast Fourier transform), pblas (parallel basic linear algebra software), atlas (a collection of mathematical libraries), sprng (scalable parallel random number generator), MPITB (MPI toolbox for MATLAB). Quantum chemistry software: Gaussian, Q-Chem. Molecular dynamics solvers: NAMD, GROMACS, GAMESS. Weather modeling: MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html).

74 Development of Cluster Programs: new algorithms + code; old programs re-done (reverse-engineer the design and re-code; use new languages that have distributed and parallel primitives; use new libraries); parallelize legacy code (mechanical conversion by software tools).

75 Distributed Programs: spatially distributed programs (a part here, a part there, ...; parallel; synergy); temporally distributed programs (compute half today, half tomorrow; combine the results at the end); migratory programs (have computation, will travel).

76 Technological Bases of Distributed+Parallel Programs: spatially distributed programs use message passing; temporally distributed programs use shared memory; migratory programs use serialization of data and programs.

77 Technological Bases for Migratory Programs: same CPU architecture (x86, PowerPC, MIPS, SPARC, ..., JVM); same OS + environment; being able to "checkpoint": suspend, and then resume computation, without loss of progress.

78 Parallel Programming Languages: shared-memory languages, distributed-memory languages, object-oriented languages, functional programming languages, concurrent logic languages, data flow languages.

79 Linda: Tuple Spaces, Shared Memory: atomic primitives in(t), read(t), out(t), and eval(t); host language: e.g., C/Linda, JavaSpaces.
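Since C/Linda compilers are rare, here is an added plain-C sketch of the idea only: a toy tuple space holding bare integers, where out() deposits a value and in() atomically removes one, blocking until a tuple is available (real Linda tuples are typed and matched associatively):

    #include <pthread.h>
    #include <stdio.h>

    #define CAP 64
    static int space[CAP];                 /* the "tuple space" */
    static int count = 0;
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

    static void out(int v) {               /* deposit a tuple */
        pthread_mutex_lock(&m);
        space[count++] = v;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&m);
    }

    static int in(void) {                  /* atomically withdraw one */
        pthread_mutex_lock(&m);
        while (count == 0)
            pthread_cond_wait(&nonempty, &m);
        int v = space[--count];
        pthread_mutex_unlock(&m);
        return v;
    }

    static void *worker(void *arg) {
        printf("worker got %d\n", in());
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        out(7);                            /* worker's in() unblocks */
        pthread_join(t, NULL);
        return 0;
    }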

80 Data Parallel Languages: data is distributed over the processors as arrays; entire arrays are manipulated, e.g., A(1:100) = B(1:100) + C(1:100); the compiler generates parallel code; Fortran 90; High Performance Fortran (HPF).

81 Parallel Functional Languages: Erlang (http://www.erlang.org/); SISAL (http://www.llnl.gov/sisal/); PCN (Argonne); Haskell-Eden (http://www.mathematik.uni-marburg.de/~eden); Objective Caml with BSP; SAC, a functional array language.

82 Message Passing Libraries: the programmer is responsible for initial data distribution, synchronization, and sending and receiving information. Parallel Virtual Machine (PVM); Message Passing Interface (MPI); Bulk Synchronous Parallel model (BSP).

83 BSP: Bulk Synchronous Parallel Model: divides computation into supersteps; in each superstep a processor can work on local data and send messages; at the end of the superstep, a barrier synchronization takes place and all processors receive the messages that were sent in the previous superstep.

84 BSP: Bulk Synchronous Parallel Model: http://www.bsp-worldwide.org/. Book: Rob H. Bisseling, "Parallel Scientific Computation: A Structured Approach using BSP and MPI," Oxford University Press, 2004, 324 pages, ISBN 0-19-852939-2.

85 BSP Library: a small number of subroutines to implement process creation, remote data access, and bulk synchronization; linked to C, Fortran, ... programs.
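An added sketch in the style of the classic BSPlib primitives (bsp_begin, bsp_pid, bsp_put, bsp_sync; assumes a BSPlib implementation such as the Oxford toolset or BSPonMPI is installed): each processor sends its id to its right neighbor within one superstep:

    #include <bsp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        bsp_begin(bsp_nprocs());
        int p = bsp_pid(), n = bsp_nprocs();
        int left = -1;
        bsp_push_reg(&left, sizeof left);  /* make 'left' remotely writable */
        bsp_sync();                        /* superstep boundary */

        /* superstep: local work plus remote writes */
        bsp_put((p + 1) % n, &p, &left, 0, sizeof p);
        bsp_sync();                        /* barrier; puts are now visible */

        printf("processor %d received %d\n", p, left);
        bsp_pop_reg(&left);
        bsp_end();
        return 0;
    }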

86 Portable Batch System (PBS): Prepare a .cmd file naming the program and its arguments, the properties of the job, and the needed resources. Submit the .cmd file to the PBS job server with the qsub command. Routing and scheduling: the job server examines the .cmd details to route the job to an execution queue, allocates one or more cluster nodes to the job, and communicates with the execution servers (MOMs) on the cluster to determine the current state of the nodes; when all of the needed nodes are allocated, it passes the .cmd on to the execution server on the first node allocated (the "mother superior"). The execution server will log in on the first node as the submitting user and run the .cmd file in the user's home directory, run an installation-defined prologue script, gather the job's output to standard output and standard error, execute an installation-defined epilogue script, and deliver stdout and stderr to the user.
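A small sample .cmd file (a hedged sketch: the program name mcpi and the resource numbers are made up for illustration, while the #PBS directives themselves are standard PBS):

    #!/bin/sh
    # job name, resources, and merged stdout/stderr
    #PBS -N mcpi
    #PBS -l nodes=4:ppn=2,walltime=00:30:00
    #PBS -j oe
    cd $PBS_O_WORKDIR      # directory from which qsub was invoked
    mpirun -np 8 ./mcpi 1000000

Submit it with qsub mcpi.cmd; the job server takes over from there.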

87 TORQUE, an Open-Source PBS: the Tera-scale Open-source Resource and QUEue manager (TORQUE) enhances OpenPBS with fault tolerance (additional failure conditions checked/handled; node health check script support), a scheduling interface, scalability (a significantly improved server-to-MOM communication model; the ability to handle larger clusters (over 15 TF / 2,500 processors), larger jobs (over 2000 processors), and larger server messages), and logging. http://www.supercluster.org/projects/torque/

88 PVM and MPI: message-passing primitives; can be embedded in many existing programming languages; architecturally portable; open-sourced implementations.

89 Parallel Virtual Machine (PVM): PVM enables a heterogeneous collection of networked computers to be used as a single large parallel computer; older than MPI; large scientific/engineering user community; http://www.csm.ornl.gov/pvm/
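An added master/worker sketch using the classic PVM 3 calls (assumes PVM is running; "pvm_echo" is a hypothetical name for this same executable, installed where pvm_spawn can find it): the parent spawns one child, sends it an integer, and the child doubles it and replies:

    #include <pvm3.h>
    #include <stdio.h>

    int main(void) {
        pvm_mytid();                            /* enroll in PVM */
        int parent = pvm_parent();

        if (parent == PvmNoParent) {            /* master side */
            int child, n = 21, reply;
            pvm_spawn("pvm_echo", NULL, PvmTaskDefault, "", 1, &child);
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);                /* marshal the int */
            pvm_send(child, 1);                 /* tag 1: request */
            pvm_recv(child, 2);                 /* tag 2: reply */
            pvm_upkint(&reply, 1, 1);
            printf("master got %d\n", reply);   /* 42 */
        } else {                                /* worker side */
            int n;
            pvm_recv(parent, 1);
            pvm_upkint(&n, 1, 1);
            n *= 2;
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&n, 1, 1);
            pvm_send(parent, 2);
        }
        pvm_exit();
        return 0;
    }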

90 Message Passing Interface (MPI): http://www-unix.mcs.anl.gov/mpi/; MPI-2.0: http://www.mpi-forum.org/docs/; MPICH (www.mcs.anl.gov/mpi/mpich/) by Argonne National Laboratory and Mississippi State University; LAM (http://www.lam-mpi.org/); Open MPI (http://www.open-mpi.org/).

91 OpenMP for Shared Memory: a shared-memory programming API; the user gives hints as directives to the compiler; http://www.openmp.org
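A minimal OpenMP example in C (added here; compile with an OpenMP-aware compiler, e.g., gcc -fopenmp): one directive is the "hint" that asks the compiler to split the loop across threads and combine the partial sums:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 100 };
        double a[N], b[N], sum = 0.0;
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* parallelize the loop; each thread gets a private partial
           sum, and the partial sums are reduced with + at the end */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot product = %f\n", sum);
        return 0;
    }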

92 SPMD: single program, multiple data; contrast with SIMD; the same program runs on multiple nodes; may or may not be lock-step; nodes may be of different speeds; barrier synchronization.

93 Condor: cooperating workstations that come and go; migratory programs; checkpointing; remote I/O; resource matching; http://www.cs.wisc.edu/condor/

94 Migration of Jobs: policies: immediate-eviction or pause-and-migrate. Technical issues: checkpointing (preserving the state of the process so it can be resumed) and migrating from one architecture to another.

95 Kernels Etc Mods for Clusters: dynamic load balancing; transparent process migration. Kernel mods: http://openmosix.sourceforge.net/; http://kerrighed.org/; http://openssi.org/; http://ci-linux.sourceforge.net/ (CLuster Membership Subsystem, "CLMS", and Internode Communication Subsystem). http://www.gluster.org/: GlusterFS, clustered file storage of petabytes; GlusterHPC, high-performance compute clusters. http://boinc.berkeley.edu/: open-source software for volunteer computing and grid computing.

96 OpenMosix Distro: Quantian Linux: boot from DVD-ROM; compressed file system on DVD; several GB of cluster software; http://dirk.eddelbuettel.com/quantian.html. Live CD/DVD or single-floppy bootables: http://bofh.be/clusterknoppix/; http://sentinix.org/; http://itsecurity.mq.edu.au/chaos/; http://openmosixloaf.sourceforge.net/; http://plumpos.sourceforge.net/; http://www.dynebolic.org/; http://bccd.cs.uni.edu/; http://eucaristos.sourceforge.net/; http://gomf.sourceforge.net/. Can be installed on HDD.

97 What is openMOSIX? An open-source enhancement to the Linux kernel; a cluster with come-and-go nodes; system image model: a virtual machine with lots of memory and CPU; granularity: the process; improves the overall (cluster-wide) performance; a multi-user, time-sharing environment for the execution of both sequential and parallel applications; applications run unmodified (no need to link with a special library).

98 What is openMOSIX? Execution environment: a farm of diskless x86-based nodes, UP (uniprocessor) or SMP (symmetric multiprocessor), connected by a standard LAN (e.g., Fast Ethernet). Adaptive resource management to dynamic load characteristics (CPU, RAM, I/O, etc.). Linear scalability.

99 Users' View of the Cluster: users can start from any node in the cluster, or the sysadmin sets up a few nodes as login nodes; round-robin DNS: "hpc.clusters" with many IPs assigned to the same name; each process has a home node; migrated processes always appear to run at the home node, e.g., "ps" shows all your processes, even if they run elsewhere.

100 MOSIX Architecture: network transparency; preemptive process migration; dynamic load balancing; memory sharing; efficient kernel communication; probabilistic information dissemination algorithms; decentralized control and autonomy.

101 A Two-Tier Technology: (1) information gathering and dissemination, supporting scalable configurations by probabilistic dissemination algorithms, with the same overhead for 16 nodes or 2056 nodes; (2) preemptive process migration that can migrate any process, anywhere, anytime, transparently, supervised by adaptive algorithms that respond to global resource availability; transparent to applications, with no change to the user interface.

102 Tier 1: Information Gathering and Dissemination: in each unit of time (e.g., 1 second) each node gathers information about CPU(s) speed, load, and utilization, free memory, and free proc-table/file-table slots; the info is sent to a randomly selected node; scalable: more nodes, better scattering.

103 Tier 2: Process Migration: load balancing: reduce the variance between pairs of nodes to improve overall performance; memory ushering: migrate processes from a node that has nearly exhausted its free memory, to prevent paging; parallel file I/O: bring the process to the file server, with direct file I/O from migrated processes.

104 Network Transparency: the user and applications are provided a virtual machine that looks like a single machine. Example: disk access from diskless nodes on a file server is completely transparent to programs.

105 Preemptive Process Migration: any user's process, transparently and at any time, can/may migrate to any other node. The migrating process is divided into: a system context (deputy) that may not be migrated from the home workstation (UHN); and a user context (remote) that can be migrated to a diskless node.

106 Splitting the Linux Process: the system context (environment) is site-dependent and "home"-confined; deputy and remote are connected by an exclusive link for both synchronous (system calls) and asynchronous (signals, MOSIX events) interactions; the process context (code, stack, data) is site-independent and may migrate. (Diagram: the deputy stays in the kernel on the local master node; the remote part runs in userland on a diskless node, connected over the openMOSIX link.)

107 Dynamic Load Balancing: initiates process migrations in order to balance the load of the farm; responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds; makes continuous attempts to reduce the load differences among nodes; the policy is symmetrical and decentralized: all of the nodes execute the same algorithm, and the reduction of load differences is performed independently by any pair of nodes.

108 Memory Sharing: places the maximal number of processes in the farm's main memory, even if that implies an uneven load distribution among the nodes; delays swapping out of pages as much as possible; the decision of which process to migrate, and where to migrate it, is based on knowledge of the amount of free memory in other nodes.

109 Efficient Kernel Communication: reduces the overhead of internal kernel communications (e.g., between the process and its home site when it is executing at a remote site); a fast and reliable protocol with low startup latency and high throughput.

110 Probabilistic Information Dissemination Algorithms: each node has sufficient knowledge about available resources in other nodes, without polling; each node measures the amount of its available resources and sends resource indices at regular intervals to a randomly chosen subset of nodes; the use of a randomly chosen subset of nodes facilitates dynamic configuration and overcomes node failures.

111 Decentralized Control and Autonomy: each node makes its own control decisions independently; no master-slave relationships; each node is capable of operating as an independent system; nodes may join or leave the farm with minimal disruption.

112 File System Access: MOSIX is particularly efficient for distributing and executing CPU-bound processes; however, processes with significant file operations are inefficient, because I/O accesses through the home node incur high overhead. "Direct FSA" is for better handling of I/O: it reduces the overhead of executing I/O-oriented system calls of a migrated process; a migrated process performs I/O operations locally, in the current node, not via the home node; processes migrate more freely.

113 DFSA Requirements: DFSA can work with any file system that satisfies some properties. Unique mount point: the file systems are identically mounted on all nodes. File consistency: when an operation is completed on one node, any subsequent operation on any other node sees the results of that operation (required because an openMOSIX process may perform consecutive syscalls from different nodes). Time-stamp consistency: if file A is modified after B, A must have a timestamp greater than B's timestamp.

114 DFSA-Conforming File Systems: Global File System (GFS); openMOSIX File System (MFS); Lustre global file system; General Parallel File System (GPFS); Parallel Virtual File System (PVFS). Available operations: all common file-system and I/O system calls.

115 Global File System (GFS): provides local caching and cache consistency over the cluster using a unique locking mechanism; provides direct access from any node to any storage entity; GFS + process migration combine the advantages of load balancing with direct disk access from any node, for parallel file operations; non-GNU license (SPL).

116 The MOSIX File System (MFS): provides a unified view of all files and all mounted file systems on all the nodes of a MOSIX cluster, as if they were within a single file system; makes all directories and regular files throughout an openMOSIX cluster available from all the nodes; provides cache consistency; allows parallel file access by proper distribution of files (a process migrates to the node with the needed files).

117 MFS Namespace (diagram): the root file system contains the usual /etc, /usr, /var, /bin, plus the /mfs mount point through which the other nodes' file systems are reached.

118 Lustre: A Scalable File System: http://www.lustre.org/; scalable data serving through parallel data striping; scalable metadata; separation of file metadata and storage-allocation metadata to further increase scalability; object technology, allowing stackable, value-add functionality; distributed operation.

119 Parallel Virtual File System (PVFS): http://www.parl.clemson.edu/pvfs/; user-controlled striping of files across nodes; commodity network and storage hardware; MPI-IO support through ROMIO; traditional Linux file system access through the pvfs-kernel package; the native PVFS library interface.

120 General Parallel File System (GPFS): www.ibm.com/servers/eserver/clusters/software/gpfs.html. "GPFS for Linux provides world class performance, scalability, and availability for file systems. It offers compliance to most UNIX file standards for end user applications and administrative extensions for ongoing management and tuning. It scales with the size of the Linux cluster and provides NFS Export capabilities outside the cluster."

121 Mosix Ancillary Tools: kernel debugger; kernel profiler; parallel make (all exec() calls become mexec()); openMosix PVM; openMosix MM5; openMosix HMMER; openMosix Mathematica.

122 Cluster Administration: LTSP (www.ltsp.org); ClumpOS (www.clumpos.org); mps; mtop; mosctl.

123 Mosix Commands & Files: setpe starts and stops Mosix on the current node; tune calibrates the node speed parameters; mtune calibrates the node MFS parameters; migrate forces a process to migrate; mosctl is the comprehensive Mosix administration tool; mosrun, nomig, runhome, runon, cpujob, iojob, nodecay, fastdecay, and slowdecay are various ways to start a program in a specific way; mon and mosixview are CLI and graphic interfaces to monitor the cluster status; /etc/mosix.map contains the IP numbers of the cluster nodes; /etc/mosgates contains the number of gateway nodes present in the cluster; /etc/overheads contains the output of the 'tune' command, to be loaded at startup; /etc/mfscosts contains the output of the 'mtune' command, to be loaded at startup; /proc/mosix/admin/* are various files, sometimes binary, to check and control Mosix.

124 Monitoring: the cluster monitor, 'mosmon' (or 'qtop'), displays load, speed, utilization, and memory information across the cluster, using the /proc/hpc/info interface to retrieve the information; applet/CGI-based monitoring tools display cluster properties, with access via the Internet and multiple resources; openMosixview with an X GUI.

125 openMosixview: by Matthias Rechenburg; www.mosixview.com

126 Qlusters OS: http://www.qlusters.com/; based in part on openMosix technology; migrating sockets; network RAM already implemented; cluster installer, configurator, monitor, queue manager, launcher, scheduler; partnerships with IBM, Compaq, Red Hat, and Intel.

127 QlusterOS Monitor (screenshot)

128 More Information on Clusters: www.ieeetfcc.org/, the IEEE Task Force on Cluster Computing (now the Technical Committee on Scalable Computing, TCSC); lcic.org/, "a central repository of links and information regarding Linux clustering, in all its forms"; www.beowulf.org, resources for clusters built on commodity hardware deploying Linux OS and open-source software; linuxclusters.com/, "Authoritative resource for information on Linux Compute Clusters and Linux High Availability Clusters"; www.linuxclustersinstitute.org/, "To provide education and advanced technical training for the deployment and use of Linux-based computing clusters to the high-performance computing community worldwide."

129 Levels of Parallelism (diagram): code granularity from large to very fine. Large grain (task level): whole programs (tasks i-1, i, i+1), exploited with PVM/MPI. Medium grain (control level): functions/threads (func1(), func2(), func3()), exploited with threads. Fine grain (data level): loop iterations (a(0)=.., b(0)=.., a(1)=.., b(1)=.., ...), exploited by compilers. Very fine grain (multiple issue): individual operations, exploited by the CPU hardware.

