Cluster and SGE. March 3rd, 2006. Chen Peng, Lilly System Biology.


1 Cluster and SGE

2 Objectives
- To raise awareness of our resources
- To explore the possibilities
- Not hands-on training for SGE
- No details on parallel programming or the bix environment

3 Agenda
- Introduction
- Cluster@LSB
  - Hardware/software
  - What can we do, and what can we not do?
- Working with the cluster
  - SGE: power and limitations
  - How can the cluster help us?
  - Using SGE to manage jobs as a non-privileged user
  - Why can a cluster be evil?
- Q/A

4 Cluster, cheap super-computing
Features:
- Composed of commodity PCs
- Interconnected by a high-speed network
- Resources managed by the OS or by software
- Handles complex tasks as a single unit
- High scalability
- Low cost: a top-10 HPC cluster (2400 CPUs) for less than 6 million dollars

5 A cluster in Singapore
- 64 nodes (2 CPUs per node)
- Gigabit Ethernet
- Extended to 75 nodes and 80 nodes
- Managed by LSF
- Implemented for less than 400K SGD

6 Clusters: quick overview
Various types:
- High Availability (DB, file server)
- Load Balancing (Web server, search engine)
- High Performance Computing (our focus today)

7 HPC Cluster: Beowulf
- Beowulf: the first implementation of cluster computing
- Targeted at parallel computing tasks
- Needs fast interconnections
- Requires developing parallel code:
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)

8 HPC Cluster: Mosix
- Mosix/OpenMosix/openSSI
- Based on the Beowulf implementation
- SSI: Single System Image
- OS shared by all the nodes
- Kernel-level integration
- Automatic process migration among nodes
- Ideal for parallel tasks and I/O-intensive work
- Still requires developing parallel code

9 HPC Cluster: Compute farm (1)
- Jobs are managed by a Distributed Resource Management (DRM) system (LSF, SGE, PBS, Torque).

10 HPC Cluster: Compute Farm (2)
Embarrassingly parallel:
- Large numbers of nearly identical jobs, with different inputs or parameters
- The user or a user-space application prepares the inputs or parameters
- Little inter-job communication; mostly independent tasks
- Example: High Throughput Blast
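
The "nearly identical jobs, different inputs or parameters" pattern can be illustrated with a minimal Python sketch. The program name `analyze` and the `--input`/`--param` flags are hypothetical placeholders, not part of any tool named in the slides; the point is only that each command is independent of the others.

```python
from itertools import product


def make_job_commands(program, inputs, params):
    """Build one independent command (as an argv list) per (input, parameter) pair.

    The flag names below are hypothetical; substitute the real program's options.
    """
    return [
        [program, "--input", inp, "--param", par]
        for inp, par in product(inputs, params)
    ]


if __name__ == "__main__":
    # Two inputs x two parameter values -> four jobs that never talk to each other.
    for cmd in make_job_commands("analyze", ["a.fa", "b.fa"], ["1e-5", "1e-10"]):
        print(" ".join(cmd))
```

Because no command depends on another, a DRM system is free to run them in any order on any node.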

11 Introduction: summary
- Features
- Various types
- HPC architectures compared:
  - Beowulf: interconnected PCs running parallel code using MPI/PVM
  - Mosix: kernel integration for process migration
  - Compute farm: runs many similar jobs in an "embarrassingly parallel" manner

12 Agenda
- Introduction
- Cluster@LSB
  - Hardware/software
  - What can we do, and what can we not do?
- Working with the cluster
  - SGE: power and limitations
  - How can the cluster help us?
  - Using SGE to manage jobs as a non-privileged user
  - Why can a cluster be evil?
- Q/A

13 Hardware
- It is a compute farm
- Head node (pecos): 2x Intel P3 1.2 GHz, 4 GB
- 32 compute nodes: 2x Intel P3 1.2 GHz, 2 GB
- 100Mb Ethernet

14 Software
- OS: RedHat 9
- Sun Grid Engine (SGE)
- MPI libraries: LAM/MPI, MPICH
- PVM libraries: PVM-v3
- Parallel computing packages for R: Rmpi, rpvm, SNOW
- Matlab for distributed computing
- Coming soon...
  - Parallel blast implemented by "Scalable System"
  - A general command line interface
  - Wrappers for HT Blast and mpiBLAST

15 About current configuration
- 64 CPUs are available
- Limited memory on each node, not shared
- Slow interconnections
What does all this mean?
- Capable of managing a large number of jobs, each with little communication with the others
- Not recommended for serious parallel jobs, but may work for proof-of-concept tasks

16 Agenda
- Introduction
- Cluster@LSB
  - Hardware/software
  - What can we do, and what can we not do?
- Working with the cluster
  - SGE: power and limitations
  - How can the cluster help us?
  - Using SGE to manage jobs as a non-privileged user
  - Why may a cluster be evil?
- Q/A

17 Sun Grid Engine
Sun Grid Engine is Distributed Resource Management (DRM) software. It helps to:
- optimally place computing tasks
- allow users to queue more computing tasks
- ensure that tasks are executed fairly with respect to priority
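
The three roles listed above (placement, queueing, priority) can be pictured with a toy dispatcher. This is only an illustration of the idea, not SGE's actual scheduling algorithm; the node names, the "lower rank runs first" convention, and the "most free slots wins" placement rule are all assumptions made for the sketch.

```python
import heapq


def dispatch(jobs, free_slots):
    """Toy dispatcher: place queued jobs on nodes in rank order.

    jobs: list of (rank, name) tuples; a lower rank is served first (an
          assumption of this sketch, not SGE's priority convention).
    free_slots: dict mapping node name -> number of free CPU slots.
    Returns (placements, waiting): a list of (job name, node) pairs and the
    names of jobs that stay queued because every slot is taken.
    """
    queue = list(jobs)
    heapq.heapify(queue)          # the queue is ordered by rank
    free = dict(free_slots)       # copy so the caller's dict is untouched
    placements, waiting = [], []
    while queue:
        _rank, name = heapq.heappop(queue)
        # "Best available node" here simply means the one with most free slots.
        node = max(free, key=lambda n: free[n]) if free else None
        if node is None or free[node] == 0:
            waiting.append(name)  # nothing free: the job remains in the queue
            continue
        free[node] -= 1
        placements.append((name, node))
    return placements, waiting


if __name__ == "__main__":
    print(dispatch([(2, "b"), (1, "a"), (3, "c")], {"node1": 1, "node2": 1}))
```

With two single-slot nodes and three jobs, the two best-ranked jobs are placed and the third waits, which is the queueing behaviour the slide describes.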

18 You might have been like this...
- User 1 has 100 analyses. They would take 20 hours to run serially on volga, so he moves part of the jobs to hudson...
- Evil user 2 is running heavy programs on hudson that tie up all the CPUs!

19 Life could be easier with cluster
- The user submits 100 analyses to the cluster using SGE
- SGE finds the best available node to run each analysis
- The results could return in less than one hour!
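
A back-of-the-envelope check of the "less than one hour" claim, as a minimal Python sketch. It assumes the 100 analyses are equally long (20 hours / 100 = 12 minutes each), that all 64 CPUs mentioned earlier are free, and that scheduling and I/O overhead are negligible; none of these assumptions is stated in the slides.

```python
import math


def estimated_wall_hours(n_jobs, serial_hours, slots):
    """Estimate wall-clock time when equal-length jobs run in waves on `slots` CPUs."""
    per_job_hours = serial_hours / n_jobs   # time for a single job
    waves = math.ceil(n_jobs / slots)       # rounds needed to run every job
    return waves * per_job_hours


if __name__ == "__main__":
    hours = estimated_wall_hours(n_jobs=100, serial_hours=20, slots=64)
    print(f"{hours:.2f} hours, i.e. {hours * 60:.0f} minutes")
```

Under these assumptions the 100 jobs need two waves of 12 minutes each, about 24 minutes in total, which is consistent with the slide's estimate.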

20 How can cluster help us?
Use SGE to help manage analyses, e.g. running SIG3 with 23 cell lines (90+ runs):
- What we did: three people, each managing 30+ runs (around 3 hours)
- What we could do: one person prepares the SGE jobs in 10 minutes, submits them, and gets the results in half an hour

21 How can cluster help us? (cont'd)
- Use cluster/SGE to speed up analysis:
  - Annotation pipeline
  - SRS indexing
- Explore parallel code:
  - Random forest algorithm, etc.

22 Commonly used SGE cmds
- Submit a job: qsub
- Cancel a job: qdel
- Check my job: qstat -j jobID
- Check queue status: qstat
- Check cluster status: qhost
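
For scripted use, the commands above can be assembled as argument lists and handed to `subprocess.run` on a submit host. The sketch below only builds the command lines; the helper names are hypothetical, while `-cwd`, `-N`, `-o`, `-e`, and `-j` are standard SGE options.

```python
def qsub_cmd(script, name=None, out_dir=None):
    """Command line to submit `script`, running it from the current directory."""
    cmd = ["qsub", "-cwd"]
    if name:
        cmd += ["-N", name]                    # job name shown by qstat
    if out_dir:
        cmd += ["-o", out_dir, "-e", out_dir]  # where stdout/stderr files go
    cmd.append(script)
    return cmd


def qdel_cmd(job_id):
    """Command line to cancel a job."""
    return ["qdel", str(job_id)]


def qstat_cmd(job_id=None):
    """Command line for queue status, or for one job's details when job_id is given."""
    return ["qstat"] + (["-j", str(job_id)] if job_id is not None else [])


def qhost_cmd():
    """Command line to check the status of the cluster hosts."""
    return ["qhost"]


if __name__ == "__main__":
    # On a machine with SGE installed, pass these lists to subprocess.run(...).
    print(qsub_cmd("run_analysis.sh", name="demo", out_dir="logs"))
    print(qstat_cmd(12345))
```

The script name `run_analysis.sh` and the job ID are placeholders for illustration.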

23 SGE Demo
- Ramneek's script: extract_gene.pl
- It took 3.5-4 minutes to run with four genes.
- We will run the script in parallel and complete the analysis in one minute.

24 Demo: script "sge_run"
It is a (4-line) shell wrapper for "extract_gene.pl":
- Split the input file into smaller pieces
- Submit the jobs to SGE
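
The actual "sge_run" wrapper is a shell script that is not reproduced in the slides. The two steps it performs can be sketched in Python as follows, assuming the input file holds one record per line and that extract_gene.pl takes the piece file as its only argument (both are assumptions); `-b y` tells qsub to treat the command as a binary rather than a job script.

```python
from pathlib import Path


def split_input(input_path, n_pieces, out_dir):
    """Split a line-oriented input file into up to n_pieces smaller files."""
    lines = Path(input_path).read_text().splitlines(keepends=True)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    pieces = []
    for i in range(n_pieces):
        chunk = lines[i::n_pieces]   # round-robin: every n-th line
        if not chunk:
            continue                 # fewer lines than pieces
        piece = out / f"piece_{i}.txt"
        piece.write_text("".join(chunk))
        pieces.append(piece)
    return pieces


def submit_commands(pieces, script="extract_gene.pl"):
    """One qsub command per piece; the argument layout of the script is assumed."""
    return [["qsub", "-cwd", "-b", "y", "perl", script, str(p)] for p in pieces]


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "genes.txt"
        src.write_text("gene1\ngene2\ngene3\ngene4\n")
        for cmd in submit_commands(split_input(src, 4, Path(tmp) / "pieces")):
            print(" ".join(cmd))
```

With four genes split into four pieces, each gene becomes its own job, which matches the demo's goal of running them side by side.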

25 Demo: run the script

26 Sun Grid Engine: Limitations
- Embarrassingly parallel: jobs need to be prepared.
- Submission host != execution host. The user needs to redirect output to files in a shared location.
- It is difficult to debug cluster jobs.
- Limited support for automated job recovery.
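
Because a job runs on a different host from the one it was submitted on, output written to a node-local path may be hard to retrieve. A minimal pre-submission check can be sketched as below; the shared mount point `/shared` is a hypothetical example, and the check is purely lexical (it does not resolve symlinks or `..` components).

```python
from pathlib import PurePosixPath


def is_on_shared_storage(path, shared_roots):
    """Return True if `path` lies inside one of the shared directories."""
    p = PurePosixPath(path)
    for root in shared_roots:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return True
    return False


if __name__ == "__main__":
    shared = ["/shared"]  # hypothetical mount visible from every node
    for out in ["/shared/project/run1.log", "/tmp/run1.log"]:
        print(out, is_on_shared_storage(out, shared))
```

A wrapper script could refuse to submit a job whose output path fails this check.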

27 The evil side
Potential threat to the entire system:
- Concurrent requests may overload the file server or database server
- Network traffic (switch, router, etc.)
How to fix?
- User education
- Good management practice
- Admin validates in-house developed code

28 Cluster and SGE: summary
- SGE: powerful, but with limitations
- How can the cluster help us?
- The importance of using the cluster correctly


