SAN DIEGO SUPERCOMPUTER CENTER Blue Gene for Protein Structure Prediction (Predicting CASP Targets in Record Time) Ross C. Walker

The CASP Competition
What is CASP?
- Critical Assessment of Techniques for Protein Structure Prediction (CASP)
- Biennial competition in protein structure prediction
- The “world cup” of protein structure prediction
- CASP 7 ran from 10th May 2006 to 29th Aug 2006
- ca. 100 sequences over 100 days

Protein Structure Prediction (Rosetta)
- Homology Modeling (large sequence alignment)
- Template Based Modeling (some sequence alignment)
- Ab Initio (no appreciable sequence alignment)
The Rosetta code of Prof. David Baker (HHMI) supports all 3 approaches.

Template Based Predictions
- Used for the majority of CASP targets
- Align the sequence with proteins of known structure
- Generate initial “decoy” structures
- Do a Monte Carlo refinement of the structures
- The structures with the lowest energy “should” be the native structure.
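
The refinement step above can be sketched as a Metropolis Monte Carlo loop. This is a minimal illustration under stated assumptions, not Rosetta's actual protocol: `toy_energy` is a hypothetical stand-in for a real scoring function, a "decoy" is reduced to a short list of numbers, and the step count, temperature and step size are arbitrary.

```python
import math
import random


def toy_energy(coords):
    """Hypothetical scoring function: sum of squares, minimum at the origin."""
    return sum(x * x for x in coords)


def refine(decoy, steps, temperature, step_size, rng):
    """Metropolis Monte Carlo: perturb one coordinate, accept or reject the move."""
    current = list(decoy)
    current_e = toy_energy(current)
    best, best_e = list(current), current_e
    for _ in range(steps):
        trial = list(current)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-step_size, step_size)
        trial_e = toy_energy(trial)
        delta = trial_e - current_e
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(current), current_e
    return best, best_e


def refine_decoys(decoys, seed=0):
    """Refine each decoy independently and return (structure, energy) sorted by energy."""
    results = []
    for k, decoy in enumerate(decoys):
        rng = random.Random(seed + k)  # separate random stream per refinement
        results.append(refine(decoy, steps=2000, temperature=0.1,
                              step_size=0.2, rng=rng))
    return sorted(results, key=lambda r: r[1])


if __name__ == "__main__":
    ranked = refine_decoys([[1.0, -2.0, 0.5], [3.0, 3.0, 3.0]])
    print("lowest-energy candidate:", ranked[0])
```

Each refinement is independent of the others, which is why the workload spreads so easily over many processors.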

The Problem
- Many thousands of refinements need to be completed in order to adequately sample phase space.
- The CASP competition is time sensitive:
  - Sequences are released continuously
  - Predictions must be submitted within 3 weeks of sequence release
- Requires access to large computing resources.

SDSC and Rosetta
- A collaboration between SDSC’s Scientific Applications Computing (SAC) group and David Baker
- Scientists from SDSC parallelized the Rosetta code to run on many thousands of processors
- Provided tailored resource allocation on the SDSC Blue Gene and DataStar machines
- Provided the Baker team with access to 2 orders of magnitude more computing power than they had for CASP 6 (2004).

Rosetta Modifications

Modifications Specific to Blue Gene
1) Aggressively account for all memory used.
2) Variable chunk size distribution by the master thread.
3) No global communications: all point to point.
4) Distributed I/O: all tasks read directly from disk and write directly to disk. (No distribution of work packets over the interconnect, which overloads the master thread; only job ID info is sent.)
5) Master generation of a random seed for each slave thread, which ensures no two threads have the same random seed.
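
Items 2, 4 and 5 above can be illustrated with a small single-process simulation. This is a sketch under assumptions, not the Rosetta implementation: the slides do not say which chunking rule was used, so `next_chunk` applies a hypothetical "large chunks first, small chunks last" rule; all function names are invented for illustration; and a round-robin loop stands in for workers requesting work over point-to-point messages (no MPI library is used).

```python
import random


def next_chunk(remaining, n_workers, min_chunk=1):
    """Assumed policy: chunk size shrinks as the pool of remaining jobs drains."""
    chunk = max(min_chunk, remaining // (2 * n_workers))
    return min(chunk, remaining)


def unique_seeds(n_workers, master_seed):
    """The master draws one distinct seed per worker (sampling without replacement)."""
    rng = random.Random(master_seed)
    return rng.sample(range(1, 2**31 - 1), n_workers)


def simulate_dispatch(n_jobs, n_workers, master_seed=12345):
    """Return per-worker seeds and the list of (worker, first_job_id, count) messages.

    Only job IDs travel from master to worker; each worker would read its
    inputs from disk and write its results to disk itself.
    """
    seeds = unique_seeds(n_workers, master_seed)
    assignments = []
    next_job = 0
    worker = 0
    while next_job < n_jobs:
        count = next_chunk(n_jobs - next_job, n_workers)
        assignments.append((worker, next_job, count))
        next_job += count
        worker = (worker + 1) % n_workers  # stand-in for "whichever worker asks next"
    return seeds, assignments


if __name__ == "__main__":
    seeds, assignments = simulate_dispatch(n_jobs=1000, n_workers=8)
    print("first message:", assignments[0])
    print("last message:", assignments[-1])
    print("distinct seeds:", len(set(seeds)))
```

Shrinking chunks keep the number of messages to the master low early in the run while still balancing load near the end, when a single large chunk could leave most processors idle.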

Performance

Rosetta Usage on SDSC Blue Gene
CASP 2006: 1,080,000 SUs used (average run size = 2048 CPUs)
2007 (estimated):
- Protein Structure Prediction: 2,500,000 SUs (4096 CPUs)
- Protein Design: 1,800,000 SUs (2048 CPUs)

A Demonstration
- Successful scaling to >40,000 processors allowed a demonstration to be run at IBM Watson Research Labs
- Ross Walker (SDSC) and Srivatsan Raman (UW) took a CASP target released earlier in the day:
  - Generated initial guesses
  - Submitted the job to all 20 racks of the IBM Watson Blue Gene
  - Ran for 3 hours
  - Generated 120,000 decoys
  - The best candidate was selected and submitted as a CASP prediction the same day.
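
The figures on this slide allow a rough throughput estimate. The rack size below (1024 nodes per rack, 2 processors per node) is an assumption based on the standard Blue Gene/L configuration and is not stated in the slides, although it is consistent with the ">40,000 processors" figure. The estimate also assumes every processor was busy for the full 3 hours.

```python
RACKS = 20                 # from the slide
HOURS = 3                  # from the slide
DECOYS = 120_000           # from the slide
NODES_PER_RACK = 1024      # assumption: standard Blue Gene/L rack
PROCESSORS_PER_NODE = 2    # assumption: both processors on each node used

processors = RACKS * NODES_PER_RACK * PROCESSORS_PER_NODE
processor_hours = processors * HOURS
decoys_per_hour = DECOYS / HOURS
processor_hours_per_decoy = processor_hours / DECOYS

if __name__ == "__main__":
    print("processors:", processors)
    print("processor-hours used:", processor_hours)
    print("decoys per wall-clock hour:", decoys_per_hour)
    print("processor-hours per decoy:", round(processor_hours_per_decoy, 3))
```

Under these assumptions each decoy costs roughly one processor-hour, so the same 120,000 decoys on a 2048-processor allocation would take on the order of 60 hours of wall-clock time.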

Results
CASP 2006 Target T0380
Green = Prediction; Blue = X-Ray; Pink = Initial Template

Results
CASP 2006 Target T0380
Baker team results shown in black.

The Future (1 million CPUs and beyond)
- Hierarchical job distribution system (the single master thread approach will be overloaded).
- On-the-fly detection of failed nodes and error correction.
- Manual buffering of I/O? [Requires more memory per node]
- Parallelization of individual refinements (SMP or MPI options).
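
The first item above, hierarchical job distribution, can be sketched as a two-level tree: a top master hands blocks of job IDs to sub-masters, and each sub-master hands job IDs to its own workers. The slides give no design details, so everything here is hypothetical: the function names, the two-level layout and the static even split (a real system would more likely hand out work dynamically).

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous, nearly equal half-open ranges."""
    base, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def hierarchical_plan(n_jobs, n_submasters, workers_per_submaster):
    """Top master -> sub-masters -> workers.

    Returns {submaster_id: {worker_id: (first_job_id, end_job_id)}}.
    """
    plan = {}
    for s, (lo, hi) in enumerate(split_range(0, n_jobs, n_submasters)):
        plan[s] = dict(enumerate(split_range(lo, hi, workers_per_submaster)))
    return plan


def coordinator_fanout(n_submasters, workers_per_submaster):
    """Number of peers each coordinator exchanges messages with."""
    return {
        "flat_master": n_submasters * workers_per_submaster,
        "top_master": n_submasters,
        "each_submaster": workers_per_submaster + 1,  # its workers plus the top master
    }


if __name__ == "__main__":
    plan = hierarchical_plan(n_jobs=1000, n_submasters=4, workers_per_submaster=8)
    print("sub-master 0, worker 0 gets jobs:", plan[0][0])
    print(coordinator_fanout(1000, 1000))
```

With 1000 sub-masters of 1000 workers each, no coordinator talks to more than about a thousand peers, whereas a single flat master would have to serve a million.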

Acknowledgements
- David Baker (UW)
- Srivatsan Raman (UW)
- John Karanicolas (UW)
- IBM T.J. Watson Research
- SDSC NSF Funded SAC Program