Overview of IU activities in supercomputing, grids, and computational biology Dr. Craig A. Stewart email@example.com Director, Research and Academic Computing, University Information Technology Services and Director, Information Technology Core, Indiana Genomics Initiative 14 May 2003
Outline Some background about IU Overview of IU advanced IT environment –Networking –Visualization –Massive Data Storage –HPC IU efforts in bioinformatics/genomics/etc. IU strategies in HPC And what I’m doing here
IU in a nutshell $2B Annual Budget One university with 8 campuses 90,000 students 3,900 faculty 878 degree programs; > 100 programs ranked within top 20 of their type nationally Nation’s 2nd largest school of medicine Traditional Fußball (in American English: soccer) powerhouse; 5-time NCAA champions State of Indiana is also a national leader in jobs lost!
IT@IU in a nutshell CIO: Vice President Michael A. McRobbie ~$100M annual budget Technology services offered university-wide Networking –IU operates Network Operations Center for Abilene –IU and Purdue University jointly own in-state network High Performance Computing –First university in US to own a 1 TFLOPS supercomputer –Recently achieved 1 TFLOPS with Linpack benchmark on our distributed Linux cluster –Top 500 list has for past several years included at least one IU supercomputer
I-light Network jointly owned by Indiana University and Purdue University 36 fibers between Bloomington and Indianapolis (IU’s main campuses) 24 fibers between Indianapolis and West Lafayette (Purdue’s main campus) Co-location with Abilene GigaPOP Funded by special appropriation from State of Indiana. Expansion to other universities recently funded
Advanced Visualization Laboratory Hardware environments –CAVE (TM) – 3-wall, immersive 3D environment, still based on SGI equipment –Immersadesk (TM) – Furniture-scale immersive 3D environment, still based on SGI equipment –John-E-Box (TM) – IU designed, low-cost passive 3D device. Uses Intel-based computers (typically running Linux). Cost ~$35,000 US to build. Software development –Collaborative software based on constrained navigation –Biomedical research applications
John-E-Box Invented by John N. Huffman, John C. Huffman, and Eric Wernert
Massive Data Storage System Based on HPSS (High Performance Storage System) First HPSS installation with distributed movers; STK 9310 Silos in Bloomington and Indianapolis Automatic replication of data between Indianapolis and Bloomington, via I-light, overnight. Critical for biomedical data, which is often irreplaceable. 180 TB capacity with existing tapes; total capacity of 480 TB. 100 TB currently in use; 1 TB for biomedical data. Common File System (CFS) – disk storage ‘for the masses’
Sun E10000 (Solar) Acquired 4/00 Shared memory architecture ~52 GFLOPS 64 400MHz cpus, 64GB memory > 2 TB external disk Supports some bioinformatics software available only (or primarily) under Solaris (e.g. GCG/SeqWeb) Used extensively by researchers using large databases (db performance, cheminformatics, knowledge management)
IBM Research SP (Aries/Orion Complex) Acquired 9/96, expanded in 1998, 1999, 2000, 2001, 2002 with help of IU IT Strategic Plan funds, IBM SUR grants and INGEN grant from Lilly Endowment, Inc. Geographically distributed at IUB and IUPUI 632 cpus, 1.005 TeraFLOPS First University-owned supercomputer in US to exceed 1 TFLOPS aggregate peak theoretical processing capacity Initially 50th, now 112th in Top 500 supercomputer list (soon to be lower) Distributed memory system with shared memory nodes
More node and configuration detail 2 Logical SPs –Aries complex (Bloomington) 508 Power3+ processors in WH2 nodes (4 processors per node). 8 Frames –Orion complex (Indianapolis) 44 Power3+ processors in WH2 nodes. 1 frame 64 Power3+ processors in NH2 nodes. 1 frame 32 Power4 processors in one Regatta node. 1 frame –0.45 TB RAM –5.3 TB hard disk (GPFS) Runs under one LoadLeveler instance – appears to be a single resource to users
Usage of SP AIX 5.1, wealth of software including SAS, SPSS, S-Plus, Mathematica, Matlab, Maple, Gaussian, GIS, scientific/numerical libraries, Oracle and DB2, and more All graduate students who do anything serious with computers have an account on the SP. Scripts to make system more user friendly Broad usage of SP is a key part of IU’s HPC strategy Single-user time available Have also combined this system with Purdue University’s SP for grid demonstrations
AVIDD (Analysis and Visualization of Instrument-Driven Data) Project funded largely by the National Science Foundation (NSF), with additional funds from Indiana University and a Shared University Research grant from IBM, Inc.
AVIDD Project Hardware components: –Distributed Linux cluster Three locations: IU Northwest, Indiana University Purdue University Indianapolis, IU Bloomington 2.164 TFLOPS, 0.5 TB RAM, 10 TB Disk Tuned, configured, and optimized for handling real-time data streams –A suite of distributed visualization environments –Massive data storage Usage components: –Research by application scientists –Research by computer scientists –Education
Goals for AVIDD Create a massive, distributed facility ideally suited to managing the complete data/experimental lifecycle (acquisition to insight to archiving) Focused on modern instruments that produce data in digital format at high rates. Example instruments: –Advanced Photon Source, Advanced Light Source –Atmospheric science instruments in forest –Gene sequencers, expression chip readers
Goals for AVIDD, Cont’d Performance goals: –Two researchers should be able simultaneously to analyze 1 TB data sets (along with other smaller jobs running) –The system should be able to give (nearly) immediate attention to real-time computing tasks, while still running at high rates of overall utilization –It should be possible to move 1 TB of data from HPSS disk cache into the cluster in ~2 hours Science goals: –The distribution of 3D visualization environments in scientists’ labs should enhance the ability of scientists to spontaneously interact with their data. –Ability to manage large data sets should no longer be an obstacle to scientific research –AVIDD should be an effective research platform for cluster engineering R&D as well as computer science research
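The data-movement goal above implies a sustained transfer rate that is easy to work out. The following is a minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and exactly 2 hours; the function and variable names are illustrative, not part of any AVIDD tooling.

```python
# Sustained rate needed to move 1 TB from HPSS disk cache in about 2 hours.
# Assumption: decimal units (1 TB = 10**12 bytes); names are illustrative only.

TERABYTE = 10**12          # bytes
TARGET_SECONDS = 2 * 3600  # the "~2 hours" goal, taken as exactly 2 hours


def required_bytes_per_second(total_bytes: float, seconds: float) -> float:
    """Average rate that must be sustained to finish within the time budget."""
    return total_bytes / seconds


rate = required_bytes_per_second(TERABYTE, TARGET_SECONDS)
megabytes_per_second = rate / 1e6
gigabits_per_second = rate * 8 / 1e9

if __name__ == "__main__":
    print(f"{megabytes_per_second:.1f} MB/s, {gigabits_per_second:.2f} Gbit/s")
```

Under these assumptions the goal works out to roughly 139 MB/s, or a little over 1 Gbit/s, sustained end to end.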
More details on Linux Cluster AVIDD-N: IU Northwest: 18 1.3 GHz PIII processors. This cluster is for instructional use at the IU Northwest campus. (Funded primarily via a Shared University Research grant from IBM.) AVIDD-B and AVIDD-I: Two identical clusters, each with 208 2.4 GHz Prestonia processors. Each cluster has three types of nodes: head nodes, storage nodes, and compute nodes. (Servers: IBM x335) AVIDD-I64: 36 1.0 GHz Itanium processors (Servers: IBM Tiger) Myrinet2000, Gbit, and 100bT networks within cluster. Non-routing network using Force10 equipment between Bloomington and Indianapolis
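The peak theoretical figures quoted for AVIDD can be reconstructed from the processor counts and clock rates listed above. The sketch below assumes peak = processors × clock × floating-point operations per cycle; the per-cycle values (1 for PIII, 2 for Prestonia, 4 for Itanium) are assumptions chosen because they reproduce the quoted 998.4 GFLOPS per cluster and 2.164 TFLOPS total, and they are not stated in the slides.

```python
# Peak theoretical GFLOPS = processors x clock (GHz) x flops per cycle.
# Assumption: the flops-per-cycle values below are not given in the slides;
# they are the values that reproduce the totals the slides do quote.

CLUSTERS = {
    # name: (processors, clock in GHz, assumed flops per cycle)
    "AVIDD-N": (18, 1.3, 1),    # PIII
    "AVIDD-B": (208, 2.4, 2),   # Prestonia
    "AVIDD-I": (208, 2.4, 2),   # Prestonia
    "AVIDD-I64": (36, 1.0, 4),  # Itanium
}


def peak_gflops(processors: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Peak theoretical rate in GFLOPS for one cluster."""
    return processors * clock_ghz * flops_per_cycle


PEAKS = {name: peak_gflops(*spec) for name, spec in CLUSTERS.items()}
TOTAL_GFLOPS = sum(PEAKS.values())

if __name__ == "__main__":
    for name, gflops in PEAKS.items():
        print(f"{name}: {gflops:.1f} GFLOPS")
    print(f"Total: {TOTAL_GFLOPS / 1000:.3f} TFLOPS")
```

With these assumed per-cycle values, AVIDD-B and AVIDD-I each come to 998.4 GFLOPS and the four clusters together come to about 2.164 TFLOPS.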
Linux Cluster Software –GPFS (General Parallel File System; proprietary, from IBM) –System management system from IBM –Maui Scheduler –PBS Pro –LAM/MPI –Redhat Linux
1 TFLOPS Achieved on Linpack! AVIDD-I and AVIDD-B together have a peak theoretical capacity of 1.997 TFLOPS. We have just achieved 1.02 TFLOPS on the Linpack benchmark for this distributed system. Details: –Force10 switches, non-routing 20 GB/Sec network connecting AVIDD-I and AVIDD-B (~90 km distance) –LINPACK implementation from University of Tennessee called HPL (High Performance LINPACK), ver 1.0 (http://www.netlib.org/benchmark/hpl/). Problem size we used is 220000, and block size is 200. –LAM/MPI 6.6 beta development version (3/23/2003) –Tuning: block size (optimized for smaller matrices, and then seemed to continue to work well), increased the default frame size for communications, fiddled with number of systems used, rebooted entire system just before running benchmark (!)
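To give a feel for what a problem size of 220000 means, the sketch below estimates the memory needed for the matrix and the approximate run time at the achieved rate. It assumes 8-byte double-precision entries and uses only the leading-order operation count for LU factorization, (2/3)N^3, ignoring lower-order terms; the names are illustrative.

```python
# Rough sizing of a Linpack run with N = 220000 at the achieved 1.02 TFLOPS.
# Assumptions: 8-byte doubles; operation count approximated by (2/3) * N**3.

N = 220_000
BYTES_PER_DOUBLE = 8
ACHIEVED_FLOPS = 1.02e12  # 1.02 TFLOPS, as reported


def matrix_gigabytes(n: int) -> float:
    """Memory for a dense n x n double-precision matrix, in decimal GB."""
    return n * n * BYTES_PER_DOUBLE / 1e9


def lu_flops(n: int) -> float:
    """Leading-order floating-point operation count for LU factorization."""
    return (2.0 / 3.0) * n**3


def runtime_hours(n: int, flops_per_second: float) -> float:
    """Approximate wall-clock time at a given sustained rate."""
    return lu_flops(n) / flops_per_second / 3600


if __name__ == "__main__":
    print(f"Matrix: {matrix_gigabytes(N):.1f} GB")
    print(f"Run time: {runtime_hours(N, ACHIEVED_FLOPS):.2f} hours")
```

Under these assumptions the matrix occupies roughly 387 GB spread across the two clusters, and a single run takes a little under two hours.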
Cost of grid computing on performance Each of the two clusters alone achieved 682.5 GFLOPS, or 68% of peak theoretical of 998.4 GFLOPS per cluster The aggregate distributed cluster achieved 1.02 TFLOPS out of 1.997, or 51% of peak theoretical
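The percentages above, and the cost of distribution they imply, follow directly from the reported figures. This is a minimal sketch of that arithmetic; the "retained" ratio (distributed result divided by the sum of the two standalone results) is a derived quantity added here for illustration, not a number from the slides.

```python
# Linpack efficiency for one cluster alone versus the distributed pair.
# All inputs are the figures reported on the slide, in GFLOPS.

SINGLE_ACHIEVED = 682.5
SINGLE_PEAK = 998.4
COMBINED_ACHIEVED = 1020.0  # 1.02 TFLOPS
COMBINED_PEAK = 1997.0      # 1.997 TFLOPS


def efficiency(achieved_gflops: float, peak_gflops: float) -> float:
    """Fraction of peak theoretical performance actually achieved."""
    return achieved_gflops / peak_gflops


single_eff = efficiency(SINGLE_ACHIEVED, SINGLE_PEAK)
combined_eff = efficiency(COMBINED_ACHIEVED, COMBINED_PEAK)

# Derived for illustration: how much of the two standalone results survives
# when the clusters are run as one system over the wide-area link.
retained = COMBINED_ACHIEVED / (2 * SINGLE_ACHIEVED)

if __name__ == "__main__":
    print(f"single: {single_eff:.0%}, combined: {combined_eff:.0%}, "
          f"retained: {retained:.0%}")
```

By this measure the distributed run keeps about three quarters of the throughput that the two clusters deliver when benchmarked separately.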
Real-time pre-emption of jobs High overall rate of utilization, while able to respond ‘immediately’ to requests for real-time data analysis. System design –Maui Scheduler: support multiple QoS levels for jobs –PBSPro: support multiple QoS, and provide signaling for job termination, job suspension, and job checkpointing –LAM/MPI and Redhat: kernel-level checkpointing Options to be supported: –cancel and terminate job –re-queue job –signal, wait, and requeue job –checkpoint job (as available) –signal job (used to send SIGSTOP/SIGCONT)
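The "signal job" option can be illustrated at the level of a single process. The sketch below is not how Maui or PBSPro implement suspension; it only shows the underlying POSIX mechanism, pausing a process with SIGSTOP and resuming it with SIGCONT. The sleeping child process stands in for a hypothetical low-priority job, and the example assumes a POSIX system such as Linux.

```python
# Suspend and resume a process with POSIX signals (SIGSTOP / SIGCONT).
# Assumption: POSIX system; the child process is a hypothetical stand-in job.
import os
import signal
import subprocess
import sys


def suspend_job(pid: int) -> None:
    """Pause a process. SIGSTOP cannot be caught or ignored by the target."""
    os.kill(pid, signal.SIGSTOP)


def resume_job(pid: int) -> None:
    """Let a stopped process continue where it left off."""
    os.kill(pid, signal.SIGCONT)


def run_demo() -> list:
    """Start a stand-in job, suspend it, resume it, then cancel it."""
    events = []
    job = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )

    suspend_job(job.pid)
    _, status = os.waitpid(job.pid, os.WUNTRACED)  # wait until it has stopped
    if os.WIFSTOPPED(status):
        events.append("suspended")

    resume_job(job.pid)
    _, status = os.waitpid(job.pid, os.WCONTINUED)  # wait until it continues
    if os.WIFCONTINUED(status):
        events.append("resumed")

    job.terminate()  # the "cancel and terminate job" option
    job.wait()
    events.append("terminated")
    return events


if __name__ == "__main__":
    print(run_demo())
```

A real-time request would be served in the window between the suspend and the resume; because the stopped job keeps its memory, suspension is only practical when the node has room for both workloads.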
And what actually works in the present PBSPro supports QoS with job suspension Performance tests so far by Dr. Beth Plale, IU Department of Computer Science –PBSPro job suspension: ~1.3 Seconds –RedHat Kernel: ~0.1 Seconds using RSH, ~0.8 with remote requests coming in via ssh. –Details available at http://www.iupui.edu/~ilight/proceedings/presentations/ilight_wrkshp02_plale.pdf
What we’ve learned/done with AVIDD so far Costs/benefits of distribution: performance, power, disaster recovery capabilities Have enabled new computer science research and new applications science research (esp. high-energy physics). Performance is surprisingly good on benchmarks Distributed visualization environments are useful 4-way stripe performance with HPSS so far inadequate (http://php.indiana.edu/~haiyang/demo.html) More “person-intensive” than our traditional supercomputers The overall design – spending $$s on interconnect and on spinning disk – has produced a system well balanced for its intended tasks
Grid Computing at IU Comment: dangers of “global Globusization” IU Grid projects: –AVIDD, SP, etc. computer engineering –CCAT – Common Component Architecture Toolkit –xPort – “better than being there” remote utilization of instruments –Scientist’s notebook –ATLAS, iVDGL Pervasive Technology Laboratories (http://www.pervasivetechnologylabs.iu.edu/) –Geoffrey Fox – Community Grids –Andrew Lumsdaine, Dennis Gannon – Open Systems –Polly Baker – Visualization and Interactive Spaces –Steven Wallace – Advanced Network Management Lab –D.F. (Rick) McMullen – Knowledge Acquisition and Projection What is a grid?
And now some thoughts about HPC and biomedical computing
Bioinformatics and Biomedical Research Bioinformatics, Genomics, Proteomics, ____ics all promise to radically change our understanding of biological function and the way biomedical research is done. Traditional biomedical researchers must take advantage of new possibilities Computer-oriented researchers must take advantage of the tremendous store of detailed knowledge held by traditional biomedical researchers
Anopheles gambiae From www.sciencemag.org/feature/data/mosquito/mtm/index.html Source Library:Centers for Disease Control PHIL Photo Credit:Jim Gathany
Indiana Genomics Initiative (INGEN) INGEN was created by a $105M grant from the Lilly Endowment, Inc. and launched December, 2000 Build on traditional strengths of IU School of Medicine Build on IU's strength in Information Technology Add new programs of research made possible by the sequencing of the human genome Perform the research that will generate new treatments for human disease in the post-genomic era Improve human health generally and in the State of Indiana particularly Enhance economic growth in Indiana
INGEN Structure Programs –Bioethics –Genomics –Bioinformatics –Medical Informatics –Education –Training Cores –Tech Transfer –Gene Expression –Cell & Protein Expression –Human Expression –Information Technology –Proteomics –Integrated Imaging –In vivo Imaging –Animal
Information Technology Core Foci: –High Performance Computing –Visualization (esp. 3D) –Massive Data Storage –Support for use of all of the above $6.7M budget for IT Core Baseline IT services for School of Medicine responsibility of School of Medicine CIO
Challenges for UITS and the INGEN IT Core Assist traditional biomedical researchers in adopting use of advanced information technology (massive data storage, visualization, and high performance computing) Assist bioinformatics researchers in use of advanced computing facilities Questions we are asked: –Why wouldn't it be better just to buy me a newer PC? Questions we ask: –What do you do now with computers that you would like to do faster? –What would you do if computer resources were not a constraint?
INGEN IT Core Support Staff Visualization programmer, HPC programmer, and bioinformatics database specialist hired to support INGEN Staff added to existing management units within UITS –economy of scale (management, exchange of expertise) –Assures addition rather than substitution for base-funded consulting support
So, why is this better than just buying me a new PC? Unique facilities provided by IT Core –Redundant data storage –HPC – better uniprocessor performance; trivially parallel programming, parallel programming –Visualization in the research laboratories Hardcopy document – INGEN's advanced IT facilities: The least you need to know Outreach efforts Demonstration projects
Example projects Multiple simultaneous Matlab jobs for brain imaging. Installation of many commercial and open source bioinformatics applications. Site licenses for several commercial packages Evaluation of several software products that were not implemented.
Creation of new software Gamma Knife – Penelope. Modified existing version for more precise targeting with IU's Gamma Knife. Karyote (TM) Cell model. Developed a portion of the code used to model cell function. http://biodynamics.indiana.edu/ PiVNs. Software to visualize human family trees. 3-DIVE (3D Interactive Volume Explorer). http://www.avl.iu.edu/projects/3DIVE/ fastDNAml – maximum likelihood phylogenies (http://www.indiana.edu/~rac/hpc/fastDNAml/index.html) Protein Family Annotator – collaborative development with IBM, Inc.
Data Integration Goal set by IU School of Medicine: Any researcher within the IU School of Medicine should be able to transparently query all relevant public external data sources and all sources internal to the IU School of Medicine to which the researcher has read privileges IU has more than 1 TB of biomedical data stored in massive data storage system There are many public data sources Different labs were independently downloading, subsetting, and formatting data Solution: IBM DiscoveryLink, DB2 Information Integrator
Centralized Life Science Database (CLSD) Based on use of IBM DiscoveryLink (TM) and DB2 Information Integrator (TM) Public data is still downloaded, parsed, and put into a database, but now the process is automated and centralized. Lab data and programs like BLAST are included via DiscoveryLink’s wrappers. Implemented in partnership with IBM Life Sciences via IU-IBM strategic relationship in the life sciences IU contributed writing of data parsers
INGEN IT Status Overall So far, so good 108 users of IU’s supercomputers 104 users of massive data storage system Six new software packages created or enhanced, more than 20 packages installed for use by INGEN-affiliated researchers Three software packages made available as open source software as direct result of INGEN. Opportunities for tech transfer due to use of the GNU Lesser General Public License (LGPL). The INGEN IT Core is providing services valued by traditionally trained biomedical researchers as well as researchers in bioinformatics, genomics, proteomics, etc. Work on Penelope code for Gamma Knife likely to be first major transferable technology development. Stands to improve efficacy of Gamma Knife treatment at IU Excellent success in supporting basic research Participation in grants and industrial partnerships provides economic benefit for IU
INGEN IT Success factors Creation of new position, Chief Information Officer and Associate Dean, within IU School of Medicine, and significant improvement in basic IT infrastructure within the IU School of Medicine INGEN has permitted IU to build on excellent IT infrastructure Dedicated (but not isolated) staff supporting INGEN researchers Commitment to customer service Outreach (in the proper formats) Scientific collaborations Strategy research on behalf of IU School of Medicine Accountability Leveraging of industrial partnerships
HPC strategies at IU HPC viewed as a critical area in IU’s leadership in IT State and regional private charitable trusts have been convinced to invest heavily in IT Active outreach to encourage broad representation of disciplines in use of HPC (little engineering at IU!) Careful attention to surveys of our customers Focus on industrial partnerships – especially research collaborations and hardware grants Focus on tech transfer/outreach – including creation of software licensed under the terms of the GNU Lesser General Public License (LGPL) More information on this at http://www.indiana.edu/~rac/siguccs_copyright.html
Research in Indiana at SCxy www.research-indiana.org
And a bit about what I’m doing here: Teaching “Einführung in die Bioinformatik” (Introduction to Bioinformatics) Information exchange / looking for opportunities for future collaboration Working with HLRS staff to build up a suite of the most important bioinformatics software here Working on a research project about Biogrids Enjoying the hospitality and the opportunity to interact with many leading German experts in HPC!
Funding Support This research was supported in part by the Indiana Genomics Initiative (INGEN). The Indiana Genomics Initiative (INGEN) of Indiana University is supported in part by Lilly Endowment Inc. Joint Study Agreement with IBM, Inc. Protein Family Annotator: School of Informatics - M Dalkilic, Center for Genomics and Bioinformatics - P Cherbas, Univ. Information Technology Services & INGEN IT Core - C Stewart. This work was supported in part by Shared University Research grants from IBM, Inc. to Indiana University. This material is based upon work supported by the National Science Foundation under Grant No. 0116050 and Grant No. CDA-9601632. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF). And of course thanks to Dr. Michael Resch, Director, HLRS, for inviting me here, and thanks to the many people that I have met and learned from already!
Acknowledgements (People) UITS Research and Academic Computing Division managers: Mary Papakhian, David Hart, Stephen Simms, Richard Repasky, Matt Link, John Samuel, Eric Wernert, Anurag Shankar INGEN Staff: Andy Arenson, Chris Garrison, Huian Li, Jagan Lakshmipathy, David Hancock UITS Senior Management: Associate Vice President and Dean Christopher Peebles, RAC(Data) Director Gerry Bernbom Assistance with this presentation: John Herrin, Malinda Lingwall
Additional Information Further information is available at –ingen.iu.edu –http://www.indiana.edu/~uits/rac/ –http://cgb.indiana.edu/ –http://www.ncsc.org/casc/paper.html –http://www.indiana.edu/~rac/staff_papers.html