October 26, 2001 Supercomputers for BioInformatics and The Grid Raj Godhia, Consultant, Cray Inc., c/o Mega Computing (S) Pte Ltd
Cray-NCI Announcement
CRAY INC. AND NATIONAL CANCER INSTITUTE COLLABORATE ON MORE-POWERFUL BIOINFORMATICS RESEARCH TOOLS
SEATTLE--(BUSINESS WIRE)--July 9
Goal is to Exploit Unique Supercomputer Technologies to Identify and Analyze Genes Involved in Cancer and Other Diseases; Demonstration Project Produces Full STR Mapping of Genome
Cray Inc. (Nasdaq:CRAY) today announced it is collaborating with the National Cancer Institute (NCI) to develop bioinformatics research tools substantially more powerful than those available today. Bioinformatics is a high-potential market that involves applying computer technology to biology and medicine.
By exploiting several unique, ultra-fast technologies originally designed into Cray supercomputers for classified government use, the NCI and Cray are working to create genome analysis software capable of identifying and analyzing genes involved in cancer and other diseases.
In an initial demonstration project, scientists at the NCI's Advanced Biomedical Computing Center in Frederick, Md., produced a comprehensive map of short tandem repeat sequences (STRs), often used as gene markers, for the entire human genome. Using the Cray SV1(TM) supercomputer located at the NCI, computations that previously took hours are being completed in seconds. This will enable biologists to do full-scale analyses that previously were impractical, Cray officials said.
"In preliminary testing, the unique technologies available on Cray vector supercomputers have provided enormous speed-ups for full-scale analysis of some common types of bioinformatics problems," said Bill Long, Cray's chief collaborator for the NCI work. "Assuming this validation continues, we believe there is a potential to make full-scale, exhaustive analysis of many bioinformatics problems feasible for the first time."
Although exhaustive analysis typically produces results that are more complete and reliable than methods based on statistical sampling, he said, to date exhaustive analysis has been too slow and expensive to use routinely.
Short tandem repeats, also known as microsatellites, are repetitive sequences of DNA that scientists have exploited for several years as tools to map new genes, study the structure of chromosomes, and compare the DNA of different species, all of which are major areas of interest in biology and medical research.
Other bioinformatics software tools under development in the NCI-Cray collaboration include: non-tandem repeats, EST cluster assembly, CG island detection, genome assembly from BAC clones, SNP (single nucleotide polymorphism) analysis, and the extension to protein sequences for proteomic applications.
"We are excited about the initial results of our collaboration with the NCI and optimistic about the larger potential for applying our unique technologies in the field of bioinformatics," said Jim Rottsolk, Cray Inc. chairman and CEO.
Cray SV1 supercomputer systems start at under $1 million (U.S. list), are air cooled and fit easily into office environments.
About NCI's Advanced Biomedical Computing Center
The NCI's Advanced Biomedical Computing Center (Frederick, Md.) serves 1,800 biological researchers worldwide. Using a Cray supercomputer, ABCC played a critical role in solving the 3-D structure of HIV-1 protease, an enzyme that HIV utilizes to infect human immune cells. With the 3-D structure clarified, scientists were able to design highly effective protease inhibitors that are now the mainstay of AIDS therapy. For this work, ABCC was named a finalist for the prestigious Computerworld Smithsonian science award in 2000.
National Cancer Institute – Cray Collaboration
Use the special hardware features of the Cray SV1 cluster to address genomic and proteomic issues.
Integrate genomic, post-genomic, and proteomic methods to provide insights into the mechanism of cancer.
NCI is making results such as the STR Database available via the web.
NCI’s Advanced Biomedical Computing Center
System diagram labels: Cray J90SE 16PE 1GW; Cray SV1 96PE 12GW; Cray J90 8PE 256MW; GigaRing; Parallel Vector Environment; Origin PE 32Gb; SGI Servers; Compaq 8400; IBM SP2; Storagetek Tape Silo; Workstations and File Servers
What Is an STR, and Why Do I Care?
STR (Short Tandem Repeat)
– String of ‘n’ letters (nucleotides) repeated ‘m’ times (‘m’ usually >6): ATATATATATATAT
Why STRs are important
– They can be associated with gene locations, diseases, and other important biology
– They can affect the accuracy of algorithms used to assemble the genome
– They are used for forensic identification
– …
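The definition above (a motif of ‘n’ letters repeated ‘m’ times in a row) can be illustrated with a small Python sketch. This is not the Cray/NCI kernel, which the slides do not describe. The function name `find_strs`, the `TandemRepeat` record, the motif length range of 2 to 8, and the default of at least 7 copies (read from "usually >6") are all assumptions made for illustration.

```python
from typing import Iterator, NamedTuple


class TandemRepeat(NamedTuple):
    start: int   # 0-based index where the run of copies begins
    unit: str    # the repeated motif, e.g. "AT"
    copies: int  # number of back-to-back copies of the motif


def find_strs(seq: str, min_unit: int = 2, max_unit: int = 8,
              min_copies: int = 7) -> Iterator[TandemRepeat]:
    """Yield runs where a motif of length n occurs at least min_copies times in a row.

    Hypothetical helper: a plain scan, one pass per motif length.
    """
    seq = seq.upper()
    for n in range(min_unit, max_unit + 1):
        i = 0
        while i + n <= len(seq):
            unit = seq[i:i + n]
            # Skip motifs that are themselves repeats of a shorter motif
            # (e.g. "ATAT" is just "AT" twice), so each run is reported once.
            if unit in (unit + unit)[1:-1]:
                i += 1
                continue
            copies = 1
            while seq[i + copies * n:i + (copies + 1) * n] == unit:
                copies += 1
            if copies >= min_copies:
                yield TandemRepeat(i, unit, copies)
                i += copies * n  # jump past the run just reported
            else:
                i += 1


for repeat in find_strs("ATATATATATATAT"):
    print(repeat)
```

On the slide's example string this reports a single run: the motif AT, starting at position 0, with 7 copies. Jumping past each reported run keeps the sketch simple, at the cost of possibly missing a different repeat that begins inside the tail of that run.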
Human Genome
> 3 Billion Base Pairs of Nucleotides
All Short Tandem Repeats (2-8) found in <10 minutes on 1 CPU of a Cray SV1; 150 sec on 15 CPUs of an SV1e.
NCI believes such methodologies show great promise for genome analysis and proteomics.
Unique Cray Features
Several capabilities, not just one
– Unique, hard-to-replicate combination of hardware features
– Benefits from applying multiple processors (CPUs)
Originally created for intelligence community
– ~100x faster than anything else for classified problems
– Key bioinformatics problems look like classified problems
Bioinformatics ‘connection’ was serendipitous
– One clever individual
Resident in Cray SV1, MTA-2, SV2
– Experience to date is with SV1 series
Cray SV1™ Supercomputer
SV1 Kernel Performance
Nucleotide encoding: 600M characters/sec.
Difference counting: 200M starting points/sec.
– For a 32 nucleotide sequence, this would be 6.4G nucleotides/second
Reverse complement: 4G nucleotides/sec.
– For example, the complete human genome can be reverse complemented in about 1 second
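To make the kernel names above concrete, here is a minimal Python sketch of nucleotide encoding and difference counting, followed by a check of the slide's arithmetic. The slides do not say how the Cray kernels work, so the 2-bit code (A=0, C=1, G=2, T=3), the reading of "difference counting" as a mismatch count at every starting point, and the names `encode`, `mismatches_packed`, and `difference_counts` are all assumptions.

```python
from typing import List

# Assumed 2-bit code; the slides do not say which encoding the Cray kernels use.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq: str) -> int:
    """Pack nucleotides into one integer, two bits per base, first base highest."""
    packed = 0
    for base in seq.upper():
        packed = (packed << 2) | CODE[base]
    return packed


def mismatches_packed(a: int, b: int, length: int) -> int:
    """Count positions where two packed sequences of the same length differ."""
    diff = a ^ b                        # a 2-bit slot is nonzero iff the bases differ
    low_bits = (4 ** length - 1) // 3   # binary 0101...01, one set bit per slot
    return bin((diff | (diff >> 1)) & low_bits).count("1")


def difference_counts(query: str, target: str) -> List[int]:
    """For every starting point in target, count mismatches against query."""
    m = len(query)
    q = encode(query)
    return [mismatches_packed(q, encode(target[s:s + m]), m)
            for s in range(len(target) - m + 1)]


print(difference_counts("ACGT", "ACGTACGA"))  # [0, 4, 4, 4, 1]

# Slide arithmetic: 200M starting points/sec, each compared over 32 nucleotides.
starts_per_sec = 200_000_000
query_length = 32
print(starts_per_sec * query_length)  # 6400000000, i.e. 6.4G nucleotides/sec

# Reverse complement at 4G nucleotides/sec over roughly 3 billion base pairs.
genome_bases = 3_000_000_000
revcomp_per_sec = 4_000_000_000
print(genome_bases / revcomp_per_sec)  # 0.75 seconds, "about 1 second"
```

Packing two bits per base lets a single XOR compare many bases at once, which is one plausible reason an encoding step appears among the kernels. The sketch re-encodes each window for clarity rather than speed.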
Performance Comparisons
Source: NCI
Kernel Status & Plans
Available:
– Nucleotide encoding
– Reverse complement (turn ACCTG into CAGGT)
– Difference count
– Tandem repeat search
In progress:
– Amino acid encoding & comparison scoring
– Nucleotide sorting (for non-tandem repeats)
– Higher level drivers
– …
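The reverse complement example in the list above (ACCTG into CAGGT) can be reproduced with a few lines of Python. This is a plain string-based sketch, not the vectorized Cray kernel. The function name `reverse_complement` is hypothetical, and only the four standard bases are handled.

```python
# Map each base to its Watson-Crick partner (A<->T, C<->G), keeping case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ACCTG"))  # CAGGT
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.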
Supercomputing and The Grid
Several organizations in Asia intend to implement GLOBUS on Cray SV1 systems and make them available to BioInformatics users
Cray systems will play a major role on The Grid
– Supercomputer centers like SDSC have always provided service to remote users
Some organizations are confronting implementation issues running “coupled” jobs on The Grid using distributed memory techniques
– Shared memory supercomputers may play an important role as “couplers” for Grid-based distributed applications