What is The National Center for Genome Analysis Support?
A Pervasive Technology Institute Center

NCGAS is a national center dedicated to providing scientists easy and ready access to the software and supercomputers necessary for the important work of genomics research.
Initially funded by the National Science Foundation Advances in Biological Informatics (ABI) program, grant #
Provides access to memory-rich supercomputers customized for genomics studies, including Mason and other XSEDE systems.
A Cyberinfrastructure Service Center affiliated with the Pervasive Technology Institute at Indiana University
Provides distributions of hardened versions of popular codes
Particularly dedicated to genome assembly software such as:
de Bruijn graph methods: SOAPdeNovo, Velvet, ABySS
consensus methods: Celera, Arachne 2
For more information, see

Mason – an HP ProLiant DL580 G7 provided by NCGAS
16 node cluster
10GE interconnect
– Cisco Nexus 7018
– Compute nodes are oversubscribed 4:1
– This is the same switch that we use for DC and other 10G connected equipment.
Quad socket nodes
– 8 core Xeon L7555, 1.87 GHz base frequency
– 32 cores per node
– 512 GByte of memory per node!
Rated at TFLOPs (G-HPL benchmark)

NCGAS Sandbox Demo at Supercomputing 11
STEP 1: data pre-processing, to evaluate and improve the quality of the input sequence
STEP 2: sequence alignment to a known reference genome
STEP 3: SNP detection to scan the alignment result for new polymorphisms

Early Users: Metagenomics Sequence Analysis
Yuzhen Ye Lab (IU Bloomington School of Informatics)

Early Users: Genome Assembly and Annotation
Michael Lynch Lab (IU Bloomington, Department of Biology)
Assembles and annotates genomes in the Paramecium aurelia species complex in order to eventually study the evolutionary fates of duplicate genes after whole-genome duplication.
This project has also been performing RNAseq on each genome, which is currently used to aid in genome annotation and subsequently to detect expression differences between paralogs.
The assembler used is based on an overlap-layout-consensus method instead of a de Bruijn graph method (like some of the newer assemblers). It is more memory intensive because it requires performing pairwise alignments between all pairs of reads.
The annotation of the genome assemblies involves programs such as GMAP, GSNAP, PASA, and Augustus. To use these programs, we need to load millions of RNAseq and EST reads and map them back to the genome.

Early Users: Genome Informatics for Animals and Plants
Genome Informatics Lab (IU Bloomington Department of Biology)
This project aims to find genes in animals and plants, using the vast amounts of new gene information coming from next-generation sequencing technology. The resulting gene-finding improvements are applied to newly deciphered genomes for an environmental sentinel animal, the water flea (Daphnia), the agricultural pest insect pea aphid, the evolutionarily interesting jewel wasp (Nasonia), and the chocolate plant (Theobroma cacao), which will bring genomics to sustainable agriculture of cacao.
Large-memory compute systems are needed for biological genome and gene transcript assembly because assembling genomic DNA or gene RNA sequence reads (in billions of fragments) into full genomic or gene sequences requires a minimum of 128 GB of shared memory, and more depending on the data set. These programs build graph matrices of sequence alignments in memory.

Early Users: Imputation of Genotypes and Sequence Alignment
Tatiana Foroud Lab (IU School of Medicine, Medical and Molecular Genetics)
Studies complex disorders using imputation of genotypes, typically for genome-wide association studies, as well as sequence alignment and post-processing of whole-genome and whole-exome sequencing data.
Imputation requires analysis of markers in a genetic region (such as a chromosome) in several hundred representative individuals genotyped for the full reference panel of SNPs, with extrapolation of the inferred haplotype structures.
More memory allows the imputation algorithms to evaluate haplotypes across much broader genomic regions, reducing or eliminating the need to partition the chromosomes into segments. This increases the accuracy and speed of genotype imputation, allowing for improved evaluation of detailed within-study results as well as communication and collaboration (including meta-analysis) with other researchers using the disease study results.

Early Users: Daphnia Population Genomics
Michael Lynch Lab (IU Bloomington Department of Biology)
This project involves the whole-genome shotgun sequences of more than 20 diploid genomes with genome sizes >200 Megabases each. With each genome sequenced to over 30x coverage, the full project involves both the mapping of reads to a reference genome and the de novo assembly of each individual genome.
Assembling millions of short reads often requires very large amounts of memory, for which we once turned to Dash at SDSC. With Mason now online at IU, we have been able to run our assemblies and analysis programs here at IU.

IU's NCGAS partners include the Texas Advanced Computing Center (TACC) and the San Diego Supercomputer Center (SDSC). NCGAS will support software running on supercomputers at TACC and SDSC, as well as other supercomputers available as part of XSEDE (the new NSF-funded Extreme Science and Engineering Discovery Environment). NCGAS will further campus-based integration, known as "campus bridging."

Thomas G. Doak, Le-Shin Wu, Craig A. Stewart, Robert Henschel, William K. Barnett

A specific goal is to provide dedicated access to large-memory supercomputers, such as IU's new Mason system.
Each Mason compute node has 512 GB of random access memory, critical for data-intensive science applications such as genome assembly.

Environmental sequencing
– Sampling DNA sequences directly from the environment
– Since the sequences consist of DNA fragments from hundreds or even thousands of species, the analysis is far more difficult than traditional sequence analysis that involves only one species.

Assembling metagenomic sequences and deriving genes from the dataset
Dynamic programming is used to optimally map consecutive contigs from the assembly. Since the number of contigs is enormous for most metagenomic datasets, a large-memory computing system is required so that the dynamic programming algorithm can hold its tables in memory and complete the task in polynomial time.

For Indiana University’s Supercomputing 11 research sandbox demo, NCGAS implemented a biological application to simulate a sequence alignment and SNP (single nucleotide polymorphism) identification pipeline (shown above). The goal is to demonstrate that, with a network bridge between NCGAS computing nodes at IU and a remote storage file system, we are able to conduct a data-intensive pipeline without repetitive data file movement.
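
The three steps of the sandbox demo pipeline (pre-processing, alignment to a reference, SNP detection) can be illustrated with a toy sketch. This is not the NCGAS implementation, which the transcript does not describe. The function names, the quality cutoff of 20, the minimum read length of 8, and the brute-force ungapped alignment are all assumptions chosen to keep the example short; production pipelines use dedicated aligners and variant callers.

```python
from collections import Counter, defaultdict


def trim_read(seq, quals, min_q=20):
    """STEP 1 (toy): drop low-quality bases from the 3' end of a read."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end]


def align_read(read, reference, max_mismatches=1):
    """STEP 2 (toy): best ungapped offset on the reference, or None."""
    best_pos, best_mm = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mm = sum(1 for a, b in zip(read, window) if a != b)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos


def call_snps(reads, reference, min_depth=2):
    """STEP 3 (toy): positions where the majority base differs from the reference."""
    pileup = defaultdict(Counter)  # reference position -> base counts
    for read in reads:
        pos = align_read(read, reference)
        if pos is None:
            continue  # read could not be placed within the mismatch budget
        for i, base in enumerate(read):
            pileup[pos + i][base] += 1
    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            snps.append((pos, reference[pos], base))
    return snps


def run_pipeline(raw_reads, reference, min_len=8):
    """Chain the three steps; raw_reads is a list of (sequence, qualities)."""
    trimmed = [trim_read(seq, quals) for seq, quals in raw_reads]
    usable = [r for r in trimmed if len(r) >= min_len]
    return call_snps(usable, reference)


# Made-up example data: two reads carry a T at reference position 9 (G).
reference = "ACGTTGCATGCCGATAGCTA"
raw_reads = [
    ("GTTGCATTCCGA", [30] * 12),
    ("GCATTCCGATAGTT", [30] * 12 + [5, 5]),  # last two bases are trimmed
    ("ACGTTGCATG", [30] * 10),
]
print(run_pipeline(raw_reads, reference))  # → [(9, 'G', 'T')]
```

Each step only consumes the output of the previous one, so intermediate results can stay in memory or on a shared file system. That is the property the demo relies on when it avoids repeated data file movement.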