Department of Bioinformatics and Computational Biology


Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis. Han Liang, Ph.D., Department of Bioinformatics and Computational Biology. 3/25/2014 @ Rice University

Outline: History; NGS Platforms; Applications; Bioinformatics Analysis; Challenges

Central Dogma

Sanger sequencing: DNA is fragmented; cloned into a plasmid vector; cyclic sequencing reaction; separation by electrophoresis; readout with fluorescent tags

Sanger vs NGS: Sanger sequencing was the only DNA sequencing method for 30 years, but the demand for greater throughput and more economical sequencing kept growing. NGS can process millions of sequence reads in parallel rather than 96 at a time (at 1/6 of the cost). Objections: fidelity, read length, infrastructure cost, and the need to handle a large volume of data.

Platforms: Roche/454 FLX: 2004; Illumina Solexa Genome Analyzer: 2006; Applied Biosystems SOLiD™ System: 2007; Helicos HeliScope™: recently available; Pacific Biosciences SMRT: launching 2010

Rapidly Decreasing Cost

Three Leading Sequencing Platforms: Roche 454; Illumina Solexa; Applied Biosystems SOLiD

The general experimental procedure (Wang et al., Nature Reviews Genetics 2009)

454 bead microreactor (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

Illumina (Solexa) bridge amplification (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

SOLiD color coding (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

Comparison of existing methods

Real Data: nucleotide space (Solexa)
@SRR002051.1 :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR002051.1 :8:1:325:773 length=33
IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)-
@SRR002051.2 :8:1:409:432 length=33
AAGTTATGAAATTGTAATTCCAATATCGTAAGC
+SRR002051.2 :8:1:409:432 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
@SRR002051.3 :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR002051.3 :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
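The reads above are in FASTQ format: each record has a header line starting with `@`, the bases, a separator line starting with `+`, and one quality character per base. The following is a minimal sketch of reading such records, assuming the strict four-line layout and a Phred+33 quality encoding (an assumption; the slide does not state the encoding, and older Solexa pipelines used other offsets). The names `FastqRecord` and `parse_fastq` are illustrative only.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (header, bases, '+' line, qualities).

    `offset` is the ASCII offset of the quality encoding; 33 is an assumption.
    """
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


# First record from the slide; the quality string is 25 'I' characters
# followed by the 8 lower-quality characters at the 3' end of the read.
example = [
    "@SRR002051.1 :8:1:325:773 length=33",
    "AAAGAACATTAAAGCTATATTATAAGCAAAGAT",
    "+SRR002051.1 :8:1:325:773 length=33",
    "I" * 25 + "'II@I$)-",
]

for record in parse_fastq(example):
    print(record.name, len(record.sequence), min(record.qualities))
```

Under the Phred+33 assumption, `I` decodes to a quality of 40, and the drop in quality toward the end of the read becomes visible as small integers.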

Real Data – color space SOLiD Data >1_24_47_F3

Data output difference among the three platforms: nucleotide space vs. color space; length of short reads: 454 (400~500 bp) > SOLiD (70 bp) ~ Solexa (36~120 bp)
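To make the distinction between nucleotide space and color space concrete, here is a small sketch of SOLiD-style two-base encoding, in which each color (0 to 3) describes a pair of adjacent bases. It uses the common trick that, with A=0, C=1, G=2, T=3, the color equals the XOR of the two base codes. As a simplification, the leading character is treated as a known first base; in real SOLiD output the leading base comes from the sequencing primer, which this sketch does not model. The function names are illustrative.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(bases: str) -> str:
    """Keep the first base, then emit one color per adjacent base pair."""
    colors = [
        str(BASE_TO_BITS[a] ^ BASE_TO_BITS[b]) for a, b in zip(bases, bases[1:])
    ]
    return bases[0] + "".join(colors)


def decode_colors(read: str) -> str:
    """Naively translate a color-space read back to nucleotides."""
    current = BASE_TO_BITS[read[0]]
    out = [read[0]]
    for color in read[1:]:
        current ^= int(color)  # each color tells us how to step to the next base
        out.append(BITS_TO_BASE[current])
    return "".join(out)


print(encode_colors("ATGGC"))   # A3103
print(decode_colors("A3103"))   # ATGGC
print(decode_colors("A3003"))   # ATTTA: one wrong color changes every later base
```

The last line illustrates why naive translation is fragile: a single miscalled color shifts all downstream bases, which is one reason color-space reads are usually aligned in color space rather than decoded first.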

Applications with “Digital output”: de novo genome assembly; genome re-sequencing; RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation); ChIP-Seq (protein-DNA interaction); epigenetic profiling

Ancient Genomes Resurrected: degraded state of the sample → mtDNA sequencing. Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10^6 bp). Problems: contamination from modern humans and co-isolation of bacterial DNA

Elucidating DNA-protein interactions through chromatin immunoprecipitation sequencing: DNA-protein interactions play a key part in regulating gene expression. ChIP: a technique to study DNA-protein interactions. Recent genome-wide ChIP-based studies of DNA-protein interactions read out ChIP-derived DNA sequences on NGS platforms. This gives insights into transcription factor and histone binding sites in the human genome and enhances our understanding of gene expression in the context of specific environmental stimuli

Discovering noncoding RNAs: the presence of ncRNAs in a genome is difficult to predict computationally with high certainty because of their evolutionary diversity. Detecting expression level changes that correlate with changes in environmental factors, with disease onset and progression, or with complex disease severity. Enhances the annotation of sequenced genomes (makes the impact of mutations more interpretable)

Metagenomics: characterizing the biodiversity found on Earth. The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches. Examples: ocean, acid mine site, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual

Defining variability in many human genomes: common variants have not yet completely explained complex disease genetics; rare alleles also contribute, as do structural variants and large and small insertions and deletions. Accelerating biomedical research

Epigenomic variation: enables the study of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development. Enhanced potential to combine the results of different experiments, for example correlative analyses of genome-wide methylation, histone binding patterns and gene expression.

Integrating Omics: mutation discovery; protein-DNA interaction; copy number variation; mRNA expression; microRNA expression; alternative splicing (Kahvejian et al. 2008)

Data Analysis Flow: SOLiD machine (raw data) → central server (basic processing: decoding, filtering and mapping) → local machine (downstream analysis)

Short Read Mapping DNA-Resequencing BLAST-like approach RNA-Seq

Read length and pairing: ACTTAAGGCTGACTAGC ... TCGTACCGATATGCTG. Short reads are problematic because short sequences often do not map uniquely to the genome. Solution #1: Get longer reads. Solution #2: Get paired reads.
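The two solutions can be illustrated on a toy reference that contains a repeat. The sketch below is only an illustration under simplifying assumptions: exact matching, a made-up 20-base reference, both mates on the same strand, and a hypothetical allowed insert-size window. Real aligners handle mismatches, strand orientation, and indexing very differently.

```python
from typing import List, Tuple

# Toy reference (hypothetical): "ACGTACG" occurs at positions 0 and 11.
REFERENCE = "ACGTACGTTTGACGTACGAA"


def find_all(reference: str, read: str) -> List[int]:
    """Return every 0-based start where the read matches exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


def place_pair(
    reference: str, read1: str, read2: str, min_insert: int, max_insert: int
) -> List[Tuple[int, int]]:
    """Keep placements where read2 lies downstream of read1 and the
    outer distance (start of read1 to end of read2) is within the window."""
    placements = []
    for p1 in find_all(reference, read1):
        for p2 in find_all(reference, read2):
            insert = p2 + len(read2) - p1
            if p2 > p1 and min_insert <= insert <= max_insert:
                placements.append((p1, p2))
    return placements


print(find_all(REFERENCE, "ACGTACG"))      # short read: two candidate positions
print(find_all(REFERENCE, "ACGTACGTT"))    # longer read: one position
print(place_pair(REFERENCE, "ACGTACG", "TTTG", 8, 15))  # pairing resolves it
```

The short read alone is ambiguous, while either extending it by two bases or requiring its mate to sit at a plausible distance leaves a single placement.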

Post-alignment Analysis: DNA-Seq: SNP calling; RNA-Seq: quantifying gene expression level

Concepts. The reference genome: hg19 (GRCh37); main assembly: Chr1-22, X, and Y; 3,095,677,412 bp. Target region: exome. Ensembl: 85.3 million bp (2.94%); RefSeq: 67.7 million bp (2.34%); CCDS: 31,266,049 bp (1.08%), consisting of 185,446 non-redundant exons

Target Coverage
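The slide text does not define target coverage, so the following is a sketch of one common way to summarize it: compute the read depth at every base of a target region, then report the mean depth and the fraction of bases covered at or above a chosen depth. The alignment representation (0-based start, aligned length) and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def depth_per_base(
    target_length: int, alignments: Sequence[Tuple[int, int]]
) -> List[int]:
    """Count how many alignments overlap each base of the target.

    Each alignment is (start, length) in target coordinates; parts that
    fall outside the target are ignored.
    """
    depth = [0] * target_length
    for start, length in alignments:
        for pos in range(max(start, 0), min(start + length, target_length)):
            depth[pos] += 1
    return depth


def mean_coverage(depth: Sequence[int]) -> float:
    return sum(depth) / len(depth)


def fraction_covered(depth: Sequence[int], min_depth: int) -> float:
    return sum(d >= min_depth for d in depth) / len(depth)


# Toy example: a 10 bp target and three 5 bp reads (the last one overhangs).
depth = depth_per_base(10, [(0, 5), (3, 5), (8, 5)])
print(depth)
print(mean_coverage(depth), fraction_covered(depth, 2))
```

Reporting both numbers is useful because a high mean depth can hide target bases that are covered too thinly for reliable variant calling.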

SOLiD color coding (Mardis, Annu. Rev. Genomics Hum. Genet. 2008)

SNP calling

Array-based High-throughput Dataset

Limitations of hybridization-based approaches: reliance on existing knowledge about the genome sequence; background noise and a limited dynamic range of detection; cross-experiment comparison is difficult; complicated normalization methods are required (Wang et al., Nature Reviews Genetics 2009)

Quantifying gene expression using RNA-Seq data. RPKM: Reads Per Kilobase of exon length per Million mapped reads
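The RPKM definition translates directly into a one-line calculation: divide a gene's read count by its exon length in kilobases and by the total number of mapped reads in millions. A minimal sketch, with hypothetical argument names and made-up example numbers:

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon length per Million mapped reads.

    Equivalent to read_count / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on a gene with 2,000 bp of exons,
# in a library with 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))  # 25.0
```

Normalizing by length and by library size is what allows expression levels to be compared between genes of different sizes and between samples sequenced to different depths.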

Large Dynamic Range (Mortazavi et al., Nature Methods 2008)

High Reproducibility (Mortazavi et al., Nature Methods 2008)

High Accuracy (Wang et al., Nature 2008)

Advantages of RNA-Seq: not limited to the existing genomic sequence; very low (if any) background signal; large dynamic range of detection; high reproducibility; high accuracy; less sample required; low cost per base (Wang et al., Nature Reviews Genetics 2009)

Huge amount of data! For a typical RNA-Seq SOLiD run: ~2 TB of image files; ~120 GB of text files for downstream analysis; ~75 M short reads per sample. Efficient methods for data storage and management are needed

Considerable sequencing error: high-quality image analysis is needed for base calling

Genome alignment and assembly: time-consuming and memory-demanding. To perform genome mapping for SOLiD data: 32-Opteron HP DL785 with 128 GB of RAM; 12~14 hours per sample. High-performance parallel computing is needed

Bioinformatics Challenges: efficient methods to store, retrieve and process huge amounts of data; reducing errors in image analysis and base calling; fast and accurate methods for genome alignment and assembly; new algorithms for downstream analyses

Experimental Challenges: library fragmentation; strand specificity (Wang et al., Nature Reviews Genetics 2009)

Questions & Answers. Han Liang. E-mail: hliang1@mdanderson.org Tel: 713-745-9815