Massive Parallel Sequencing


Massive Parallel Sequencing Kun Huang, PhD Department of Biomedical Informatics OSU CCC Bioinformatics Shared Resources

Introduction High throughput sequencing – a new paradigm Applications Solexa SOLiD 454/Roche Genome Sequencer Applications Genome sequencing microRNA screening Gene expression ChIP-seq Genome-worth of sequence, terabytes of data

What is ChIP-Sequencing? ChIP-Sequencing is a new frontier technology for analyzing protein interactions with DNA. ChIP-Seq: a combination of chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing. Allows mapping of protein–DNA interactions in vivo on a genome scale.

Workflow of ChIP-Seq Mardis, E.R. Nat. Methods 4, 613-614 (2007)

ChIP-seq Challenges: Millions of segments Mapping to genome Visualization Peak detection Data normalization …

Johnson et al, 2007 ChIP-Seq technology is used to understand in vivo binding of the neuron-restrictive silencer factor (NRSF) Results are compared to known binding sites ChIP-Seq signals agree strongly with the existing knowledge Sharp resolution of binding position New noncanonical NRSF binding motifs are identified

Robertson et al, 2007 ChIP-Seq technology is used to study genome-wide profiles of STAT1 DNA association STAT1 targets in interferon-γ-stimulated and unstimulated human HeLa S3 cells are compared The performance of ChIP-Seq is compared to the alternative protein-DNA interaction methods of ChIP-PCR and ChIP-chip. 41,582 and 11,004 putative STAT1 binding regions are identified in stimulated and unstimulated cells respectively.

Why ChIP-Sequencing? Current microarray and ChIP-chip designs require knowing the sequence of interest, such as a promoter, enhancer, or RNA-coding domain. Lower cost Less work in ChIP-Seq Higher accuracy Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.

Bioinformatics

Sequencers Solexa (Illumina): 1 GB of sequences in a single run, 35 bases in length. 454 Life Sciences (Roche Diagnostics): 25-50 MB of sequences in a single run, up to 500 bases in length. SOLiD (Applied Biosystems): 6 GB of sequences in a single run.

Illumina Genome Analysis System 8 lanes 100 tiles per lane

Sequencing

Sequencer Output Quality Scores Sequence Files

Sequence Files ~10 million sequences per lane ~500 MB files

Quality Score Files Quality scores describe the confidence of bases in each read The Solexa pipeline assigns a quality score to each of the four possible nucleotides for each sequenced base 9 million sequences (500 MB file) yield a ~6.5 GB quality score file
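A minimal sketch of how per-base scores like these can be read, assuming the Solexa log-odds convention Q = -10 * log10(p / (1 - p)) and assuming the four scores for a base are listed in the order A, C, G, T. The function names are hypothetical and are not part of the Solexa pipeline.

```python
def solexa_to_error_prob(q):
    """Convert a Solexa-scaled quality score to an error probability.

    Assumes the log-odds definition Q = -10 * log10(p / (1 - p)),
    which rearranges to p = 1 / (1 + 10 ** (Q / 10)).
    """
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


def call_base(scores):
    """Pick the most confident nucleotide from four per-base scores.

    `scores` holds one score per nucleotide; the A, C, G, T order
    is an assumption made for this illustration.
    """
    bases = "ACGT"
    best = max(range(4), key=lambda i: scores[i])
    return bases[best], scores[best]


if __name__ == "__main__":
    base, q = call_base([-40, 40, -40, -40])
    print(base, q, solexa_to_error_prob(q))
```

Storing four scores per base instead of one is a plausible reason the quality score file is so much larger than the sequence file.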

Bioinformatics Challenges Rapid mapping of these short sequence reads to the reference genome Visualize mapping results Thousands of enriched regions Peak analysis Peak detection Finding exact binding sites Compare results of different experiments Normalization Statistical tests

Mapping of Short Oligonucleotides to the Reference Genome Mapping Methods Need to allow mismatches and gaps SNP locations Sequencing errors Reading errors Indexing and hashing: genome or oligonucleotide reads Use of quality scores Use of SNP knowledge Performance Partitioning the genome or sequence reads

Mapping Methods: Indexing the Genome Fast sequence similarity search algorithms (like BLAST) Not specifically designed for mapping millions of query sequences Take a very long time, e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST) Indexing the genome is memory expensive

SOAP (Li et al, 2008) Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding Load reference genome into memory For the human genome, 14 GB RAM is required for storing reference sequences and index tables 300 (gapped) to 1200 (ungapped) times faster than BLAST
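The 2-bits-per-base idea can be sketched as follows. This is an illustration only: the code table (A=0, C=1, G=2, T=3) and the function names are assumptions, not SOAP's actual internals.

```python
# Hypothetical 2-bit code table; the real assignment in SOAP may differ.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(a, b, length):
    """Count differing bases between two encoded sequences of equal length."""
    x = a ^ b                      # non-zero 2-bit pair wherever bases differ
    mask = int("01" * length, 2)   # selects the low bit of every 2-bit pair
    diff = (x | (x >> 1)) & mask   # collapse each pair to a single flag bit
    return bin(diff).count("1")


if __name__ == "__main__":
    print(encode("ACGT"))                                        # 27
    print(count_mismatches(encode("ACGT"), encode("ACGA"), 4))   # 1
```

Comparing packed integers with XOR lets many bases be checked in one machine operation, which is the kind of bit-level shortcut that makes this representation attractive.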

SOAP (Li et al, 2008) 2 mismatches or 1-3 bp continuous gap Errors accumulate during the sequencing process Much higher number of sequencing errors at the 3’-end (sometimes making the reads unalignable to the reference genome) Iteratively trim several base pairs at the 3’-end and redo the alignment Improves sensitivity
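The trim-and-retry loop described on this slide might look like the sketch below. The aligner is passed in as a function; `exact_align` is a toy stand-in based on exact string search, and the `min_len` and `step` defaults are arbitrary illustration values rather than SOAP settings.

```python
def align_with_trimming(read, align, min_len=20, step=2):
    """Retry alignment while trimming bases from the 3' end.

    `align` is any function returning a hit (or None) for a read.
    Returns (hit, aligned_length), or (None, 0) if nothing aligns.
    """
    while len(read) >= min_len:
        hit = align(read)
        if hit is not None:
            return hit, len(read)
        read = read[:-step]  # drop error-prone bases at the 3' end
    return None, 0


def exact_align(reference):
    """Toy aligner (assumption): exact substring search in `reference`."""
    def align(read):
        pos = reference.find(read)
        return pos if pos >= 0 else None
    return align


if __name__ == "__main__":
    reference = "TTGACCAGTACGGATCCATGCAAGT"
    read = "ACCAGTACGGATCCATGCTT"  # last two bases are sequencing errors
    print(align_with_trimming(read, exact_align(reference), min_len=16))
```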

Mapping Methods: Indexing the Oligonucleotide Reads ELAND (Cox, unpublished) “Efficient Large-Scale Alignment of Nucleotide Databases” (Solexa Ltd.) SeqMap (Jiang, 2008) “Mapping massive amount of oligonucleotides to the genome” RMAP (Smith, 2008) “Using quality scores and longer reads improves accuracy of Solexa read mapping” MAQ (Li, 2008) “Mapping short DNA sequencing reads and calling variants using mapping quality scores”

Mapping Algorithm (2 mismatches) Read: GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG Seeds: GATGCATTG CTATGCCTC CCAGTCCGC AACTTCACG Exact match of seeds against the genome Indexed table of exactly matching seeds Approximate search around the exactly matching seeds

Mapping Algorithm (2 mismatches) Partition reads into 4 seeds {A,B,C,D} At least 2 seeds must map with no mismatches Scan genome to identify locations where the seeds match exactly 6 possible combinations of the seeds to search {AB, CD, AC, BD, AD, BC} 6 scans to find all candidates Do approximate matching around the exactly-matching seeds. Determine all targets for the reads Ins/del can be incorporated The reads are indexed and hashed before scanning genome Bit operations are used to accelerate mapping Each nt is encoded into 2 bits
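The seed strategy above can be sketched in plain Python. With at most 2 mismatches spread over 4 seeds, at least 2 seeds must be error-free, so hashing the reads by each of the 6 seed pairs and scanning the genome once per pair finds every candidate. This sketch assumes all reads share one length divisible by 4, uses strings instead of 2-bit codes, and ignores insertions and deletions; the function names are hypothetical and do not come from ELAND, SeqMap, or any other tool.

```python
from collections import defaultdict
from itertools import combinations


def seeds(seq):
    """Split a sequence into 4 equal-length seeds (length must divide by 4)."""
    k = len(seq) // 4
    return [seq[i * k:(i + 1) * k] for i in range(4)]


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, genome, max_mismatches=2):
    """Return {read: sorted genome positions} allowing up to 2 mismatches."""
    length = len(reads[0])
    hits = defaultdict(set)
    for i, j in combinations(range(4), 2):      # the 6 seed pairs
        table = defaultdict(list)               # hash reads by seed pair
        for read in reads:
            s = seeds(read)
            table[(s[i], s[j])].append(read)
        for pos in range(len(genome) - length + 1):   # one genome scan
            window = genome[pos:pos + length]
            w = seeds(window)
            for read in table.get((w[i], w[j]), ()):
                # verify the candidate around the exactly matching seeds
                if hamming(read, window) <= max_mismatches:
                    hits[read].add(pos)
    return {read: sorted(p) for read, p in hits.items()}


if __name__ == "__main__":
    original = "GATGCATTGCTATGCCTCCCAGTCCGCAACTTCACG"
    genome = "ACGTACGTACGT" + original + "TTGCA"
    read = "T" + original[1:20] + "C" + original[21:]  # 2 mismatches
    print(map_reads([read], genome))
```

Indexing the reads rather than the genome keeps the hash table proportional to the number of reads, at the cost of one pass over the genome per seed pair.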

ELAND (Cox, unpublished) Commercial sequence mapping program that comes with the Solexa machine Allows at most 2 mismatches Maps sequences up to 32 nt in length All sequences have to be the same length

RMAP (Smith et al, 2008) Improve mapping accuracy Possible sequencing errors at 3’-ends of longer reads Base-call quality scores Use of base-call quality scores Quality cutoff High quality positions are checked for mismatches Low quality positions always induce a match Quality control step eliminates reads with too many low quality positions Allow any number of mismatches
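A minimal sketch of the quality-cutoff rule described on this slide: positions below the cutoff are treated as matching, and reads with too many such positions are discarded. The function names and the parameters `cutoff` and `max_low` are hypothetical and are not taken from RMAP itself.

```python
def quality_aware_mismatches(read, quals, target, cutoff):
    """Count mismatches at high-quality positions only.

    Positions with quality below `cutoff` always count as a match.
    """
    return sum(
        1
        for r, q, t in zip(read, quals, target)
        if q >= cutoff and r != t
    )


def passes_quality_control(quals, cutoff, max_low):
    """Reject reads that have too many low-quality positions."""
    low = sum(1 for q in quals if q < cutoff)
    return low <= max_low


if __name__ == "__main__":
    quals = [30, 30, 5, 30]
    print(quality_aware_mismatches("ACGT", quals, "ACTT", cutoff=10))  # 0
    print(quality_aware_mismatches("ACGT", quals, "TCTT", cutoff=10))  # 1
    print(passes_quality_control(quals, cutoff=10, max_low=1))         # True
```

The quality control step matters because a read made up mostly of low-quality positions would otherwise match almost anywhere.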

[Figure: read-mapping flow diagram. Labels: Quality filter, Map to reference genome, Mapped to a unique location, Mapped to multiple locations, No mapping, Low quality. Read counts shown: 7.2 M, 1.8 M, 2.5 M, 0.5 M, 12 M, 3 M]

Bioinformatics Challenges Rapid mapping of these short sequence reads to the reference genome Visualize mapping results Thousands of enriched regions Peak analysis Peak detection Finding exact binding sites Compare results of different experiments Normalization Statistical tests

Visualization BED files are built to summarize mapping results BED files can be easily visualized in the Genome Browser http://genome.ucsc.edu
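As an illustration of what a BED record for a mapped read looks like, the sketch below writes the six leading BED columns (chrom, chromStart, chromEnd, name, score, strand), with a 0-based start and an exclusive end. The helper name and the example read are made up for this illustration.

```python
def to_bed_line(chrom, start, length, name, strand, score=0):
    """Format one mapped read as a tab-separated 6-column BED line.

    `start` is 0-based; the end coordinate is exclusive (start + length).
    """
    fields = [chrom, str(start), str(start + length), name, str(score), strand]
    return "\t".join(fields)


if __name__ == "__main__":
    # Hypothetical 35-base read mapped to the forward strand of chr1.
    print(to_bed_line("chr1", 100, 35, "read1", "+"))
```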

Visualization: Genome Browser Robertson, G. et al. Nat. Methods 4, 651-657 (2007)

Visualization: Custom 300 kb region from mouse ES cells Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)

Visualization Huang, 2008 (unpublished)

Huang, 2008 (unpublished)

Bioinformatics Challenges Rapid mapping of these short sequence reads to the reference genome Visualize mapping results Thousands of enriched regions Peak analysis Peak detection Finding exact binding sites Compare results of different experiments Normalization Statistical tests

Peak Analysis Peak Detection ChIP-Peak Analysis Module (Swiss Institute of Bioinformatics) ChIPSeq Peak Finder (Wold Lab, Caltech)

Peak Analysis Finding Exact Binding Site Determining the exact binding sites from short reads generated from ChIP-Seq experiments SISSRs (Site Identification from Short Sequence Reads) (Jothi 2008) MACS (Model-based Analysis of ChIP-Seq) (Zhang et al, 2008)

Bioinformatics Challenges Rapid mapping of these short sequence reads to the reference genome Visualize mapping results Thousands of enriched regions Peak analysis Peak detection Finding exact binding sites Compare results of different experiments Normalization Statistical tests

Compare Samples Huang, 2008 (unpublished)

Compare Samples Fold change HPeak: An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data Xu et al, 2008 Advanced statistics
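A fold-change comparison between two samples could be sketched as below. Scaling read counts to reads per million mapped reads and adding a pseudocount are assumptions chosen for this illustration; the slides do not specify the normalization, and this is not the HPeak method.

```python
def fold_change(count_a, total_a, count_b, total_b, pseudocount=1.0):
    """Fold change of a region's read count between samples A and B.

    Counts are scaled to reads per million mapped reads so that samples
    with different sequencing depth are comparable; the pseudocount
    avoids division by zero for regions with no reads in sample B.
    """
    rate_a = (count_a + pseudocount) / total_a * 1e6
    rate_b = (count_b + pseudocount) / total_b * 1e6
    return rate_a / rate_b


if __name__ == "__main__":
    # Hypothetical region: 99 reads in A, 49 in B, 10 M mapped reads each.
    print(fold_change(99, 10_000_000, 49, 10_000_000))
```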

QUESTIONS?