Next Generation Sequencing

Sequencing techniques
- ChIP-seq
- MBD-seq (MIRA-seq)
- BS-seq
- RNA-seq
- miRNA-seq

ChIP-seq
- ChIP-seq is a new frontier technology for analyzing in vivo protein-DNA interactions.
- It combines chromatin immunoprecipitation (ChIP) with ultra-high-throughput massively parallel sequencing.
- It allows protein-DNA interactions to be mapped in vivo on a genome scale.

Workflow of ChIP-Seq Mardis, E.R. Nat. Methods 4, 613-614 (2007)

The advantages of ChIP-seq
- Current microarray and ChIP-chip designs require prior knowledge of the sequence of interest, such as a promoter, enhancer, or RNA-coding domain.
- Lower cost
- Higher resolution
- Higher accuracy
- Alterations in transcription-factor binding in response to environmental stimuli can be evaluated for the entire genome in a single experiment.

Sequencers
- Solexa (Illumina)
  - 1 GB of sequence in a single run
  - Reads 35 bases in length
- 454 Life Sciences (Roche Diagnostics)
  - 25-50 MB of sequence in a single run
  - Reads up to 500 bases in length
- SOLiD (Applied Biosystems)
  - 6 GB of sequence in a single run

Illumina Genome Analysis System
- 8 lanes
- 100 tiles per lane

Sequencing

Sequencer Output
- Quality scores
- Sequence files

Sequence Files
- 10-40 million reads per lane
- ~500 MB files

Quality Score Files
- Quality scores describe the confidence of the base calls in each read.
- The Solexa pipeline assigns a quality score to each of the four possible nucleotides for every sequenced base.
- 9 million sequences (a 500 MB sequence file) correspond to a ~6.5 GB quality score file.
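How a quality score relates to base-call confidence can be sketched in a few lines. The sketch below assumes the standard Phred definition, Q = -10 log10(p), and the odds-based definition used by the early Solexa pipeline, Q = -10 log10(p / (1 - p)). The four-scores-per-base layout in A, C, G, T order and the function names are assumptions made for illustration only.

```python
BASES = "ACGT"  # assumed order of the four per-base scores


def phred_to_error_prob(q):
    """Phred scale: Q = -10 * log10(p), so p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def solexa_to_error_prob(q):
    """Early Solexa scale: Q = -10 * log10(p / (1 - p))."""
    odds = 10 ** (-q / 10)
    return odds / (1 + odds)


def call_base(scores):
    """Pick the nucleotide with the highest of the four scores."""
    best = max(range(4), key=lambda i: scores[i])
    return BASES[best], scores[best]


base, score = call_base((-40, -40, 40, -40))
print(base, solexa_to_error_prob(score))
```

Storing four scores for every base, rather than one, is consistent with the quality file being many times larger than the sequence file.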

Bioinformatics Challenges
- Rapid mapping of short sequence reads to the reference genome
- Visualizing mapping results
  - Thousands of enriched regions
- Peak analysis
  - Peak detection
  - Finding exact binding sites
- Comparing results of different experiments
  - Normalization
  - Statistical tests
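Peak detection and normalization can be illustrated with a deliberately naive sketch. It counts mapped read positions in fixed windows, scales the control counts by library size, and reports windows whose fold enrichment exceeds a threshold. This is not the method of any published peak caller; the window size, pseudocount, fold threshold, and function names are all hypothetical choices.

```python
from collections import Counter


def window_counts(positions, window):
    """Count read start positions per fixed-size window."""
    return Counter(p // window for p in positions)


def enriched_windows(chip_positions, control_positions,
                     window=200, min_fold=4.0, pseudocount=1.0):
    """Return (start, end, chip_count) for windows enriched over control."""
    chip = window_counts(chip_positions, window)
    control = window_counts(control_positions, window)
    # Normalize for sequencing depth: scale control to the ChIP library size.
    scale = len(chip_positions) / max(len(control_positions), 1)
    peaks = []
    for w, n in sorted(chip.items()):
        # The pseudocount avoids division by zero in windows with no control reads.
        expected = (control.get(w, 0) + pseudocount) * scale
        if n / expected >= min_fold:
            peaks.append((w * window, (w + 1) * window, n))
    return peaks
```

A real analysis would replace the fixed fold threshold with a statistical test and would use strand information to narrow each region to a binding site.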

Mapping of Short Oligonucleotides to the Reference Genome
- Mapping methods
  - Need to allow mismatches and gaps
    - SNP locations
    - Sequencing errors
    - Reading errors
  - Indexing and hashing
    - Genome
    - Oligonucleotide reads
  - Use of quality scores
  - Use of SNP knowledge
- Performance
  - Partitioning the genome or the sequence reads

Mapping Methods: Indexing the Genome
- Fast sequence similarity search algorithms (like BLAST)
  - Not specifically designed for mapping millions of query sequences
  - Take a very long time, e.g. 2 days to map half a million sequences to a 70 MB reference genome (using BLAST)
- Indexing the genome is memory-expensive

SOAP (Li et al., 2008)
- Allows 2 mismatches or a 1-3 bp continuous gap
- Both reads and reference genome are converted to a numeric data type using 2-bits-per-base coding
- Loads the reference genome into memory
  - For the human genome, 14 GB of RAM is required to store the reference sequences and index tables
- 300 (gapped) to 1200 (ungapped) times faster than BLAST
- Errors accumulate during the sequencing process
  - Much higher number of sequencing errors at the 3'-end (sometimes making the reads unalignable to the reference genome)
  - Iteratively trims several base pairs at the 3'-end and redoes the alignment
  - Improves sensitivity
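Two of the ideas above, 2-bits-per-base coding and iterative 3'-end trimming, can be sketched as follows. This is an illustration under stated assumptions, not SOAP's implementation: the bit assignment for each base, the use of exact string search as a stand-in for the aligner, and the `min_length` and `step` parameters are all hypothetical.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit assignment


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(a, b, length):
    """Count differing bases between two packed sequences of equal length."""
    diff = a ^ b
    mask = (4 ** length - 1) // 3          # binary 0101...01, one bit per base
    collapsed = (diff | (diff >> 1)) & mask  # 1 where either bit of a base differs
    return bin(collapsed).count("1")


def map_with_trimming(read, reference, min_length=4, step=2):
    """Trim the 3'-end step by step until the read matches (exact match stand-in)."""
    current = read
    while len(current) >= min_length:
        pos = reference.find(current)
        if pos != -1:
            return pos, len(current)
        current = current[:-step]
    return None
```

With the packed form, comparing two reads costs one XOR and a bit count instead of a character-by-character loop.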

Mapping Methods: Indexing the Oligonucleotide Reads
- ELAND (Cox, unpublished): "Efficient Large-Scale Alignment of Nucleotide Databases" (Solexa Ltd.)
- SeqMap (Jiang, 2008): "Mapping massive amount of oligonucleotides to the genome"
- RMAP (Smith, 2008): "Using quality scores and longer reads improves accuracy of Solexa read mapping"
- MAQ (Li, 2008): "Mapping short DNA sequencing reads and calling variants using mapping quality scores"

Mapping Algorithm (2 mismatches)
- Partition each read into 4 seeds {A, B, C, D}
  - With at most 2 mismatches, at least 2 seeds must map with no mismatches
- Scan the genome to identify locations where the seeds match exactly
  - 6 possible combinations of the seeds to search: {AB, CD, AC, BD, AD, BC}
  - 6 scans to find all candidates
- Do approximate matching around the exactly matching seeds
  - Determine all targets for the reads
  - Insertions/deletions can be incorporated
- The reads are indexed and hashed before scanning the genome
- Bit operations are used to accelerate mapping
  - Each nucleotide is encoded in 2 bits
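The seed strategy above can be sketched as follows. The sketch hashes the reads by every pair of seeds, scans the genome once (checking all 6 seed pairs at each position rather than making 6 separate scans), and verifies candidates by counting mismatches. It assumes equal-length reads whose length is divisible by 4 and ignores insertions and deletions. The function names are hypothetical and this is not the code of any of the tools listed.

```python
from collections import defaultdict
from itertools import combinations


def split_seeds(seq):
    """Split a sequence into 4 equal-length seeds."""
    k = len(seq) // 4
    return [seq[i * k:(i + 1) * k] for i in range(4)]


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_reads(reads, genome, max_mismatches=2):
    """Map equal-length reads to the genome, allowing up to 2 mismatches."""
    read_len = len(reads[0])
    # Index the reads: key = (seed slot i, seed slot j, seed i, seed j).
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        seeds = split_seeds(read)
        for i, j in combinations(range(4), 2):
            index[(i, j, seeds[i], seeds[j])].append(read_id)
    # Scan the genome; any window sharing two exact seeds is a candidate.
    hits = defaultdict(set)
    for pos in range(len(genome) - read_len + 1):
        window = genome[pos:pos + read_len]
        seeds = split_seeds(window)
        for i, j in combinations(range(4), 2):
            for read_id in index.get((i, j, seeds[i], seeds[j]), ()):
                if hamming(reads[read_id], window) <= max_mismatches:
                    hits[read_id].add(pos)
    return {read_id: sorted(p) for read_id, p in hits.items()}
```

The guarantee comes from the pigeonhole principle: 2 mismatches can damage at most 2 of the 4 seeds, so some pair of seeds always survives intact.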

ELAND (Cox, unpublished)
- Commercial sequence mapping program that comes with the Solexa machine
- Allows at most 2 mismatches
- Maps sequences up to 32 nt in length
- All sequences have to be the same length

RMAP (Smith et al., 2008)
- Improves mapping accuracy
  - Possible sequencing errors at the 3'-ends of longer reads
  - Base-call quality scores
- Use of base-call quality scores
  - Quality cutoff
    - High-quality positions are checked for mismatches
    - Low-quality positions always induce a match
  - A quality control step eliminates reads with too many low-quality positions
- Allows any number of mismatches
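The quality-cutoff rule above can be sketched as follows: positions below the cutoff are treated as matching whatever the reference holds, and reads with too many such positions are discarded beforehand. The function names and the example cutoff values are hypothetical, and this is an illustration of the idea rather than RMAP's code.

```python
def quality_aware_mismatches(read, quals, target, cutoff):
    """Count mismatches at high-quality positions only.

    Positions with quality below the cutoff always count as a match.
    """
    return sum(
        1
        for base, q, ref in zip(read, quals, target)
        if q >= cutoff and base != ref
    )


def passes_quality_control(quals, cutoff, max_low_quality):
    """Reject reads with too many low-quality positions."""
    low = sum(1 for q in quals if q < cutoff)
    return low <= max_low_quality


read, quals = "ACGT", [30, 30, 5, 30]
if passes_quality_control(quals, cutoff=10, max_low_quality=1):
    print(quality_aware_mismatches(read, quals, "ACTT", cutoff=10))
```

Without the quality control step, a read made up mostly of low-quality positions would match almost anywhere.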

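[Figure: read-mapping flowchart. Reads pass a quality filter and are mapped to the reference genome. Outcome categories shown: mapped to a unique location, mapped to multiple locations, no mapping, low quality. Read counts shown in the figure: 7.2 M, 1.8 M, 2.5 M, 0.5 M, 12 M, 3 M.]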

Visualization
- BED files are built to summarize mapping results
- BED files can be easily visualized in the UCSC Genome Browser: http://genome.ucsc.edu
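A mapped read can be written as one BED line, as the short sketch below shows. It relies on the BED convention that the start coordinate is 0-based and the end coordinate is exclusive. The input fields (chromosome, 0-based start, read length, name, strand) are an assumed representation of a mapping result, and the function name is hypothetical.

```python
def to_bed_line(chrom, start, length, name, strand, score=0):
    """Format one mapped read as a 6-column BED line.

    BED uses a 0-based start and an exclusive end, so end = start + length.
    """
    end = start + length
    return "\t".join(str(field) for field in (chrom, start, end, name, score, strand))


mapped = [("chr1", 100, 36, "read1", "+"), ("chr2", 5000, 36, "read2", "-")]
for chrom, start, length, name, strand in mapped:
    print(to_bed_line(chrom, start, length, name, strand))
```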

Visualization: Genome Browser Robertson, G. et al. Nat. Methods 4, 651-657 (2007)

Visualization: Custom 300 kb region from mouse ES cells Mikkelsen,T.S. et al. Nature 448, 553-562 (2007)

Screen shot for ZNF263 peaks Frietze et al JBC 2010

ChIP-seq peak analysis programs
- SISSRs (Site Identification from Short Sequence Reads): Jothi et al., NAR, 2008.
- MACS (Model-based Analysis of ChIP-Seq): Zhang et al., Genome Biology, 2008.
- QuEST (Genome-wide analysis of transcription factor binding sites based on ChIP-seq data): Valouev, A. et al., Nature Methods, 2008.
- PeakSeq (PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls): Rozowsky, J. et al., Nature Biotech., 2009.
- FindPeaks (FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology): Fejes, A.P. et al., Bioinformatics, 2008.
- Hpeak (An HMM-based algorithm for defining read-enriched regions from massive parallel sequencing data): Xu et al., Bioinformatics, 2008.

MBD-seq (MIRA-seq)
- The methyl-CpG binding domain (MBD)-based capture technology (MBDCap) is used to capture methylated sites.
- Double-stranded methylated DNA fragments can be detected.
- It is sensitive to different methylation densities.
- Genome-wide sequencing is used to obtain the sequence of each short fragment.
- The sequenced reads are mapped to the human genome to find their locations.

BALM: high-resolution program for MBD-seq
[Figure: BALM workflow. Experimental steps shown: fragmentation of DNA containing methylated and unmethylated CpGs, MBD2 enrichment, elution (500 mM, 1000 mM, 2000 mM), sequencing and alignment, BALM analysis. Analysis steps shown: scan the genome for signal-enriched regions (initial scan using the tag-shifting method), measure the distribution of forward- and reverse-strand tags around target sites, estimate the parameters of the bi-asymmetric-Laplace model (MLE), decompose the mixture model using Expectation Maximization (EM), and define hypermethylated regions and a methylation score for each CpG dinucleotide. An unenriched input is shown for comparison.]
Lan et al., PLoS ONE, 2011, 6:e22226

Application on MBD-seq data (MCF7)

BS-seq
- Genomic DNA is treated with sodium bisulphite (BS), which converts cytosine, but not methylcytosine, to uracil; the treated DNA is then analyzed by high-throughput sequencing.
- Truly single-base resolution
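The logic of bisulphite conversion can be sketched in silico. Unmethylated cytosines become uracil and are read as T after amplification, so at a reference cytosine the fraction of reads still showing C estimates the methylation level. The sketch handles a single strand only, assumes reads already aligned to the same coordinates, and uses hypothetical function names.

```python
def bisulfite_convert(seq, methylated_positions):
    """Convert unmethylated C to T (uracil reads as T after amplification)."""
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def methylation_level(aligned_reads, position):
    """Fraction of reads showing C (methylated) among C and T calls at one site."""
    calls = [read[position] for read in aligned_reads]
    c, t = calls.count("C"), calls.count("T")
    return c / (c + t) if c + t else None


reads = ["ACGTT", "ATGTT", "ACGTT", "ACGCT"]
print(bisulfite_convert("ACGCT", {1}), methylation_level(reads, 1))
```

Because every cytosine is scored individually, the estimate is made per base, which is where the single-base resolution comes from.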

RNA-seq
- RNA-seq is a new approach to transcriptome profiling that uses deep-sequencing technologies.
- Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes.
- RNA-seq also provides a far more precise measurement of the levels of transcripts and their isoforms than other methods.

RNA-seq protocol

The advantages of RNA-seq
- Single-base resolution
- High throughput
- Low background noise
- Ability to distinguish different isoforms and allelic expression
- Relatively low cost