Previous Lecture: NGS Alignment

Spring CHIBI Courses
- BMI Foundations I: Bioinformatics (BMSC-GA 4456), Constantin Aliferis. Study classic bioinformatics/genomics papers and reproduce the data analysis; for advanced informatics students.
- Integrative Genomic Data Analysis (BMSC-GA 4453), Jinhua Wang. Build competence in quantitative methods for the analysis of high-throughput genomic data.
- Microbiomics Informatics (BMSC-GA 4440), Alexander Alekseyenko. Analysis of microbial community data generated by sequencing technologies: preprocess raw sequencing data into abundance tables, associate abundance with clinical phenotype and outcomes.
- Next Generation Sequencing (BMSC-GA 4452), Stuart Brown. An overview of next-generation sequencing informatics methods for data pre-processing, alignment, variant detection, structural variation, ChIP-seq, RNA-seq, and metagenomics.
- Proteomics Informatics (BMSC-GA 4437), David Fenyo. A practical introduction to proteomics and mass spectrometry workflows, experimental design, and data analysis.

This Lecture ChIP-seq & RNA-seq

Learning Objectives
ChIP-seq:
- Experimental methods
- Transcription factors and epigenetics
- Alignment and data processing
- Finding peaks: MACS algorithm
- Annotation
RNA-seq:
- Experimental methods
- Alignment challenges (splice sites)
- TopHat
- Counting reads per gene
- Normalization
- HTSeq-count and Cufflinks
- Statistics of differential expression for RNA-seq

ChIP-seq
Combine sequencing with chromatin immunoprecipitation.
Select (and identify) fragments of DNA that interact with specific proteins or marks, such as:
- Transcription factors
- Modified histones
- Methylation
- RNA polymerase (survey actively transcribed portions of the genome)
- DNA polymerase (investigate DNA replication)
- DNA repair enzymes

ChIP-chip [Pre-sequencing technology]
- Do chromatin IP with YFA (Your Favorite Antibody)
- Take IP-purified DNA fragments, label, and hybridize to a microarray containing (putative) promoter (or TF binding) sequences from lots of genes
- Estimate binding, relate to DNA binding of the protein targeted by the antibody
Limitations:
- limited to well-annotated genomes
- need to build special microarrays
- suffers from hybridization bias
- assumes all TF binding sites are known and correctly located on the genome

ChIP-seq workflow (figure): immunoprecipitate; release DNA; high-throughput sequencing; map sequence tags to genome.

Alignment
- Place millions of short read sequence ‘tags’ (25-50 bp) on the genome
- Finds perfect, 1-mismatch, and 2-mismatch alignments; no indels (BWA)
- Aligns ~80% of PF (pass-filter) tags to the human/mouse genome
- We parse alignment files to keep only unique alignments (removes 2%-5% of reads as ‘multi-mapped’)

ChIP-seq for TF (SISSRs software) Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36: 5221-31

Saturation
How many sequence reads are needed to find all of the binding targets in the genome? Look for a plateau.
Figure: 100% = 15,291 peaks. Pol2 data: 11M reads vs. 12M control reads, peaks found with MACS, data sub-sampled.
Rozowsky, et al. Nature Biotech. Vol 27-1, Jan 2009.

ChIP-seq Challenges
- We want to find the peaks (enriched regions = protein binding sites on genome)
- Goals include: accuracy (location of peak on genome), sensitivity, and reproducibility
- Challenges: non-random background, PCR artifacts, difficult to estimate false negatives
- Very difficult to compare samples to find changes in TF binding (many borderline peaks)

Peakfinding
- Find enriched regions on the genome (high tag density) = “peaks”
- Enriched vs. what?
- A statistical approach assumes an evenly or randomly distributed background
- A simple Poisson-distributed background clearly does not hold for real data
- Any threshold is essentially arbitrary
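
As a minimal sketch of the naive statistical approach described on this slide, the code below scores a window against a uniform Poisson background. The function names, the window size, and the genome size used in the usage note are illustrative assumptions, not part of the lecture or of any peak caller.

```python
import math

def uniform_background_rate(total_tags, genome_size, window):
    """Expected tags per window if tags were spread evenly over the genome."""
    return total_tags * window / genome_size

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Note: 1 - cdf loses precision for very small p-values; this is only a sketch.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)
```

For example, with a hypothetical 10 million tags, a 3 Gb genome, and 300 bp windows, `uniform_background_rate` gives an expectation of one tag per window, and `poisson_sf(10, 1.0)` is the chance of seeing ten or more tags there under that model. Because real background is not uniform, such p-values are typically too optimistic, which is the point the slide makes.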

Compare to Background
- Goal is to make ‘fold change’ measurements
- What is the appropriate background?
  - Input DNA (no IP)
  - IP with non-specific antibody (IgG)
  - [We mostly use input DNA]
- Must first identify the “peak region” in the sample, then compare tag counts vs. background
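
The fold-change comparison against background can be sketched as follows. This is an illustration only: scaling the control to the ChIP library size and the pseudocount (added to avoid division by zero) are assumptions introduced here, not a method stated in the lecture.

```python
def fold_enrichment(chip_tags, control_tags, chip_total, control_total,
                    pseudocount=1.0):
    """Fold change of ChIP over control tags within one peak region.

    The control count is scaled to the ChIP library size so that libraries
    sequenced to different depths are comparable. The pseudocount is a
    hypothetical choice for this sketch.
    """
    scale = chip_total / control_total
    return (chip_tags + pseudocount) / (control_tags * scale + pseudocount)
```

With 40 ChIP tags and 10 control tags in a region, 10 million ChIP reads, and 20 million control reads, the scaled control count is 5, so the fold change is 8 without a pseudocount and 41/6 with the default.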

MACS
Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137
- Open source Unix software (Python!)
- MACS improves the spatial resolution of binding sites by combining the information of sequencing tag position and orientation, using empirical models for the length of the sequenced ChIP fragments (slides + and - strand reads toward the center of the fragment)
- MACS uses a dynamic Poisson distribution (local background count in the control) to effectively capture local biases in the genome sequence, allowing for more sensitive and robust prediction
- Uses the control to call “random” peaks and from them estimates the FDR
Feng J, Liu T, Zhang Y. Using MACS to identify peaks from ChIP-Seq data. Curr Protoc Bioinformatics. 2011 Jun;Chapter 2:Unit 2.14.
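
The “dynamic Poisson” idea can be illustrated with a short sketch. The MACS paper describes taking the maximum of a genome-wide background rate and rates estimated from 1 kb, 5 kb, and 10 kb windows of the control centered on the candidate peak. The function below is a simplified illustration of that rule, not the MACS code: the function name and argument layout are hypothetical, and it assumes the control counts have already been scaled to the ChIP library size.

```python
def local_lambda(control_counts, peak_width, genome_rate_per_bp):
    """Expected tag count for a candidate peak under a local background model.

    control_counts maps a window size in bp (e.g. 1000, 5000, 10000) to the
    number of control tags in a window of that size centered on the peak.
    The largest per-bp rate wins, so a locally 'noisy' region raises the
    bar a candidate peak has to clear.
    """
    rates = [genome_rate_per_bp]
    for window_bp, count in control_counts.items():
        rates.append(count / window_bp)
    return max(rates) * peak_width
```

For a 200 bp candidate with 10, 20, and 30 control tags in the 1 kb, 5 kb, and 10 kb windows and a genome-wide rate of 0.001 tags per bp, the 1 kb window dominates and the expected count is about 2 tags. The resulting value would then serve as the Poisson mean when scoring the ChIP tag count.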

BED format
- BED format defines a genomic interval as positions on a reference genome.
- An interval can be anything with a location: gene, exon, binding site, region of low complexity, etc.
- MACS outputs ChIP-seq peaks in BED format.
- BED files can also specify color, width, and some other formatting.

Minimal three-column example:

chromosome  start      end
chr1        213941196  213942363
chr1        213942363  213943530
chr1        213943530  213944697
chr2        158364697  158365864
chr2        158365864  158367031
chr3        127477031  127478198
chr3        127478198  127479365
chr3        127479365  127480532

Example with a track line and display columns (name, score, strand, thickStart, thickEnd, itemRgb):

track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
chr7  127471196  127472363  Pos1  0  +  127471196  127472363  255,0,0
chr7  127472363  127473530  Pos2  0  +  127472363  127473530  255,0,0
chr7  127473530  127474697  Pos3  0  +  127473530  127474697  255,0,0
chr7  127474697  127475864  Pos4  0  +  127474697  127475864  255,0,0
chr7  127475864  127477031  Neg1  0  -  127475864  127477031  0,0,255
chr7  127477031  127478198  Neg2  0  -  127477031  127478198  0,0,255
chr7  127478198  127479365  Neg3  0  -  127478198  127479365  0,0,255
chr7  127479365  127480532  Pos5  0  +  127479365  127480532  255,0,0
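
A minimal sketch of reading BED records such as those on this slide. The `BedInterval` type and `parse_bed` function are names made up for this illustration. It relies on the BED convention that start is 0-based and end is exclusive, so an interval's length is simply end minus start.

```python
from typing import NamedTuple, Optional

class BedInterval(NamedTuple):
    chrom: str
    start: int                    # 0-based, inclusive
    end: int                      # exclusive
    name: Optional[str] = None    # column 4, if present
    strand: Optional[str] = None  # column 6, if present

def parse_bed(lines):
    """Parse BED lines, skipping blank lines, comments, and track/browser lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else None
        strand = fields[5] if len(fields) > 5 else None
        intervals.append(
            BedInterval(fields[0], int(fields[1]), int(fields[2]), name, strand)
        )
    return intervals

example = [
    'track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"',
    "chr7 127471196 127472363 Pos1 0 + 127471196 127472363 255,0,0",
    "chr7 127475864 127477031 Neg1 0 - 127475864 127477031 0,0,255",
    "chr1 213941196 213942363",
]
peaks = parse_bed(example)
```

Here `peaks` holds three intervals; the track line is skipped, and the three-column record simply has no name or strand.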

Remove Duplicates
- In some ChIP-seq samples, PCR amplification of IP-enriched DNA creates artifacts (highly duplicated fragments)
- Huge differences depending on the target of the antibody and the amount of IP DNA collected
- “Complexity” of the library
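
Duplicate removal can be sketched as keeping at most a fixed number of reads per position and strand. This is a simplified illustration; the tuple layout and the `max_per_position` parameter are assumptions made here, and real tools work from alignment files rather than in-memory tuples.

```python
def remove_duplicates(reads, max_per_position=1):
    """Keep at most max_per_position reads per (chrom, 5' position, strand).

    reads is an iterable of (chrom, five_prime_pos, strand) tuples.
    """
    seen = {}
    kept = []
    for read in reads:
        n = seen.get(read, 0)
        if n < max_per_position:
            kept.append(read)
        seen[read] = n + 1
    return kept

def duplication_rate(reads):
    """Fraction of reads that are redundant copies of another read."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)
```

The duplication rate gives a rough handle on the library “complexity” mentioned above: a low-complexity library yields many reads stacked at identical positions.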

PCR ‘stacks’ Always in F-R pairs, ~200 bp apart

% of Duplicates varies

Different IP Targets
Huge difference between transcription factors and histone modifications as targets of IP.

Transcription factors (TF):
- sequence-specific binding motifs
- few thousand sites
- binding region ~50 bp
- oriented tags
- yes/no binding
- promoters or enhancers

Histone modifications:
- not sequence-specific
- tens to hundreds of thousands of sites
- large binding region (~2 kb)
- tags not oriented
- signal may be scaled
- associated with almost all transcribed genes

Mikkelsen, Lander, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature (2007) 448: 553-562.

H3K4me3

Normalization
- How to compare lanes with different numbers of reads?
- Unequal read counts will bias fold-change calculations
- Simple method: express all counts in ‘peak’ regions per million reads
- This does not work well for >2x differences in read counts

Evaluation
- Peaks near promoters of known genes (TSS)
  - Generally a high %
  - As parameters become less stringent, more peaks are found and the % near TSS declines
- Estimate false positive rate
  - Purely statistical (Poisson or Monte Carlo)
  - Compare 2 background samples (QuEST)
  - Reverse sample and background (MACS)
- Can’t estimate false negative rate: we don’t know the ‘true’ number of binding sites

Evaluation
- Overlap with ChIP-chip data
  - What is an overlap? What % overlap is good?
- Reproducibility
  - Need to define overlap (we use an overlap of 1 bp)
  - Very important for biological conclusions
  - Essential for comparisons of different conditions
  - Must have replicate samples!!
  - Trade-off: reproducibility vs. sensitivity
- Synthetic data
  - Allows calculation of sensitivity and specificity
  - How similar to real data? (All synthetic data has bias)
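
The 1 bp overlap criterion for reproducibility can be sketched as below. The function names are hypothetical, intervals are assumed to be half-open (chrom, start, end) tuples as in BED, and the quadratic scan is only suitable for small examples.

```python
def overlaps(a, b, min_overlap=1):
    """True if intervals a and b share at least min_overlap bases."""
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return shared >= min_overlap

def fraction_reproduced(peaks_a, peaks_b, min_overlap=1):
    """Fraction of peaks in replicate A that overlap any peak in replicate B."""
    if not peaks_a:
        return 0.0
    hits = sum(
        any(overlaps(p, q, min_overlap) for q in peaks_b) for p in peaks_a
    )
    return hits / len(peaks_a)
```

Raising `min_overlap` makes the reproducibility definition stricter, which is one way to explore the trade-off between reproducibility and sensitivity noted on the slide. Note that the measure is asymmetric: the fraction of A found in B can differ from the fraction of B found in A.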

Histone modification (H3K4) ChIP-seq: composite image of sequence reads at promoters of all RefSeq genes.

The Use of Next Generation Sequencing to Study Transcriptomes: RNA-seq

RNA-seq Measures the Transcriptome Takes advantage of the rapidly dropping cost of Next-Generation DNA sequencing Measures gene expression in true genome-wide fashion (all the RNA) Also enables detection of mutations (SNPs), alternative splicing, allele specific expression, and fusion genes More accurate and better dynamic range than Microarray Can be used to detect miRNA, ncRNA, and other non-coding RNA

RNA-seq vs. qPCR

Depth of Coverage With the Illumina HiSeq producing >200 million reads per sample, what depth of coverage is needed for RNA-seq? Can we multiplex several samples per lane and save $$ on sequencing? For expression profiling (and detection of differentially expressed genes), probably yes, 2-4 samples per lane is practical

With 100 million reads, 81% of genes are detected at FPKM ≥ 0.05. Each additional 100 million reads detects ~3% more genes. Toung, et al. Genome Res. 2011 June; 21(6): 991–998.

Illumina mRNA Sequencing Random primer PCR Poly-A selection Fragment & size-select

Sample prep can create 3’ or 5’ bias. Figure panels: strand oriented protocol; no bias (low coverage at ends of transcript); 3’ bias (poly-A selection).

Detect Small RNAs – depends on sample prep method

RNA-seq informatics workflow (figure): genome mapping; splice junction fragments (predict novel junctions/exons); counts; normalize; differential expression; gene lists. Oshlack et al. Genome Biology 2010, 11:220

RNA-seq Alignment Challenges
- Using RNA-seq for gene expression requires counting sequence reads per gene
- Must map reads to genes, but this is a more difficult problem than mapping reads to a reference genome
- Introns create big gaps in alignment
- Small reads mean many short overlaps at one end or the other of intron gaps
- What to do with reads that map to introns or outside exon boundaries?
- What about overlapping genes?

TopHat RNA-seq can be used to directly detect alternatively spliced mRNAs.

Map reads to exons & junctions

TopHat
The seed and extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
Trapnell C et al. Bioinformatics 2009;25:1105-1111
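
The seed idea in this caption can be illustrated with a toy sketch. It is not TopHat's implementation: the function names and the `flank` length are hypothetical, reads are scanned linearly instead of through an index, and the extension step with mismatches is omitted.

```python
def junction_seed(upstream_exon, downstream_exon, flank=4):
    """Join the last `flank` bases before the donor site with the first
    `flank` bases after the acceptor site, as they appear in spliced mRNA."""
    return upstream_exon[-flank:] + downstream_exon[:flank]

def reads_containing_seed(seed, unmapped_reads):
    """Return initially unmapped reads that contain the junction seed.

    These are only candidates; each would still need a full alignment
    to the exons on either side of the putative splice.
    """
    return [read for read in unmapped_reads if seed in read]
```

A read that spans the junction contains the seed even though no contiguous stretch of the genome matches it, which is why such reads fail the initial genome alignment.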

Real data generally support existing annotation Data from Costa lab

RNA-seq informatics
- Filter out rRNA, tRNA, mitoRNA
- Align to genome
- Find splice junction fragments (join exon boundaries)
- Differential expression
- Alternatively spliced transcripts
- Novel genes/exons
- Sequence variants (SNPs, indels, translocations)
- Allele-specific expression

Count Reads per gene
- Need a reference genome with exon information
- How to count partial alignments, novel splices, etc.?
- Simple or complex model?
  - Simple: HTSeq-count
  - Complex: Cufflinks
- Normalization methods affect the counts very dramatically

HTSeq-count A simple Python tool. Relies entirely on an accurate annotation of genes and exons in GFF file.
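
The “simple” counting model can be sketched as follows. This is an illustration in the spirit of HTSeq-count, not its code: strand is ignored, the data layout is made up for the example, and the special counters `__no_feature` and `__ambiguous` only mirror the names HTSeq-count reports.

```python
from collections import Counter

def count_reads_per_gene(reads, exons):
    """Count reads per gene from annotated exons.

    reads: iterable of (chrom, start, end), half-open coordinates.
    exons: iterable of (gene_id, chrom, start, end), half-open coordinates.
    A read is assigned only if every exon it overlaps belongs to one gene.
    """
    counts = Counter()
    for chrom, r_start, r_end in reads:
        genes = {
            gene for gene, e_chrom, e_start, e_end in exons
            if e_chrom == chrom and e_start < r_end and r_start < e_end
        }
        if len(genes) == 1:
            counts[genes.pop()] += 1
        elif not genes:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts
```

Because everything depends on the exon list, an incomplete or wrong annotation shifts reads into the no-feature or ambiguous bins, which is the dependence on the GFF file that the slide points out.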

Cufflinks Isoform Models

Differential Expression ADM Data from Costa Lab

Normalization
- Differential Expression (DE) requires comparison of 2 or more RNA-seq samples.
- The number of reads (coverage) will not be exactly the same for each sample.
- Problem: need to scale RNA counts per gene to total sample coverage
  - Solution: divide counts by millions of reads
- Problem: longer genes have more reads, which gives a better chance to detect DE
  - Solution: divide counts by gene length
- Result = RPKM (Reads Per Kilobase per Million reads)
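
The two scaling steps combine into the RPKM formula, RPKM = count × 10^9 / (total reads × gene length in bp). A minimal sketch, with a hypothetical function name:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total reads must be positive")
    reads_per_million = read_count / (total_mapped_reads / 1e6)
    return reads_per_million / (gene_length_bp / 1e3)
```

For example, 500 reads on a 2,000 bp gene in a sample of 10 million mapped reads is 50 reads per million, and dividing by 2 kb gives an RPKM of 25.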

Better Normalization
- RPKM assumes:
  - Total amount of RNA per cell is constant
  - Most genes do not change expression
- RPKM is invalid if there are a few very highly expressed genes that have a dramatic change in expression (they dominate the pool of reads)
- Better to use “Upper Quartile” (75th percentile) or “Quantile” normalization
- Different normalization methods give different results (different DE genes and different p-value rankings)
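
Upper-quartile normalization can be sketched as scaling each sample so that its 75th-percentile count matches a common value. The details below are assumptions chosen for illustration: zero counts are excluded before taking the percentile (one common variant), the percentile uses the nearest-rank rule, and samples are rescaled to the mean upper quartile.

```python
import math

def upper_quartile(counts):
    """75th percentile (nearest rank) of the non-zero counts in one sample."""
    nonzero = sorted(c for c in counts if c > 0)
    if not nonzero:
        raise ValueError("sample has no non-zero counts")
    rank = math.ceil(0.75 * len(nonzero))
    return nonzero[rank - 1]

def upper_quartile_normalize(samples):
    """Scale each sample so all share the same upper-quartile count.

    samples maps a sample name to a list of per-gene counts (same gene order).
    """
    quartiles = {name: upper_quartile(c) for name, c in samples.items()}
    target = sum(quartiles.values()) / len(quartiles)
    return {
        name: [c * target / quartiles[name] for c in counts]
        for name, counts in samples.items()
    }
```

Unlike scaling by total reads, the upper quartile is barely affected when a handful of very highly expressed genes change dramatically, which is the failure mode of RPKM described above.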

Statistics of DE
- mRNA levels are variable in cells/tissues/organisms over time/treatment/tissue etc.
- Like microarrays, need replicates to separate biological variability from experimental variability
- If there is high experimental variability, then variance within replicates will be high and statistical significance for DE will be difficult to find
- Best methods to discover DE are coupled with sophisticated approaches to normalization
- Best to ignore very low expressing genes: RPKM < 1

Popular DE Statistical methods
- Cufflinks-Cuffdiff
  - part of the TopHat software suite; easy to use
  - uses FPKM normalization
  - complex model for counting reads among splice variants
  - can be set to ignore novel variants
  - estimates variance in log fold change for each gene using permutations
  - finds the most DE genes, high false positive rate
- edgeR
  - requires raw count data, does its own normalization
  - estimates standard deviation (dispersion) with a weighted combination of individual gene (gene-wise) and global measures
  - statistical model is the Negative Binomial distribution (has a dispersion parameter)
  - Fisher’s Exact test (for 2-sample), or generalized linear model (complex design)
  - acceptable tradeoff of sensitivity and specificity
- Many others: DESeq, SAMseq, baySeq
- Many rather inconclusive benchmarking studies

DE genes by different methods

Differentially expressed genes 70% of DE genes validated by qPCR Data from Meruelo Lab

Alternative Splicing Data from Costa Lab

Good SNP data from Zavadil lab

Novel Genomes RNA-seq can be used to annotate genomes – gene discovery, exon mapping. data from Desplan lab

Summary
ChIP-seq:
- Experimental methods
- Transcription factors and epigenetics
- Alignment and data processing
- Finding peaks: MACS algorithm
- Annotation
RNA-seq:
- Experimental methods
- Alignment challenges (splice sites)
- TopHat
- Counting reads per gene
- Normalization
- HTSeq-count and Cufflinks
- Statistics of differential expression for RNA-seq

Next Lecture: Signal Processing