2 Spring CHIBI Courses
  BMI Foundations I: Bioinformatics (BMSC-GA 4456), Constantin Aliferis
    Study classic bioinformatics/genomics papers and reproduce their data analyses; for advanced informatics students.
  Integrative Genomic Data Analysis (BMSC-GA 4453), Jinhua Wang
    Build competence in quantitative methods for the analysis of high-throughput genomic data.
  Microbiomics Informatics (BMSC-GA 4440), Alexander Alekseyenko
    Analysis of microbial community data generated by sequencing technologies: preprocess raw sequencing data into abundance tables, and associate abundance with clinical phenotypes and outcomes.
  Next Generation Sequencing (BMSC-GA 4452), Stuart Brown
    An overview of next-generation sequencing informatics methods for data pre-processing, alignment, variant detection, structural variation, ChIP-seq, RNA-seq, and metagenomics.
  Proteomics Informatics (BMSC-GA 4437), David Fenyo
    A practical introduction to proteomics and mass spectrometry workflows, experimental design, and data analysis.
4 Learning Objectives
  ChIP-seq experimental methods
  Transcription factors and epigenetics
  Alignment and data processing
  Finding peaks: MACS algorithm
  Annotation
  RNA-seq experimental methods
  Alignment challenges (splice sites)
  TopHat
  Counting reads per gene
  Normalization
  HTSeq-count and Cufflinks
  Statistics of differential expression for RNA-seq
5 ChIP-seq
  Combine sequencing with chromatin immunoprecipitation (ChIP)
  Select (and identify) fragments of DNA that interact with specific proteins, such as:
    Transcription factors
    Modified histones
    Methylation
    RNA polymerase (survey actively transcribed portions of the genome)
    DNA polymerase (investigate DNA replication)
    DNA repair enzymes
6 ChIP-chip [Pre-sequencing technology]
  Do chromatin IP with YFA (Your Favorite Antibody)
  Take IP-purified DNA fragments, label them, and hybridize to a microarray containing (putative) promoter (or TF binding) sequences from lots of genes
  Estimate binding, relate to DNA binding of the protein targeted by the antibody
  Limitations:
    limited to well-annotated genomes
    need to build special microarrays
    suffers from hybridization bias
    assumes all TF binding sites are known and correctly located on the genome
7 ChIP-seq
  Workflow figure, steps: Immunoprecipitate; Release DNA; High-throughput sequencing; Map sequence tags to genome
8 Alignment
  Place millions of short read sequence ‘tags’ (25-50 bp) on the genome
  BWA finds perfect, 1-mismatch, and 2-mismatch alignments; no indels
  Aligns ~80% of PF tags to the human/mouse genome
  We parse alignment files to keep only unique alignments (removes the 2%-5% of reads that are ‘multi-mapped’)
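The parsing step for unique alignments might look like the sketch below. It assumes the alignments are in SAM format and that a mapping quality (MAPQ, column 5) of 0 marks a multi-mapped read; that threshold is an assumption and depends on the aligner and its settings. The function name and example records are hypothetical.

```python
def unique_alignments(sam_lines, min_mapq=1):
    """Yield SAM records that are mapped and have MAPQ >= min_mapq.

    Assumption: MAPQ 0 marks reads that align equally well to
    more than one place (multi-mapped reads).
    """
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # bitwise FLAG column
        mapq = int(fields[4])             # MAPQ column
        if flag & 0x4:                    # 0x4 = read is unmapped
            continue
        if mapq >= min_mapq:
            yield line


# Hypothetical records, for illustration only.
example = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t37\t36M\t*\t0\t0\tACGT\tIIII",
    "read2\t0\tchr1\t200\t0\t36M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
kept = list(unique_alignments(example))
```

Here only read1 is kept: read2 has MAPQ 0 and read3 is unmapped.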
9 ChIP-seq for TF (SISSRS software) Jothi, et al. Genome-wide identification of in vivo protein–DNA binding sites from ChIP-Seq data. NAR (2008), 36:
12 Saturation
  How many sequence reads are needed to find all of the binding targets in the genome?
  Sub-sample the data and look for a plateau in the number of peaks found
  Figure: Pol2 data, 11M reads vs. 12M control reads; peaks found with MACS; data sub-sampled; 100% = 15,291 peaks
  Rozowsky, et al. Nature Biotech. Vol 27-1, Jan 2009.
13 ChIP-seq Challenges
  We want to find the peaks (enriched regions = protein binding sites on the genome)
  Goals include: accuracy (location of peak on genome), sensitivity, & reproducibility
  Challenges: non-random background, PCR artifacts, difficult to estimate false negatives
  Very difficult to compare samples to find changes in TF binding (many borderline peaks)
14 Peakfinding
  Find enriched regions on the genome (high tag density) = “peaks”
  Enriched vs. what?
  A statistical approach assumes an evenly or randomly distributed background
    A uniform Poisson background clearly does not hold for real data
    Any threshold is essentially arbitrary
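To make the "purely statistical" approach concrete, the sketch below computes the Poisson probability of seeing at least k tags in a window, assuming tags are spread uniformly over the genome. The function names and the example numbers are hypothetical, and the uniform-background assumption is exactly the one the slide warns against; this only shows what such a threshold means.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), summing the upper tail directly.

    Assumes lam > 0 and is intended for the small lam values typical
    of a single window (at most 1000 tail terms are summed).
    """
    if k <= 0:
        return 1.0
    # log P(X = k), computed in log space to avoid overflow
    log_term = -lam + k * math.log(lam) - math.lgamma(k + 1)
    term = math.exp(log_term)
    total = 0.0
    i = k
    while term > 0.0 and i < k + 1000:
        total += term
        i += 1
        term *= lam / i               # P(X = i) from P(X = i - 1)
        if term < total * 1e-17:      # remaining tail is negligible
            break
    return min(1.0, total)


def window_pvalue(tags_in_window, total_tags, genome_size, window_size):
    """P-value of a window under a uniform genome-wide background."""
    lam = total_tags * window_size / genome_size   # expected tags per window
    return poisson_sf(tags_in_window, lam)


# Hypothetical numbers: 10M tags, 3 Gb genome, 200 bp window, 20 tags seen.
p = window_pvalue(20, 10_000_000, 3_000_000_000, 200)
```

With these numbers the expected count per window is below 1, so 20 tags gives a vanishingly small p-value; real backgrounds are locally much higher in some regions, which is why a single global rate overcalls peaks there.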
15 Compare to Background
  Goal is to make ‘fold change’ measurements
  What is the appropriate background?
    Input DNA (no IP)
    IP with non-specific antibody (IgG)
    [We mostly use input DNA]
  Must first identify the “peak region” in the sample, then compare tag counts vs. background
16 MACS
  Zhang et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol (2008) vol. 9 (9) pp. R137
  Open source Unix software (Python!)
  MACS improves the spatial resolution of binding sites by combining sequencing tag position and orientation, using empirical models for the length of the sequenced ChIP fragments (slides + and - strand reads toward the center of the fragment)
  MACS uses a dynamic Poisson distribution (local background count in the control) to capture local biases in the genome sequence, allowing for more sensitive and robust prediction
  Uses the control to calculate “random” peaks and sets the FDR
  Feng J, Liu T, Zhang Y. Using MACS to identify peaks from ChIP-Seq data. Curr Protoc Bioinformatics Jun;Chapter 2:Unit 2.14.
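The two ideas on this slide can be sketched in a few lines. This is not the MACS source code: the function names are hypothetical, and the sketch assumes the control counts have already been scaled to the sample's sequencing depth. The local rate follows the rule described in the MACS paper, taking the maximum of the genome-wide rate and the rates estimated from windows of increasing width around the candidate peak.

```python
def shift_tag(position, strand, fragment_length):
    """Shift a tag's 5' position by d/2 toward the fragment center.

    Plus-strand tags move right and minus-strand tags move left,
    where d (fragment_length) is the modeled ChIP fragment length.
    """
    half = fragment_length // 2
    return position + half if strand == "+" else position - half


def local_lambda(control_counts, window_size, lambda_bg):
    """Dynamic Poisson rate for one candidate peak.

    control_counts maps a window width in bp (for example 1000, 5000,
    10000) to the number of control tags in a window of that width
    centered on the candidate peak. Each count is rescaled to the
    size of the peak window, and the largest rate wins.
    """
    candidates = [lambda_bg]
    for width, count in control_counts.items():
        candidates.append(count * window_size / width)
    return max(candidates)


# Hypothetical values: a 200 bp peak window and a low genome-wide rate.
lam = local_lambda({1000: 10, 5000: 20, 10000: 30}, 200, 0.5)
```

In this example the 1 kb window gives the highest rate (2.0 expected tags), so a peak here must clear a stricter threshold than the genome-wide rate of 0.5 would require.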
17 BED format
  BED format defines a genomic interval as positions on a reference genome.
  An interval can be anything with a location: gene, exon, binding site, region of low complexity, etc.
  MACS outputs ChIP-seq peaks in BED format
  BED files can also specify color, width, and some other formatting.
  [Example tables; coordinate values not recoverable]
    Three-column example with columns: chromosome, start, end
    Colored example with header: track name="ItemRGBDemo" description="Item RGB demonstration" itemRgb="On"
      rows named Pos and Neg, each with its own itemRgb color
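A minimal sketch of reading BED intervals follows. The coordinates are hypothetical and are not the values from the slide. BED start positions are 0-based and end positions are exclusive, so the interval length is simply end minus start.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int      # 0-based, inclusive
    end: int        # exclusive
    name: str = ""


def parse_bed(lines):
    """Parse BED lines, skipping track/browser lines and comments."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        fields = line.split()
        name = fields[3] if len(fields) > 3 else ""   # 4th column is optional
        intervals.append(Interval(fields[0], int(fields[1]), int(fields[2]), name))
    return intervals


# Hypothetical coordinates, for illustration only.
example = [
    'track name="example" description="hypothetical peaks"',
    "chr1\t1000\t1500\tpeak_1",
    "chr2\t200\t260",
]
peaks = parse_bed(example)
lengths = [p.end - p.start for p in peaks]
```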
20 Remove Duplicates
  In some ChIP-seq samples, PCR amplification of IP-enriched DNA creates artifacts (highly duplicated fragments)
  Huge differences depending on the target of the antibody and the amount of IP DNA collected
  “Complexity” of the library
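One simple way to remove duplicates is to keep at most one tag per (chromosome, 5' position, strand). The sketch below does that and also reports the fraction of distinct positions as a rough complexity measure; that definition of complexity is an assumption chosen for illustration, and the function name and data are hypothetical.

```python
def remove_duplicates(tags, max_copies=1):
    """Keep at most max_copies tags per (chrom, five_prime_pos, strand).

    Returns the kept tags and the fraction of distinct positions
    among all tags, used here as a rough library-complexity measure.
    """
    seen = {}
    kept = []
    for tag in tags:
        copies = seen.get(tag, 0)
        if copies < max_copies:
            kept.append(tag)
        seen[tag] = copies + 1
    complexity = len(seen) / sum(seen.values()) if seen else 0.0
    return kept, complexity


# Hypothetical tags: one position is sequenced three times.
tags = [("chr1", 100, "+")] * 3 + [("chr1", 100, "-"), ("chr1", 250, "+")]
kept, complexity = remove_duplicates(tags)
```

Here 5 tags collapse to 3 distinct positions, giving a complexity of 0.6. Tags at the same position on opposite strands are treated as different fragments.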
23 Different IP Targets
  Huge difference between transcription factors and histone modifications as targets of IP
  TF:
    sequence-specific binding motifs
    few thousand sites
    binding region ~50 bp
    oriented tags
    yes/no binding
    promoters or enhancers
  Histone mods:
    not sequence-specific
    tens to hundreds of thousands of sites
    large binding region (~2 kb)
    tags not oriented
    signal may be scaled
    associated w/ almost all transcribed genes
28 Normalization
  How to compare lanes with different numbers of reads?
    Differences in read depth will bias fold-change calculations
  Simple method: express all counts in ‘peak’ regions per million reads
    This does not work well for >2x differences in read counts.
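The simple per-million scaling can be sketched as follows. The peak names and counts are hypothetical.

```python
def counts_per_million(peak_counts, total_reads):
    """Scale raw tag counts in peak regions to counts per million reads."""
    return {peak: count * 1_000_000 / total_reads
            for peak, count in peak_counts.items()}


# Hypothetical lanes: sample A was sequenced twice as deeply as sample B.
sample_a = counts_per_million({"peak_1": 120, "peak_2": 30}, total_reads=12_000_000)
sample_b = counts_per_million({"peak_1": 60, "peak_2": 45}, total_reads=6_000_000)
fold_change = {peak: sample_a[peak] / sample_b[peak] for peak in sample_a}
```

After scaling, peak_1 shows no change (the raw 2x difference was only depth), while peak_2 drops to one third of its level in sample B.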
30 Evaluation
  Peaks near promoters of known genes (TSS)
    Generally a high %
    As parameters become less stringent, more peaks are found and the % near TSS declines
  Estimate false positive rate
    Purely statistical (Poisson or Monte Carlo)
    Compare 2 bg samples (QuEST)
    Reverse sample & bg (MACS)
  Can’t estimate false negative rate: the ‘true’ number of binding sites is unknown
31 Evaluation
  Overlap with ChIP-chip data
    What is an overlap?
    What % overlap is good?
  Reproducibility
    Need to define overlap (we use overlap of 1 bp)
    Very important for biological conclusions
    Essential for comparisons of different conditions
    Must have replicate samples!!
    Trade-off: reproducibility vs. sensitivity
  Synthetic data
    Allows calculation of sensitivity & specificity
    How similar to real data? (All synthetic data has bias)
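The 1 bp overlap criterion can be made explicit with a short sketch. Intervals are assumed to be half-open (start inclusive, end exclusive), as in BED; the function names and peak coordinates are hypothetical.

```python
def overlaps(a, b, min_overlap=1):
    """True if intervals a and b share at least min_overlap bases.

    Each interval is (chrom, start, end) with an exclusive end.
    """
    if a[0] != b[0]:
        return False
    shared = min(a[2], b[2]) - max(a[1], b[1])
    return shared >= min_overlap


def fraction_reproduced(peaks_a, peaks_b, min_overlap=1):
    """Fraction of peaks in peaks_a that overlap any peak in peaks_b."""
    if not peaks_a:
        return 0.0
    hits = sum(
        any(overlaps(a, b, min_overlap) for b in peaks_b)
        for a in peaks_a
    )
    return hits / len(peaks_a)


# Hypothetical replicate peak lists.
rep1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
rep2 = [("chr1", 199, 300), ("chr1", 600, 700), ("chr3", 100, 200)]
frac = fraction_reproduced(rep1, rep2)
```

Only the first peak is reproduced: it shares exactly 1 bp with a peak in the second replicate, while the second pair merely touches (0 bp shared). Raising min_overlap trades sensitivity for stricter reproducibility. The all-pairs comparison is fine for a sketch but too slow for large peak lists.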
32 Histone modification (H3K4) ChIP-seq
  Composite image of sequence reads at promoters of all RefSeq genes.
33 The Use of Next Generation Sequencing to Study Transcriptomes: RNA-seq
34 RNA-seq Measures the Transcriptome
  Takes advantage of the rapidly dropping cost of next-generation DNA sequencing
  Measures gene expression in true genome-wide fashion (all the RNA)
  Also enables detection of mutations (SNPs), alternative splicing, allele-specific expression, and fusion genes
  More accurate and better dynamic range than microarrays
  Can be used to detect miRNA and other non-coding RNAs (ncRNA)
37 Depth of Coverage
  With the Illumina HiSeq producing >200 million reads per sample, what depth of coverage is needed for RNA-seq?
  Can we multiplex several samples per lane and save $$ on sequencing?
  For expression profiling (and detection of differentially expressed genes), probably yes; 2-4 samples per lane is practical
38 100 million reads: 81% of genes with FPKM ≥ 0.05
  Each additional 100 million reads detects ~3% more genes
  Toung, et al. Genome Res June; 21(6): 991–998.
39 Illumina mRNA Sequencing
  Figure labels: Random primer PCR; Poly-A selection; Fragment & size-select
40 Sample prep can create 3’ or 5’ bias
  Figure panels: (strand-oriented protocol); no bias (low coverage at ends of transcript); 3’ bias (poly-A selection)
41 Detect Small RNAs – depends on sample prep method
43 RNA-seq Alignment Challenges
  Using RNA-seq for gene expression requires counting sequence reads per gene
  Must map reads to genes, but this is a more difficult problem than mapping reads to a reference genome
    Introns create big gaps in alignment
    Small reads mean many short overlaps at one end or the other of intron gaps
    What to do with reads that map to introns or outside exon boundaries?
    What about overlapping genes?
44 TopHatRNA-seq can be used to directly detect alternatively spliced mRNAs.
46 TopHat
  Trapnell C et al. Bioinformatics 2009;25:1105-1111
  Figure: the seed-and-extend alignment used to match reads to possible splice sites. For each possible splice site, a seed is formed by combining a small amount of sequence upstream of the donor and downstream of the acceptor. This seed, shown in dark gray, is used to query the index of reads that were not initially mapped by Bowtie. Any read containing the seed is checked for a complete alignment to the exons on either side of the possible splice. In the light gray portion of the alignment, TopHat allows a user-specified number of mismatches. Because reads typically contain low-quality base calls on their 3′ ends, TopHat only examines the first 28 bp on the 5′ end of each read by default.
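The seed construction described in the caption can be illustrated with a toy example. This is a simplified sketch, not TopHat's implementation: the function names, the toy genome, and the flank length are hypothetical, a linear scan stands in for TopHat's read index, and the follow-up step that checks the full alignment on both exons is omitted.

```python
def splice_seed(genome, donor_end, acceptor_start, flank=4):
    """Build a seed spanning a candidate splice junction.

    donor_end is the index just past the last base of the upstream
    exon; acceptor_start is the index of the first base of the
    downstream exon. The seed joins `flank` exon bases from each side.
    """
    upstream = genome[donor_end - flank:donor_end]
    downstream = genome[acceptor_start:acceptor_start + flank]
    return upstream + downstream


def reads_spanning(seed, unmapped_reads):
    """Return the reads that contain the junction seed."""
    return [read for read in unmapped_reads if seed in read]


# Toy genome: exon 1, an intron (GT...AG), exon 2. All sequences hypothetical.
exon1, intron, exon2 = "AAACCCGG", "GTAAGTTTTTAG", "TTGGCCAA"
genome = exon1 + intron + exon2
seed = splice_seed(genome, donor_end=len(exon1),
                   acceptor_start=len(exon1) + len(intron))
hits = reads_spanning(seed, ["ACCCGGTTGGCC", "CCGGGTAAGT"])
```

The first read crosses the exon-exon junction and contains the seed; the second runs from exon 1 into the intron and does not.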
47 Real data generally support existing annotation
  Data from Costa lab
49 Count Reads per Gene
  Need a reference genome with exon information
  How to count partial alignments, novel splices, etc.?
  Simple or complex model?
    Simple: HTSeq-count
    Complex: Cufflinks
  Normalization methods affect the counts dramatically
50 HTSeq-count
  A simple Python tool. Relies entirely on an accurate annotation of genes and exons in a GFF file.
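The simple counting model can be sketched as below. This is an illustration of the idea, not the HTSeq API: a read is assigned to a gene only if it overlaps exons of exactly one gene, reads overlapping several genes are set aside as ambiguous, and reads overlapping no exon are counted separately. The function name, the category labels, and the coordinates are assumptions made for the example.

```python
from collections import Counter


def count_reads(reads, exons):
    """Count reads per gene with a simple overlap rule.

    reads: list of (chrom, start, end), half-open intervals.
    exons: list of (chrom, start, end, gene_id), half-open intervals.
    """
    counts = Counter()
    for chrom, start, end in reads:
        # Set of genes with at least one exon overlapping this read
        genes = {gene for c, s, e, gene in exons
                 if c == chrom and s < end and start < e}
        if len(genes) == 1:
            counts[genes.pop()] += 1
        elif genes:
            counts["__ambiguous"] += 1
        else:
            counts["__no_feature"] += 1
    return counts


# Hypothetical annotation: geneA and geneB overlap on chr1.
exons = [("chr1", 100, 200, "geneA"),
         ("chr1", 300, 400, "geneA"),
         ("chr1", 380, 500, "geneB")]
reads = [("chr1", 150, 186), ("chr1", 310, 346),
         ("chr1", 385, 421), ("chr1", 600, 636)]
counts = count_reads(reads, exons)
```

The third read falls in the region shared by both genes and is not assigned to either, which shows how directly the result depends on the annotation.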
53 Normalization
  Differential expression (DE) requires comparison of 2 or more RNA-seq samples.
  The number of reads (coverage) will not be exactly the same for each sample
  Problem: need to scale read counts per gene to total sample coverage
    Solution: divide counts by the number of millions of reads in the sample
  Problem: longer genes have more reads, which gives a better chance to detect DE
    Solution: divide counts by gene length (in kb)
  Result = RPKM (Reads Per Kilobase per Million reads)
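The two scaling steps combine into a single formula, shown here as a short sketch with hypothetical numbers.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of gene per Million mapped reads.

    Equivalent to (reads / (length in kb)) / (total reads in millions).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical genes in a sample with 10 million mapped reads.
short_gene = rpkm(500, 2_000, 10_000_000)
long_gene = rpkm(1_000, 4_000, 10_000_000)
```

Both genes come out at the same RPKM: the longer gene collects twice as many reads only because it is twice as long.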
54 Better Normalization
  RPKM assumes:
    Total amount of RNA per cell is constant
    Most genes do not change expression
  RPKM is invalid if a few very highly expressed genes change dramatically in expression (they dominate the pool of reads)
  Better to use “Upper Quartile” (75th percentile) or “Quantile” normalization
  Different normalization methods give different results (different DE genes & different p-value rankings)
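A minimal sketch of upper-quartile scaling follows. It assumes that zero counts are excluded before taking the 75th percentile and that samples are rescaled to the mean upper quartile; both are common choices but are assumptions here, and the function name and counts are hypothetical.

```python
import statistics


def upper_quartile_factors(samples):
    """Return one scaling factor per sample from its 75th percentile.

    samples maps a sample name to a list of raw gene counts.
    Multiplying a sample's counts by its factor puts every sample's
    upper quartile at the same value.
    """
    upper = {}
    for name, counts in samples.items():
        expressed = [c for c in counts if c > 0]     # ignore zero counts
        # quantiles(n=4) returns the 25th, 50th, and 75th percentiles
        upper[name] = statistics.quantiles(expressed, n=4, method="inclusive")[2]
    mean_upper = statistics.mean(list(upper.values()))
    return {name: mean_upper / q for name, q in upper.items()}


# Hypothetical counts: sample A has one gene that dominates the read pool.
samples = {
    "A": [10, 20, 30, 40, 1000],
    "B": [20, 40, 60, 80, 100],
}
factors = upper_quartile_factors(samples)
```

The dominant gene in sample A inflates its total read count but leaves the 75th percentile untouched, so the other genes in A are scaled up rather than being pushed down by that one gene.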
55 Statistics of DE
  mRNA levels vary across cells, tissues, and organisms, and with time and treatment.
  As with microarrays, replicates are needed to separate biological variability from experimental variability
  If experimental variability is high, variance within replicates will be high and statistical significance for DE will be difficult to find.
  The best methods to discover DE are coupled with sophisticated approaches to normalization
  Best to ignore genes with very low expression: RPKM < 1
56 Popular DE Statistical Methods
  Cufflinks-Cuffdiff
    part of the TopHat software suite; easy to use
    uses FPKM normalization
    complex model for counting reads among splice variants
    can be set to ignore novel variants
    estimates variance in log fold change for each gene using permutations
    finds the most DE genes, high false positive rate
  edgeR
    requires raw count data, does its own normalization
    estimates standard deviation (dispersion) with a weighted combination of individual gene (gene-wise) and global measures
    statistical model is the negative binomial distribution (has a dispersion parameter)
    Fisher’s exact test (for 2-sample), or generalized linear model (complex design)
    acceptable tradeoff of sensitivity and specificity
  Many others: DESeq, SAMseq, baySeq.
  Many rather inconclusive benchmarking studies
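The role of the dispersion parameter can be seen from the negative binomial variance in its usual mean-dispersion form, variance = mean + dispersion × mean². The sketch below only evaluates that formula with hypothetical values; it is not edgeR code and does not estimate dispersion.

```python
def nb_variance(mean, dispersion):
    """Variance of a negative binomial count with the given mean.

    With dispersion = 0 this reduces to the Poisson case
    (variance = mean); larger dispersion adds extra variability
    that grows with the square of the mean.
    """
    return mean + dispersion * mean ** 2


# Hypothetical gene with a mean of 100 reads.
poisson_like = nb_variance(100, 0.0)
overdispersed = nb_variance(100, 0.1)
```

With a dispersion of 0.1, the variance is about 11 times the Poisson value, which is why replicates are needed to estimate it before testing for DE.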
61 Novel Genomes
  RNA-seq can be used to annotate genomes: gene discovery, exon mapping.
  Data from Desplan lab
62 Summary
  ChIP-seq experimental methods
  Transcription factors and epigenetics
  Alignment and data processing
  Finding peaks: MACS algorithm
  Annotation
  RNA-seq experimental methods
  Alignment challenges (splice sites)
  TopHat
  Counting reads per gene
  Normalization
  HTSeq-count and Cufflinks
  Statistics of differential expression for RNA-seq