Microarray Type Analyses using Second Generation Sequencing

Presentation transcript:

Microarray Type Analyses using Second Generation Sequencing Adam B. Olshen Helen Diller Comprehensive Cancer Center UCSF Division of Biostatistics 5/18/11 Spring 2011, BMI mini course - Statistical Methods for Array and Sequence Data

Outline RNA DNA Methylation

RNA

RNA Sequencing Pipeline Steps, in order: experiment; millions of short reads; map reads; summarize counts; normalize counts; test for differential expression; analyze gene list. (Stolen from D. McCarthy via Terry Speed.)

Mapped Reads

Summarizing Counts Counts are typically binned to annotated exons, genes, or transcripts. Summarizing to unannotated regions is more difficult.

Summarized Counts

Normalization Normalization is the process by which components of experiments are made comparable before statistical analysis. It is as important in sequencing as it was in microarrays! Two issues in normalization are differences in sequencing depth (library size) and the distribution of reads (long right tails).

Simple RPKM Normalization Proportion of reads: the number of reads (n) mapping to an exon (gene) divided by the total number of reads (N), n/N. RPKM: Reads Per Kilobase of exon (gene) per Million mapped sequence reads, $10^9 n/(NL)$, where L is the length of the transcriptional unit in bp (Mortazavi et al., Nat. Meth., 2008).
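The RPKM formula can be sketched in a few lines of Python. The function name `rpkm` and the example numbers (500 reads, a 2,000 bp gene, 10 million mapped reads) are illustrative assumptions, not values from the slides.

```python
def rpkm(n, total_reads, length_bp):
    """Reads per kilobase per million mapped reads: 10^9 * n / (N * L)."""
    if total_reads <= 0 or length_bp <= 0:
        raise ValueError("total_reads and length_bp must be positive")
    return 1e9 * n / (total_reads * length_bp)


# Hypothetical example: 500 reads on a 2,000 bp gene, 10 million mapped reads.
example_rpkm = rpkm(500, 10_000_000, 2_000)
print(example_rpkm)
```

Dividing by L puts long and short genes on the same scale, and dividing by N does the same for libraries of different depth.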

Summarized Counts

TMM Normalization

TMM Thought Experiment Suppose samples A and B are sequenced to the same depth, say 9000 reads. 90 genes are truly expressed at the same level in A and B. 10 genes are expressed at high levels in B but not in A, and no other genes are expressed. Possible scenario: all 90 genes get about 100 reads for A; the first 90 genes for B get about 50 reads each, while the other 10 genes get about 450 reads each. It would appear that the first 90 genes are expressed twice as high in A as in B! The reason for this result is that there is a fixed amount of sequencing real estate.
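The arithmetic of the thought experiment can be checked directly. This is a minimal sketch using the read counts stated on the slide; the variable names are illustrative.

```python
# Counts from the thought experiment: 100 genes, 9000 reads per sample.
counts_a = [100] * 90 + [0] * 10    # 90 shared genes, nothing else expressed
counts_b = [50] * 90 + [450] * 10   # same 90 genes plus 10 highly expressed genes

depth_a = sum(counts_a)
depth_b = sum(counts_b)

# Proportion of reads for the first shared gene in each sample.
prop_a = counts_a[0] / depth_a
prop_b = counts_b[0] / depth_b

# Apparent fold change of a gene that is truly unchanged.
apparent_ratio = prop_a / prop_b
print(depth_a, depth_b, apparent_ratio)
```

Both libraries have the same depth, so scaling by library size alone cannot remove the apparent two-fold difference.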

TMM Example

TMM Solution Trim off the genes with extreme M values (log expression ratios between samples). Compute a scale factor from the remaining genes. Others normalize by the 75th percentile (Bullard et al., BMC Bioinformatics, 2010).
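A simplified sketch of the trimming idea follows. It is not the full TMM procedure: it assumes an unweighted mean, trims only on the log ratios, and uses a hypothetical trim fraction of 0.3 per tail. The function name `trimmed_mean_factor` is an assumption for illustration.

```python
import math


def trimmed_mean_factor(sample, ref, trim=0.3):
    """Scale factor for `sample` relative to `ref` from a trimmed mean of log ratios.

    Simplified: unweighted, trims only on the log ratios, and skips genes
    with a zero count in either sample.
    """
    n_s, n_r = sum(sample), sum(ref)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)            # number trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    if not kept:
        raise ValueError("no genes left after trimming")
    return 2 ** (sum(kept) / len(kept))


# Thought-experiment counts: B has 10 extra highly expressed genes.
counts_a = [100] * 90 + [0] * 10
counts_b = [50] * 90 + [450] * 10

factor_b = trimmed_mean_factor(counts_b, counts_a)
effective_depth_b = sum(counts_b) * factor_b
print(factor_b, effective_depth_b)
```

With the effective depth of B reduced to about 4500, the 90 shared genes have the same proportion of reads in both samples (50/4500 versus 100/9000).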

Differential Expression We may want to test for differential expression between or among conditions, disease types, etc. A parametric test is needed because there are few replicates (often 2 or 3 these days). In a parametric test a statistical distribution (such as the Gaussian) is assumed for the test statistic, unlike nonparametric tests, where ranks are used.

Methods Based on Counts For microarrays, Gaussian-based methods are most common. Because sequencing data are counts, statistical distributions for discrete data are used. Relevant distributions are the binomial, Poisson, and negative binomial distributions.

Poisson Distribution The Poisson probability mass function is $\Pr(N) = e^{-\lambda} \lambda^N / N!$, for rate parameter λ. The mean and variance of a Poisson random variable are the same: λ. The consensus is that this model is appropriate for technical replicates but that biological replicates have extra variability.
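The equality of mean and variance can be verified numerically from the probability mass function. This is a minimal sketch; the rate λ = 4 and the truncation of the support at 60 are illustrative assumptions.

```python
import math


def poisson_pmf(n, lam):
    """Pr(N = n) = exp(-lam) * lam**n / n! for a Poisson random variable."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


lam = 4.0
support = range(60)  # the tail beyond 60 is negligible for lam = 4

poisson_mean = sum(n * poisson_pmf(n, lam) for n in support)
poisson_var = sum((n - poisson_mean) ** 2 * poisson_pmf(n, lam) for n in support)
print(poisson_mean, poisson_var)
```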

Negative Binomial Distribution The negative binomial (NB) distribution is commonly used when count data have variance significantly greater than the mean (overdispersion). The NB distribution has mean λ and variance $\lambda + \phi\lambda^2$; as the dispersion φ goes to 0 it reduces to a Poisson. It is used to model biological replicates.
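The relationship between mean, dispersion, and variance can be checked numerically. This sketch assumes the parameterization used by Robinson and Smyth, in which the variance is λ + φλ² (size parameter 1/φ); the values λ = 10 and φ = 0.2 are illustrative.

```python
import math


def nb_pmf(k, lam, phi):
    """NB probability of count k with mean lam and dispersion phi.

    Assumes variance = lam + phi * lam**2, i.e. size r = 1/phi and
    success probability p = r / (r + lam).
    """
    r = 1.0 / phi
    p = r / (r + lam)
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


lam, phi = 10.0, 0.2
support = range(1000)  # the tail beyond 1000 is negligible for these values

nb_mean = sum(k * nb_pmf(k, lam, phi) for k in support)
nb_var = sum((k - nb_mean) ** 2 * nb_pmf(k, lam, phi) for k in support)
print(nb_mean, nb_var)  # variance exceeds the Poisson value of 10
```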

Negative Binomial Methods Different dispersion (φ) for every gene: there is not enough data to estimate this. Common dispersion (Robinson and Smyth, Biostatistics, 2008): good, but does not include any gene-level variability. Moderated dispersion (Robinson and Smyth, Bioinformatics, 2007): best, but it is hard to decide how to weight gene-level versus common dispersion.

The Test Say there are two classes, A and B, with counts for gene g of $Z_{gA}$ and $Z_{gB}$. Model the counts as NB, taking into account the number of libraries sequenced, the size of those libraries, and the NB parameters λ and φ. Test whether $Z_{gA}$ and $Z_{gB}$ are significantly different, conditional on the total $Z_{gA} + Z_{gB}$.
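The conditioning idea is easiest to see in the Poisson limit (φ going to 0), which is a simplification and not the NB exact test itself. Under the null hypothesis, and given the total count, the count in class A is binomial with success probability equal to A's share of the total library size. The function name `conditional_test` and the example counts are assumptions for illustration.

```python
from math import comb


def conditional_test(z_a, z_b, lib_a, lib_b):
    """Two-sided exact p-value for z_a given the total z_a + z_b.

    Poisson-limit sketch: under the null, z_a | total ~ Binomial(total, p)
    with p = lib_a / (lib_a + lib_b).
    """
    total = z_a + z_b
    p = lib_a / (lib_a + lib_b)
    probs = [comb(total, k) * p ** k * (1 - p) ** (total - k)
             for k in range(total + 1)]
    observed = probs[z_a]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical counts for one gene in two equally sized libraries.
p_balanced = conditional_test(10, 10, 1_000_000, 1_000_000)
p_skewed = conditional_test(30, 5, 1_000_000, 1_000_000)
print(p_balanced, p_skewed)
```

With overdispersed (NB) counts the same conditioning is used, but the conditional distribution is wider than the binomial, so this sketch would overstate significance for biological replicates.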

edgeR: R package implementing Robinson's methods for normalization and differential expression (DE)

RNA Seq vs Microarrays Mortazavi et al., Nature Methods, 2008

DNA

Copy Number by Sequencing Shen and Zhang, Stanford Statistics Technical Report

Complications of Copy Number by Sequencing Over what region should copy number be sampled? Microarrays sample at a fixed number of probes/SNPs. Coverage is highly variable. Potentially, a huge amount of computation is required.

Copy Number by Sequencing Let $\mu_t$ be the rate at position t of a non-homogeneous Poisson process generating read counts from a case. Let $\lambda_t$ be the rate at position t of a non-homogeneous Poisson process generating read counts from a control. Let $p(t) = \mu_t / (\mu_t + \lambda_t)$. Look for changes in p(t).
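A crude way to look at p(t) is to estimate it in fixed windows as the fraction of reads that come from the case. This is only an illustrative sketch and not the Shen and Zhang segmentation method; the function name, window size, and read positions are all hypothetical.

```python
def case_fraction(case_positions, control_positions, window, genome_length):
    """Estimate p(t) per window as case reads / (case reads + control reads)."""
    n_bins = -(-genome_length // window)  # ceiling division
    case = [0] * n_bins
    control = [0] * n_bins
    for pos in case_positions:
        case[pos // window] += 1
    for pos in control_positions:
        control[pos // window] += 1
    # None marks windows with no reads, where p(t) cannot be estimated.
    return [c / (c + d) if c + d > 0 else None for c, d in zip(case, control)]


# Hypothetical read start positions on a 200 bp region.
case_reads = [5, 15, 25, 105, 115, 125, 135, 145, 155]
control_reads = [10, 20, 30, 110, 120, 130]

fractions = case_fraction(case_reads, control_reads, window=100, genome_length=200)
print(fractions)  # a jump between windows suggests a copy number change
```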

Copy Number by Sequencing

SeqCBS Software The method of Shen and Zhang (Stanford Statistics Technical Report) for segmenting sequencing data is called SeqCBS An R package for doing the analysis can be found at CRAN (http://cran.r-project.org/)

Methylation

What is Methylation? About 70% of CpGs are methylated in mammals; CpGs are relatively rare. A small fraction of the genome, the CpG islands, shows close to the expected CpG frequency.

CpG islands often overlap promoters (sites of transcriptional initiation). Definition of a CpG island (Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J. Mol. Biol. 1987 Jul 20;196(2):261-82): 1. GC content of 50% or greater; 2. length greater than 200 bp; 3. ratio greater than 0.6 of the observed number of CG dinucleotides to the expected number, where Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G) and N = length of sequence. The promoters of genes often overlap with a type of genomic sequence called a CpG island. Most of the genome has been depleted of CpGs because when the C is methylated, it can easily mutate. However, you can find clusters of CpGs, and they really stand out as landmarks in the genome. There are several definitions. Here is the one we are using. …. At the bottom is the gene GAPDH, which is a classic housekeeping gene. The arrows show it is being transcribed from left to right. Each exon is shown as a box. The intervening lines are introns. At the top is the genomic position on chromosome 12. Each nucleotide is numbered. The scale bar is 2 kilobases. You can see the GC percent and individual CpG dinucleotides. The beginning, or 5’ end, of GAPDH overlaps with a region with high GC content and many CpGs. It is about 1 kb in length and meets the requirements for a CpG island (CGI). However, CGIs can occur in other locations too: in gene bodies, or in regions where no genes are known.
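The three criteria can be applied to a single sequence as follows. This is a minimal sketch that tests one whole sequence rather than scanning a genome with a sliding window; the function names and the toy sequences are assumptions for illustration.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CG dinucleotides cannot overlap one another
    gc_content = (c + g) / n
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def is_cpg_island(seq):
    """Apply the criteria above: GC >= 50%, length > 200 bp, Obs/Exp > 0.6."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) > 200 and gc_content >= 0.5 and obs_exp > 0.6


# Toy sequences, not real genomic data.
cpg_rich = "CG" * 150   # 300 bp, all CpG
at_rich = "AT" * 150    # 300 bp, no C or G
too_short = "CG" * 50   # 100 bp

print(is_cpg_island(cpg_rich), is_cpg_island(at_rich), is_cpg_island(too_short))
```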

The DNA Methylome: 28,848,753 CpG sites (Rollins et al., 2006)

Current methods for genome-wide DNA methylation analysis: bisulfite sequencing; antibody- or affinity-based enrichment; methyl-sensitive restriction enzymes. Limitations: 1. only a small number of the ~28 M CpGs can be interrogated (no longer true!); 2. it is difficult to analyze repetitive sequences. There are 3 common methods that are currently used to study DNA methylation on a genome-wide level, but these have limitations that have prevented analyses from being truly genome-wide. In bisulfite sequencing, chemical treatment converts unmethylated Cs to Us, but methylated Cs are protected. Methyl-sensitive restriction enzymes cut DNA based on the methylation status of a CpG site within the recognition sequence. Enrichment uses either an antibody against methylcytosine or affinity purification with protein domains that are known to bind methylated DNA. In their existing forms, these are all limited in their coverage of the 28 M CpG sites present in the haploid human genome. Also, it is difficult to analyze repetitive sequences, where about half of DNA methylation occurs, and this is especially a problem with array-based approaches. We wanted to overcome some of these limitations by leveraging advancements in DNA sequencing technology.
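The effect of bisulfite treatment on a read can be simulated in a few lines. This sketch assumes that converted Us are read as Ts after amplification and sequencing, and it models only the forward strand; the function name, the toy sequence, and the methylated position are hypothetical.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite conversion of one strand.

    Unmethylated C becomes U, which is read as T after amplification;
    methylated C (0-based positions in `methylated_positions`) is protected.
    """
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


# Toy sequence with two CpG sites; only the C at position 1 is methylated.
read = bisulfite_convert("ACGTCGA", {1})
print(read)
```

Comparing the converted read with the reference then reveals methylation status: a C that remains a C was methylated, and a C that reads as T was not.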

Bisulfite Sequencing Xi and Li, BMC Bioinformatics, 2009

Enrichment and Restriction Enzymes Methyl DNA immunoprecipitation sequencing (MeDIP-seq): higher read density at methylated regions. Methyl-sensitive restriction enzyme sequencing (MRE-seq): each read is a single unmethylated CpG site. [Figure: 5MeC, MeDIP-seq, MRE-seq, and MRE digestion]

Methylome Methods Comparison Shotgun bisulfite: base resolution, absolute quantitation, higher cost/sample. Enrichment: 150 bp resolution, relative quantitation, much lower cost/sample. Restriction enzymes: low resolution, can be combined with enrichment methods.

Comparison of MethylC, RRBS, MeDIP, MBD Harris et al., NIH Roadmap Epigenome Consortium, Nature Biotechnology, Oct 2010

Things Learned from Whole Genome Methylation Studies Maunakea et al., Nature, 2010. CpG islands in 5’ promoter regions are almost never methylated, while those in intragenic regions can be. Methylation of intragenic regions appears to involve alternative promoters.

Things Learned from Methylation Studies of Cancer Aberrant methylation of promoter CpG islands can lead to gene silencing (before microarrays) More soon!

Methylation in GBM

Whole Genome Methylation Data is Very Difficult to Analyze! What is the proper scale: CpG level or bin level (and how many bins)? Adjacent CpGs or bins are correlated, but not as correlated as copy number, where regional segmentation is possible. P-values from testing differences between conditions are correlated. There is a huge multiple comparisons problem (28 M CpGs). Come back next year for methods discussion.

Methylation and Copy Number

References
Mortazavi et al., Nat. Meth., 2008. http://www.nature.com/nmeth/journal/v5/n7/full/nmeth.1226.html
Robinson and Oshlack, Genome Biology, 2010. http://genomebiology.com/2010/11/3/R25
Bullard et al., BMC Bioinformatics, 2010. http://www.biomedcentral.com/1471-2105/11/94
Robinson and Smyth, Biostatistics, 2008. http://biostatistics.oxfordjournals.org/content/9/2/321.short
Robinson and Smyth, Bioinformatics, 2007. http://bioinformatics.oxfordjournals.org/content/23/21/2881.full
Robinson et al., Bioinformatics, 2010. http://bioinformatics.oxfordjournals.org/content/26/1/139
Shen and Zhang, Stanford Statistics Technical Report, 2011. http://statistics.stanford.edu/~ckirby/techreports/BIO/BIO%20257.pdf
Gardiner-Garden and Frommer, J. Mol. Biol., 1987. http://www.ncbi.nlm.nih.gov/pubmed/3656447
Rollins, Genome Res., 2006. http://www.ncbi.nlm.nih.gov/pubmed/16365381
Xi and Li, BMC Bioinformatics, 2009. http://www.biomedcentral.com/1471-2105/10/232
Harris et al., Nature Biotechnology, 2010. http://www.ncbi.nlm.nih.gov/pubmed/20852635
Maunakea et al., Nature, 2010. http://www.ncbi.nlm.nih.gov/pubmed/20613842