Microarray Type Analyses using Second Generation Sequencing
1 Microarray Type Analyses using Second Generation Sequencing
Adam B. Olshen, Helen Diller Comprehensive Cancer Center, UCSF Division of Biostatistics, 5/18/11
Spring 2011, BMI mini course - Statistical Methods for Array and Sequence Data
8 Normalization
Normalization is the process in which components of experiments are made comparable before statistical analysis.
It is as important in sequencing as it was in microarrays!
Two issues in normalization are differing sequencing depth (library size) and the distribution of reads (long right tails).
9 Simple RPKM Normalization
Proportion of reads: number of reads (n) mapping to an exon (gene) divided by the total number of reads (N), $n/N$.
RPKM: Reads Per Kilobase of exon (gene) per Million mapped sequence reads, $10^9 n/(NL)$, where L is the length of the transcriptional unit in bp (Mortazavi et al., Nat. Meth., 2008).
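A minimal sketch of the RPKM formula follows. The function name `rpkm` and the example numbers are illustrative assumptions, not taken from the slides.

```python
def rpkm(n, N, L):
    """Reads Per Kilobase per Million mapped reads: 10^9 * n / (N * L).

    n: reads mapping to the exon or gene
    N: total mapped reads in the library
    L: length of the transcriptional unit in bp
    """
    if N <= 0 or L <= 0:
        raise ValueError("N and L must be positive")
    return 1e9 * n / (N * L)


# Hypothetical example: 500 reads on a 2,000 bp gene in a library of 10 million reads.
example_rpkm = rpkm(500, 10_000_000, 2000)
print(example_rpkm)
```

Dividing by L corrects for longer genes collecting more reads, and dividing by N corrects for sequencing depth.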
12 TMM Thought Experiment
Suppose samples A and B are sequenced to the same depth, say 9000 reads.
90 genes are truly expressed at the same level in A and B.
10 genes are expressed at high levels in B but not in A, and no other genes are expressed.
Possible scenario:
All 90 genes get about 100 reads for A.
The first 90 genes for B get about 50 reads each, while the other 10 genes get about 450 reads each.
It would appear that the first 90 genes are expressed twice as highly in A as in B!
The reason for this result is that there is a fixed amount of sequencing real estate.
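The arithmetic of the thought experiment can be checked directly. This sketch uses the slide's counts; the variable names are illustrative.

```python
# Counts from the thought experiment: 100 genes, the last 10 expressed only in B.
counts_a = [100] * 90 + [0] * 10
counts_b = [50] * 90 + [450] * 10

depth_a = sum(counts_a)  # 9000 reads
depth_b = sum(counts_b)  # 9000 reads

# Proportion-of-reads normalization (n/N) for the 90 shared genes.
apparent_ratio = [
    (a / depth_a) / (b / depth_b)
    for a, b in zip(counts_a[:90], counts_b[:90])
]
print(depth_a, depth_b, apparent_ratio[0])
```

Both libraries have the same depth, yet every shared gene shows an apparent A/B ratio of 2, because the 10 highly expressed genes take up half of B's reads.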
14 TMM Solution
Trim off the genes with extreme M values.
Compute a scale factor from the remaining genes.
Others normalize by the 75th percentile (Bullard et al., BMC Bioinformatics, 2010).
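The two steps can be sketched as follows. This is a deliberately simplified version, not the published method: it assumes M is the per-gene log2 ratio of read proportions between the two samples, uses an unweighted mean, and trims only on M. The edgeR implementation also trims on average intensity and uses precision weights. The function name `tmm_scale_factor` and the `trim` default are assumptions for illustration.

```python
import math


def tmm_scale_factor(counts_ref, counts_obs, trim=0.3):
    """Simplified trimmed mean of M values (unweighted, M-trimming only)."""
    if not 0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    n_ref, n_obs = sum(counts_ref), sum(counts_obs)
    # M values are only defined for genes with reads in both samples.
    m_values = sorted(
        math.log2((o / n_obs) / (r / n_ref))
        for r, o in zip(counts_ref, counts_obs)
        if r > 0 and o > 0
    )
    k = int(len(m_values) * trim)  # number trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Counts from the TMM thought experiment (A as reference, B as observed).
counts_a = [100] * 90 + [0] * 10
counts_b = [50] * 90 + [450] * 10

factor_b = tmm_scale_factor(counts_a, counts_b)
effective_depth_b = sum(counts_b) * factor_b
print(factor_b, effective_depth_b)
```

With the thought-experiment counts the factor for B is 0.5, so B's effective library size becomes 4500 and 50/4500 matches A's 100/9000 for the shared genes.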
15 Differential Expression
We may want to test for differential expression between or among conditions, disease types, etc.
A parametric test is needed because there are few replicates (often 2 or 3 these days).
In a parametric test a statistical distribution (such as Gaussian) is assumed for the test statistic, unlike nonparametric tests, where ranks are used.
16 Methods Based on Counts
For microarrays, Gaussian-based methods are most common.
Because sequencing data are counts, statistical distributions for discrete data are used.
Relevant distributions are:
Binomial distribution
Poisson distribution
Negative binomial distribution
17 Poisson Distribution
The Poisson probability mass function is $\Pr(N) = e^{-\lambda} \lambda^N / N!$, for rate parameter $\lambda$.
The mean and variance of a Poisson random variable are the same: $\lambda$.
The consensus is that this model is appropriate for technical replicates but that biological replicates have extra variability.
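As a small numerical check of the mean-equals-variance property, the sketch below evaluates the probability mass function and sums over a truncated support. The rate 4.0 and the cutoff of 100 are arbitrary illustrative choices.

```python
import math


def poisson_pmf(n, lam):
    """Pr(N = n) = exp(-lam) * lam**n / n!, computed on the log scale."""
    return math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))


lam = 4.0
support = range(100)  # truncation; the tail beyond 100 is negligible for lam = 4
probs = [poisson_pmf(n, lam) for n in support]

pois_mean = sum(n * p for n, p in zip(support, probs))
pois_var = sum((n - pois_mean) ** 2 * p for n, p in zip(support, probs))
print(pois_mean, pois_var)
```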
18 Negative Binomial Distribution
The negative binomial (NB) distribution is commonly used when count data have variance substantially greater than the mean (overdispersed).
The NB distribution has mean $\lambda$ and variance $\lambda + \phi\lambda$; as $\phi$ goes to 0 it reduces to a Poisson.
It is used to model biological replicates.
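The overdispersion and the Poisson limit can be illustrated numerically. The sketch below takes the mean and variance directly and uses the variance as written on the slide, $\lambda + \phi\lambda$. Some references, including the edgeR papers, write the variance as $\lambda + \phi\lambda^2$ instead; either convention can be plugged in. Function names and parameter values are illustrative assumptions.

```python
import math


def nb_pmf(k, mean, variance):
    """Negative binomial pmf parameterized by mean and variance (variance > mean)."""
    if variance <= mean:
        raise ValueError("negative binomial requires variance > mean")
    p = mean / variance                  # success probability
    r = mean * mean / (variance - mean)  # size parameter
    log_pmf = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


lam, phi = 10.0, 0.5
support = range(200)  # truncation; the tail is negligible for these values

# Overdispersed case: variance = lam + phi * lam = 15.
probs = [nb_pmf(k, lam, lam + phi * lam) for k in support]
nb_mean = sum(k * q for k, q in zip(support, probs))
nb_var = sum((k - nb_mean) ** 2 * q for k, q in zip(support, probs))

# Near-Poisson case: a very small dispersion.
small_phi = 1e-4
max_gap = max(
    abs(nb_pmf(k, lam, lam + small_phi * lam) - poisson_pmf(k, lam))
    for k in support
)
print(nb_mean, nb_var, max_gap)
```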
19 Negative Binomial Methods
Different dispersion ($\phi$) for every gene: not enough data to estimate this.
Common dispersion (Robinson and Smyth, Biostatistics, 2008): good, but does not include any gene-level variability.
Moderated dispersion (Robinson and Smyth, Bioinformatics, 2007): best, but hard to weight gene-level vs. common dispersion.
20 The Test
Say there are two classes, A and B, with counts for gene g of $Z_{gA}$ and $Z_{gB}$.
Model the counts as NB, taking into account the number of libraries sequenced, the size of those libraries, and the NB parameters $\lambda$ and $\phi$.
Test whether $Z_{gA}$ and $Z_{gB}$ are significantly different, conditional on the total $Z_{gA} + Z_{gB}$.
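To make the conditioning idea concrete, here is a sketch of the Poisson special case only (the limit as $\phi$ goes to 0), not the NB test itself. Under the Poisson model with equal expression, $Z_{gA}$ given the total is binomial with success probability equal to A's share of the total library size. The function name, the library-size arguments, and the two-sided rule (summing outcomes no more probable than the observed one) are assumptions for illustration.

```python
from math import comb


def conditional_poisson_test(z_a, z_b, lib_a, lib_b):
    """Two-sided exact p-value for z_a given z_a + z_b, Poisson special case."""
    total = z_a + z_b
    p = lib_a / (lib_a + lib_b)  # expected share of reads from class A under the null
    probs = [comb(total, k) * p ** k * (1 - p) ** (total - k)
             for k in range(total + 1)]
    observed = probs[z_a]
    # Sum all outcomes at least as extreme (no more probable than the observed one).
    p_value = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Hypothetical counts with equal library sizes.
p_balanced = conditional_poisson_test(10, 10, 1_000_000, 1_000_000)
p_extreme = conditional_poisson_test(20, 0, 1_000_000, 1_000_000)
print(p_balanced, p_extreme)
```

The NB version replaces the binomial conditional distribution with one derived from NB counts, so that biological variability widens the null distribution.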
21 edgeR - Robinson’s Methods
R package
Normalization
DE
22 RNA Seq vs Microarrays
Mortazavi et al., Nature Methods, 2008
24 Copy Number by Sequencing Shen and Zhang, Stanford Statistics Technical Report
25 Complications of Copy Number by Sequencing
Over what region should copy number be sampled? Microarrays sample at a fixed number of probes/SNPs.
Coverage is highly variable.
Potentially, a huge amount of computation.
26 Copy Number by Sequencing
Let $\mu_t$ be the rate of a non-homogeneous Poisson process generating read counts from a case.
Let $\lambda_t$ be the rate of a non-homogeneous Poisson process generating read counts from a control.
Let $p(t) = \mu_t/(\mu_t + \lambda_t)$.
Look for changes in $p(t)$.
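One crude way to look at $p(t)$ is to estimate it in fixed windows as the fraction of reads that come from the case. This is only an illustrative sketch: it is not the change-point method of Shen and Zhang, and the function name, window width, and read positions are made up for the example.

```python
def windowed_case_fraction(case_positions, control_positions, start, end, width):
    """Estimate p(t) per window as case reads / (case reads + control reads)."""
    n_bins = (end - start + width - 1) // width
    case = [0] * n_bins
    control = [0] * n_bins
    for pos in case_positions:
        if start <= pos < end:
            case[(pos - start) // width] += 1
    for pos in control_positions:
        if start <= pos < end:
            control[(pos - start) // width] += 1
    # Windows with no reads at all have no estimate.
    return [c / (c + k) if c + k > 0 else None for c, k in zip(case, control)]


# Hypothetical read start positions on a 200 bp region.
case_reads = [5, 15, 25, 105, 115, 125, 135, 145, 155]
control_reads = [10, 20, 30, 110, 120, 130]

p_hat = windowed_case_fraction(case_reads, control_reads, 0, 200, 100)
print(p_hat)
```

A shift in the estimate between neighboring windows is the kind of change in $p(t)$ the slide refers to. The choice of window is exactly the "over what region" complication raised on the previous slide.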
30 What is Methylation?
~70% of CpGs are methylated in mammals; CpGs are relatively rare.
A small fraction of the genome, CpG islands, shows near the expected CpG frequency.
31 CpG islands often overlap promoters (sites of transcriptional initiation)
Definition of a CpG island (Gardiner-Garden M, Frommer M. CpG islands in vertebrate genomes. J. Mol. Biol Jul 20;196(2):261-82):
1. GC content of 50% or greater
2. Length greater than 200 bp
3. Ratio greater than 0.6 of observed number of CG dinucleotides to the expected number
Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G), where N = length of sequence.
The promoters of genes often overlap with a type of genomic sequence called a CpG island. Most of the genome has been depleted of CpGs because when the C is methylated, it can easily mutate. However, you can find clusters of CpGs, and they really stand out as landmarks in the genome. There are several definitions. Here is the one we are using. …. At the bottom is the gene GAPDH, which is a classic housekeeping gene. The arrows show it is being transcribed from left to right. Each exon is shown as a box. The intervening lines are introns. At the top is the genomic position on chromosome 12. Each nucleotide is numbered. The scale bar is 2 kilobases. You can see the GC percent and individual CpG dinucleotides. The beginning, or 5’ end, of GAPDH overlaps with a region with high GC and many CpGs. It’s about 1 kb in length, and meets the requirements for a CGI. However, CGIs can occur in other locations too: in gene bodies, or in regions where no genes are known.
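The three criteria translate directly into code. The sketch below applies them to a whole candidate sequence; the function names are illustrative, and a real island finder would also need a sliding-window scan, which is not shown.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    n_c = seq.count("C")
    n_g = seq.count("G")
    n_cpg = seq.count("CG")  # CG dinucleotides cannot overlap each other
    gc_content = (n_c + n_g) / n if n else 0.0
    # Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
    obs_exp = n_cpg * n / (n_c * n_g) if n_c and n_g else 0.0
    return gc_content, obs_exp


def is_cpg_island(seq):
    """Apply the three criteria from the slide to the whole sequence."""
    gc_content, obs_exp = cpg_stats(seq)
    return gc_content >= 0.5 and len(seq) > 200 and obs_exp > 0.6


# Toy sequences, chosen only to exercise each criterion.
cg_rich = "CG" * 150      # 300 bp, all CpG
at_rich = "AT" * 150      # 300 bp, no C or G
too_short = "CG" * 100    # exactly 200 bp, fails the length criterion
print(is_cpg_island(cg_rich), is_cpg_island(at_rich), is_cpg_island(too_short))
```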
32 The DNA Methylome: 28,848,753 CpG sites (Rollins et al., 2006)
33 Current methods for genome-wide DNA methylation analysis
Bisulfite sequencing
Antibody- or affinity-based enrichment
Methyl-sensitive restriction enzymes
Limitations:
1. Only a small number of the ~28 M CpGs can be interrogated (no longer true!)
2. Difficult to analyze repetitive sequences
There are 3 common methods that are currently used to study DNA methylation on a genome-wide level, but these have limitations that have prevented analyses from being truly genome-wide. In bisulfite sequencing, chemical treatment converts unmethylated Cs to Us, but methylated Cs are protected. Methyl-sensitive restriction enzymes cut DNA based on the methylation status of a CpG site within the recognition sequence. Enrichment methods use an antibody against methylcytosine, or affinity purification with protein domains that are known to bind to methylated DNA. In their existing forms, these are all limited in their coverage of the 28 M CpG sites present in the haploid human genome. Also, it is difficult to analyze repetitive sequences, where about half of DNA methylation occurs, and this is especially a problem with array-based approaches. We wanted to overcome some of these limitations by leveraging advancements in DNA sequencing technology.
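The bisulfite conversion rule described in the notes can be mimicked in silico. This is a toy sketch under stated assumptions: the function name, the example sequence, and the choice of methylated position are hypothetical, and strand handling and incomplete conversion are ignored.

```python
def bisulfite_convert(seq, methylated_positions):
    """Convert unmethylated C to U; methylated C (by 0-based index) is protected."""
    methylated = set(methylated_positions)
    return "".join(
        "U" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


# Hypothetical 8 bp fragment with a methylated C at index 1.
original = "ACGTCCGA"
converted = bisulfite_convert(original, {1})

# Positions that are still C after conversion are inferred to be methylated.
inferred = [i for i, (a, b) in enumerate(zip(original, converted))
            if a == "C" and b == "C"]
print(converted, inferred)
```

After amplification the U bases are read as T, so methylation is inferred by comparing each read with the reference at C positions.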
34 Bisulfite Sequencing
Xi and Li, BMC Bioinformatics, 2009
35 Enrichment and Restriction Enzymes
Methyl DNA immunoprecipitation - sequencing (MeDIP-seq): higher read density at methylated regions.
Methyl-sensitive restriction enzyme - sequencing (MRE-seq): each read is a single unmethylated CpG site.
[Figure labels: 5MeC, MeDIP-seq, MRE-seq, MRE digestion]
36 Methylome Methods Comparison
Shotgun bisulfite: base resolution; absolute quantitation; higher cost/sample.
Enrichment: 150 bp resolution; relative quantitation; much lower cost/sample.
Restriction enzymes: low resolution; can be combined with enrichment methods.
38 Things Learned from Whole Genome Methylation Studies
Maunakea et al., Nature, 2010
CpG islands in 5’ promoter regions are almost never methylated, while those in intragenic regions can be.
Methylation of intragenic regions appears to involve alternative promoters.
39 Things Learned from Methylation Studies of Cancer
Aberrant methylation of promoter CpG islands can lead to gene silencing (before microarrays).
More soon!
41 Whole Genome Methylation Data is Very Difficult to Analyze!
What is the proper scale: CpG level, or bin level (how many bins?)
Adjacent CpGs or bins are correlated, but not as correlated as copy number, where regional segmentation is possible.
P-values from testing differences between conditions are correlated.
Huge multiple comparisons problem (28 M CpGs).
Come back next year for methods discussion.
44 References
Mortazavi et al., Nat. Meth.
Robinson and Oshlack, Genome Biology
Bullard et al., BMC Bioinformatics
Robinson and Smyth, Biostatistics
Robinson and Smyth, Bioinformatics
Robinson et al., Bioinformatics
Shen and Zhang, Stanford Statistics Technical Report
Gardiner-Garden and Frommer, J. Mol. Biol.
Rollins, Genome Res.
Xi and Li, BMC Bioinformatics
Harris et al., Nature Biotechnology
Maunakea et al., Nature