wfleabase.org/docs/tileMEseq0905.pdf Notes and statistics on base-level expression. May 2009, Don Gilbert, Biology Dept., Indiana University


2007: Tile expression
DrosMel tiled by Affymetrix finds new genes (blue) and known genes (orange).

Precision improves ’06-’09
Measuring expression over gene structures: Nimblegen (’08) has higher precision than Affy (’06/’07), and RNA-Seq (’09) has higher precision than Nimblegen.

… microarray statistics for base level expression?

Gene or Base expression?
Base-level expression (tiles, RNA-seq): calculate like gene differential expression (DE).
Per tile, per RNA-seq contig, or per base: treatment - control.
Combine for tiles over a gene: technically independent observations, but biologically related. DF and power increase with longer genes.
How to combine?
As independent replicates: gene > (tiles, technical, bio replicates)?
As nested block: gene > tiles > replicates?
As gene average: gene = mean(tiles) > replicates?
Compare with gene-level stats …
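To make two of the combining options concrete, here is a minimal sketch, not the analysis used for these slides. It assumes a one-sample t-test on treatment - control differences; the tile values and the helper name `one_sample_t` are hypothetical. Pooling tiles as independent replicates raises the degrees of freedom, which is the power gain noted above, but it ignores that tiles of one gene are biologically related.

```python
import math
from statistics import mean, stdev

def one_sample_t(values):
    """Return (t statistic, degrees of freedom) for H0: mean(values) == 0."""
    n = len(values)
    se = stdev(values) / math.sqrt(n)
    return mean(values) / se, n - 1

# Hypothetical log-expression differences (treatment - control) for one gene:
# each row is a tile along the gene, each column is one of 3 replicates.
tile_diffs = [
    [0.9, 1.1, 1.0],
    [1.2, 0.8, 1.0],
    [0.7, 1.3, 1.0],
    [1.1, 0.9, 1.0],
]

# Option "independent replicates": pool every tile x replicate value.
pooled = [d for tile in tile_diffs for d in tile]
t_pooled, df_pooled = one_sample_t(pooled)

# Option "gene average": average the tiles within each replicate first,
# then test across the replicate means only.
per_replicate = [mean(column) for column in zip(*tile_diffs)]
t_gene, df_gene = one_sample_t(per_replicate)

print(f"pooled tiles: t={t_pooled:.2f}, df={df_pooled}")
print(f"gene average: t={t_gene:.2f}, df={df_gene}")
```

The nested-block option would need a mixed model (tiles nested within gene) and is left out of this sketch.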

Gene or Base expression?
Base-level tests find expression better than the gene average.
Base-level sensitivity = 42%, gene-level sensitivity = 38%; both have specificity = 37%.
Sensitivity = 1 - false rejection; Specificity = 1 - false discovery.
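The two rates can be computed from counts against a set of known genes. The sketch below follows the definitions on this slide; the function name `slide_rates` and the counts are hypothetical, not the data behind the 42% / 38% / 37% figures. Note that "1 - false discovery" as defined here is the quantity more often called precision in other texts.

```python
def slide_rates(true_pos, false_pos, false_neg):
    """Rates as defined on the slide, from counts against known genes.

    sensitivity = 1 - false rejection rate = TP / (TP + FN)
    specificity = 1 - false discovery rate = TP / (TP + FP)
    """
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_pos / (true_pos + false_pos)
    return sensitivity, specificity

# Hypothetical counts: 100 truly expressed regions, 40 of them detected,
# plus 60 detections that are not truly expressed.
sens, spec = slide_rates(true_pos=40, false_pos=60, false_neg=60)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```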

Gene or Base expression?
DE is consistent over the gene span even though average expression changes; a gene-level measure can miss this.
Figure: expression over gene span, treatment (red) vs control (green), with 3 replicates.

… gene structures & expression

Sequence normalizing?
The idea is to remove sequence (GC) effects on the probe hybridization score.
TileScope; Royce TE, Rozowsky JS, and Gerstein MB (2007). Assessing the need for sequence-based normalization in tiling microarray experiments. Bioinformatics, 23,

Sequence normalizing?
Sequence normalizing also removes the exon/intron signal! Don’t use it (TileScope’s quantilenorm), or other sequence adjustments of expression, unless gene structure signals are included.

Intron-Exon Detection
Nimblegen and Solexa tile/base expression detects gene structure fairly well, on average.

Intron-Exon Update
Newest RNA-Seq finds intron/exon very well (stranded RNA-Seq, modEncode Gingeras lab, March 2009).

Differential expression
The gene end (3’) has more expression, but the differential is constant over the gene span, on average.
Figure: green is treatment, red is control; line style shows 3 replicates of Daphnia tiled expression. Labels: example genes, introns, exons.

Diff. Expr. distributions
Introns show a null DE distribution; genes and TAR regions are wider. Use introns as baseline for statistics?
Figure panels: Genes, Introns, TARs; Pred, Sex, Metal.

… multiple testing corrections

Multiple statistic tests
Problem: perform 20,000 tests and the p-values run into the laws of chance. A result at Pr = 0.05 can happen 1,000 times by chance (false discovery, FDR).
DrosMel Affy line t-tests: 2,284,383 / 5,395,023 = 0.42 significant
Bonferroni (conservative): 0.03 significant
Benjamini & Hochberg, p.adjust(p, 'BH'): 0.35 significant
qvalue(p), distribution based: 0.41 significant
Storey JD and Tibshirani R. Statistical significance for genomewide studies. PNAS 100:
SAM permutation qvalue
However, p.adjust is meant for 100s of tests, not millions.
DrosMel modEncode case: 1900 pairwise Affy cell line (62 cells) DE comparisons x 14,000 genes = 26,600,000 t-tests
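The slides use R's p.adjust for these corrections. As an illustration of what the Bonferroni and Benjamini & Hochberg adjustments compute, here is a minimal pure-Python sketch; the function names and the four p-values are made up for the example, and this is not the code behind the fractions quoted above.

```python
def bonferroni(pvals):
    """Bonferroni: multiply each p-value by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini & Hochberg step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values from four tests.
pvals = [0.01, 0.02, 0.03, 0.04]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

alpha = 0.05
print("significant, raw:       ", sum(p < alpha for p in pvals))
print("significant, Bonferroni:", sum(p < alpha for p in bonf))
print("significant, BH:        ", sum(p < alpha for p in bh))
```

In this toy case all four raw p-values and all four BH-adjusted values fall under 0.05, while Bonferroni keeps only one, which mirrors the ordering of the fractions on the slide.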

Multiple DE tests: Daphnia
Much different corrections for experiments on the same genes.
Daphnia DE: 3 expts (treatment - control), genes, 3 replicates.
Predate and Metal genes have low expression, so they are important to detect.
Table columns: Sex, Predate, Metals
P< %P28310 %BH1900 %Qvalue2100 max P|Q1e-21e-4

Multiple statistic tests
“Statisticians have turned p-value corrections into an industry, but they are really more of a band-aid than a solution”*
What about false rejection (FRR; type II error)? Balance the errors; false rejection may be more important.
Solution #1: test fewer, directed hypotheses
Solution #2: measure error rate on knowns, e.g. prediction of “known” genes
Solution #3: known null hypothesis, e.g. introns
*
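One way to read Solution #3 is as an empirical p-value: compare each region's DE score with the scores observed in introns, which the slides treat as a null distribution. The sketch below is an assumption about how that could be done, not the method used here; the function name, the add-one convention, and all scores are hypothetical.

```python
def empirical_pvalue(score, null_scores):
    """Two-sided empirical p-value of `score` against a null sample.

    Counts null scores at least as extreme as `score`; the +1 in the
    numerator and denominator is one common convention that keeps the
    p-value above zero.
    """
    extreme = sum(1 for s in null_scores if abs(s) >= abs(score))
    return (extreme + 1) / (len(null_scores) + 1)

# Hypothetical intron DE scores (treatment - control), centered near zero.
intron_de = [-0.3, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.3]

p_strong = empirical_pvalue(1.0, intron_de)   # far outside the intron range
p_null = empirical_pvalue(0.0, intron_de)     # indistinguishable from introns
print(p_strong, p_null)
```

The smallest p-value this can return is 1 / (number of introns + 1), so a large intron set is needed before small thresholds become meaningful.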

1900 pairwise Affy cell line DE comparisons x 14,000 genes = 26,600,000 t-tests

Hypotheses of interest are fewer: ~100s cells x 14,000 genes ~ 2 million tests
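The test counts can be checked with a few lines of arithmetic. The 62 cell lines, 1900 comparisons, 14,000 genes and 2 million tests are the figures quoted in these slides; alpha = 0.05 and the variable names are assumptions for the example.

```python
from math import comb

cell_lines = 62          # cell lines quoted for the modEncode case
genes = 14_000
alpha = 0.05             # assumed per-test threshold

# All unordered pairs of cell lines; the slides round this to 1900.
pairwise = comb(cell_lines, 2)

# Total t-tests using the rounded figure from the slides.
all_tests = 1900 * genes

# Tests expected to pass p < alpha by chance if every null were true.
expected_by_chance = all_tests * alpha

# The directed-hypothesis count implies far fewer comparisons per gene.
directed_tests = 2_000_000
directed_comparisons = directed_tests / genes

print(pairwise, all_tests, round(expected_by_chance), round(directed_comparisons))
```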

Summary
1. Base-level expression (tiles, RNA-seq) measures gene expression better. It balances sensitivity (false rejection) with specificity (false discovery).
2. Base-level expression measures gene structures well, on average, and precision is improving for individual genes.
3. Multiple test corrections are needed but problematic. False discovery corrections for millions of tests lead to false rejections. Determine empirical error rates where possible.

End note
Summary pages:
wfleabase.org/genome-summaries/tile-expression/
insects.eugenes.org/species/data/dmel5/modencode/
Genome expression maps:
insects.eugenes.org:8091/gbrowse/cgi-bin/gbrowse/drosmelme/ : expression in 52 cell lines (Affy), and more precise Solexa & Nimblegen for a few cell lines
insects.eugenes.org:8091/gbrowse/cgi-bin/gbrowse/daphnia_pulex8/ : expression among 4 treatment groups (sex, metal stress, biotic predator); Nimblegen

Differential expression
Gene models miss much expression. Known sex genes capture DE, but unknown regions capture environmental stress expression, in Daphnia.