“Hotspot” algorithm (chr5:131,975,056-132,012,092). Idea: gauge enrichment of tags relative to a local background model based on the number of tags in a surrounding 50kb window.

Presentation transcript:

“Hotspot” algorithm. Idea: gauge enrichment of tags relative to a local background model based on the number of tags in a surrounding 50kb window. [Figure: tag density and called hotspots at chr5:131,975,056-132,012,092; hotspot height = score.]

“Hotspot” algorithm. Enrichment is measured as a z-score based on a binomial null model. Consider a 250 bp window containing n tags, centered within a 50kb window containing N tags (adjusted for uniquely mapping bases). Each tag in the large window is considered an “experiment,” with probability of success (landing in the smaller window) $p$, taken here as the ratio of the window sizes, $p = 250/50{,}000$. Given N tags in the large window, the expected number of tags in the smaller window is $\mu = Np$.

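“Hotspot” algorithm. Let n be the number of tags in the 250 bp window, N the number of tags in the 50kb window, and p the probability that a tag in the large window lands in the smaller one. Given N tags in the large window, the expected number of tags in the smaller window is $\mu = Np$. The standard deviation of the number of tags in the smaller window is $\sigma = \sqrt{Np(1-p)}$, and the z-score for the observed number of tags in the smaller window is $z = (n - \mu)/\sigma$.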
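A minimal sketch of the z-score calculation, assuming the standard binomial mean Np and standard deviation sqrt(Np(1-p)), with p taken as the plain ratio of window sizes. The function name `hotspot_zscore` and its parameters are illustrative, and the adjustment for uniquely mapping bases is not modeled.

```python
import math


def hotspot_zscore(n, N, small_bp=250, large_bp=50_000):
    """z-score of n tags in the small window, given N tags in the large window.

    Assumption: p is the raw ratio of window sizes; no mappability adjustment.
    """
    p = small_bp / large_bp              # probability a tag lands in the small window
    mu = N * p                           # expected number of tags in the small window
    sigma = math.sqrt(N * p * (1 - p))   # binomial standard deviation
    if sigma == 0:                       # no tags in the large window
        return 0.0
    return (n - mu) / sigma


# Example: 10 tags in the 250 bp window, 400 tags in the 50kb window.
print(hotspot_zscore(n=10, N=400))
```

With 400 tags in the large window, about 2 tags are expected in the small window, so observing 10 gives a z-score well above the threshold of 2 used for hotspot calling.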

“Hotspot” algorithm. Each tag gets a z-score for the 250bp and 50kb windows centered on it. A hotspot is a succession of tags within a 250bp window, each of whose z-scores is greater than 2. The hotspot is scored with the z-score for the 250bp window centered on those tags.
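The per-tag scoring and grouping could be sketched as below for a single chromosome. This is one possible reading, not the presenters' implementation: consecutive tags with z > 2 are merged while each lies within 250 bp of the previous qualifying tag. The names `tag_zscores`, `call_hotspots`, and `max_gap` are hypothetical, mappability is ignored, and the final hotspot scoring step is omitted.

```python
import math
from bisect import bisect_left, bisect_right


def tag_zscores(tags, small_bp=250, large_bp=50_000):
    """z-score for each tag, using windows centered on it.

    `tags` must be a sorted list of positions on one chromosome.
    """
    p = small_bp / large_bp
    scores = []
    for pos in tags:
        # Count tags in the small and large windows centered on this tag.
        n = bisect_right(tags, pos + small_bp // 2) - bisect_left(tags, pos - small_bp // 2)
        N = bisect_right(tags, pos + large_bp // 2) - bisect_left(tags, pos - large_bp // 2)
        sigma = math.sqrt(N * p * (1 - p))  # N >= 1 because the tag counts itself
        scores.append((n - N * p) / sigma)
    return scores


def call_hotspots(tags, threshold=2.0, max_gap=250):
    """Group runs of consecutive tags with z > threshold into (start, end) intervals."""
    hotspots = []
    current = None  # [start, end] of the run being built
    for pos, z in zip(tags, tag_zscores(tags)):
        if z > threshold and current is not None and pos - current[1] <= max_gap:
            current[1] = pos                      # extend the current run
            continue
        if current is not None:                   # close the current run
            hotspots.append((current[0], current[1]))
            current = None
        if z > threshold:
            current = [pos, pos]                  # start a new run
    if current is not None:
        hotspots.append((current[0], current[1]))
    return hotspots


# Toy data: one tag every 100 bp, plus a dense cluster of 30 tags near 50 kb.
background = list(range(0, 100_001, 100))
cluster = list(range(50_001, 50_031))
print(call_hotspots(sorted(background + cluster)))
```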

Examples of different kinds of hotspots:
1. Monsters
2. Noisy regions

Shadowed hotspots. Problem: regions of very high enrichment can inflate the background for neighboring regions, deflating their z-scores. [Figure: chr1:604, ,350, shown at full scale and rescaled.] The neighboring hotspots would be highly significant in isolation, but are missed due to shadowing by the monster.

Shadowed hotspots. Solution: implement a two-pass hotspot detection scheme.
1. Run a first pass of hotspot detection.
2. Delete all tags falling in the first-pass hotspots.
3. Compute new hotspots on the resulting deleted background.
4. Combine the hotspots from the first and second passes, and re-score all of them using the deleted background, so that every 50kb window includes only tags from the deleted background.
[Figure panels: Pass 1, Deleted background, Pass 2.]
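The control flow of the two-pass scheme can be sketched as follows. The hotspot caller and the re-scoring function are passed in as parameters; `toy_call` and `toy_score` are deliberately simple, made-up stand-ins (not the z-score method) so that the example runs on its own and shows a shadowed region being recovered in the second pass.

```python
from collections import Counter


def remove_tags_in(tags, intervals):
    """Return the tags that fall outside every (start, end) interval (inclusive)."""
    return [t for t in tags if not any(s <= t <= e for s, e in intervals)]


def two_pass(tags, call_fn, score_fn):
    """Two-pass detection: call, delete first-pass tags, call again, re-score."""
    first = call_fn(tags)                         # 1. first pass
    background = remove_tags_in(tags, first)      # 2. delete tags in first-pass hotspots
    second = call_fn(background)                  # 3. second pass on the deleted background
    combined = sorted(set(first) | set(second))   # 4. combine both passes ...
    # ... and re-score every hotspot against the deleted background.
    return [(hs, score_fn(hs, tags, background)) for hs in combined]


def toy_call(tags):
    """Toy caller: a 100 bp bin is a hotspot if it holds >= 3 tags and >= half the top bin."""
    counts = Counter(t // 100 for t in tags)
    if not counts:
        return []
    top = max(counts.values())
    return [(b * 100, b * 100 + 99) for b, c in sorted(counts.items())
            if c >= 3 and c >= top / 2]


def toy_score(hotspot, tags, background):
    """Toy score: number of original tags inside the hotspot (ignores background)."""
    return sum(hotspot[0] <= t <= hotspot[1] for t in tags)


# A "monster" bin with 100 tags shadows a modest bin with 10 tags in the first pass.
demo_tags = [5] * 100 + [205] * 10 + [405, 605]
result = two_pass(demo_tags, toy_call, toy_score)
print(result)
```

A real `score_fn` would compute the z-score with the 50kb window counted only from `background`; that detail is left out here because the slides do not spell it out.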

Hotspots are robust to regions of duplication. [Figure: called peaks (height = z-score) at chr8:129,897, ,347,975, chr8:130,151, ,201,725 and chr8:129,904, ,979,850.] Disparate peak heights, but comparable z-scores.

Random Tags. As a null model for FDR calculations, we generate tags uniformly over the uniquely mappable (for 27-mers) bases of the genome. We use the same number of tags for observed and random data. The random data still coalesce into hotspots. [Figure: observed vs. random tags, and observed vs. random hotspots.]
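Generating tags uniformly over mappable bases might look like the sketch below. It assumes the uniquely mappable bases are given as a list of half-open `(start, end)` intervals on a single chromosome; the function name `random_tags`, that input format, and the `seed` parameter are illustrative assumptions, not part of the original method description.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def random_tags(mappable, n_tags, seed=0):
    """Draw n_tags positions uniformly over the bases covered by `mappable`.

    `mappable` is a list of non-overlapping half-open (start, end) intervals.
    """
    rng = random.Random(seed)
    lengths = [end - start for start, end in mappable]
    cumulative = list(accumulate(lengths))   # running total of mappable bases
    total = cumulative[-1]
    tags = []
    for _ in range(n_tags):
        k = rng.randrange(total)             # index into the concatenated mappable bases
        i = bisect_right(cumulative, k)      # interval that contains base k
        offset = k - (cumulative[i - 1] if i > 0 else 0)
        tags.append(mappable[i][0] + offset)
    return sorted(tags)


# Same number of random tags as observed tags (here, an arbitrary 5).
print(random_tags([(100, 110), (500, 520)], n_tags=5))
```

Sampling an index into the concatenated mappable bases gives every mappable base the same probability, regardless of how the intervals are split.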

Properties of Random Tags. Still lots of hotspots: 146,752 in random data with the same number of tags as observed, versus 395,433 in the observed data (GM).

Properties of Random Tags. Enriched in promoters?! (Yes, slightly, since uniquely mappable 27-mers are enriched in promoters.) [Figure: average tag density vs. distance to Tx start sites.]

FDR Calculations Using Random Tags. $\mathrm{FDR}(\text{z-score} = T) = \dfrac{\#\ \text{of random peaks with}\ z \ge T}{\#\ \text{of observed peaks with}\ z \ge T}$. This is probably conservative, since the numerator is likely an overestimate of the number of false positives in the observed data. [Figure: observed vs. random.]
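The FDR ratio is straightforward to compute from two lists of peak z-scores. A small sketch, with the function name `fdr_at` and the choice to return `None` when no observed peak passes the threshold both being assumptions:

```python
def fdr_at(threshold, observed_z, random_z):
    """Estimated FDR at z >= threshold: random peak count / observed peak count."""
    n_observed = sum(z >= threshold for z in observed_z)
    n_random = sum(z >= threshold for z in random_z)
    if n_observed == 0:
        return None  # ratio undefined when no observed peak passes the threshold
    return n_random / n_observed


observed = [1, 6, 7, 8, 10]   # made-up z-scores of observed peaks
randomized = [1, 2, 5]        # made-up z-scores of peaks called on random tags
print(fdr_at(5, observed, randomized))
```

In this toy input one random peak and four observed peaks have z >= 5, so the estimate is 1/4. The ratio is not capped at 1, matching the formula as stated.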

Extending to multiple cell types. Call a location multi-cell verified (MCV) if hotspot peaks from different cell types overlap there (after fattening peaks to 300bp). Score these MCV zones with the maximum z-score over the cell type peaks. MCV peaks are then identified by looking at the summed density in the zones. Repeat with multiple random datasets to get random MCV peaks for FDR calculations. [Figure: MCV zones, summed density and MCV peaks at chr5:131,585, ,597,894 (GM and BJ).]
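For two cell types, finding MCV zones could be sketched as below. It assumes each peak is a `(position, z_score)` pair, that fattening means a 300 bp interval centered on the peak, and that the zone is the intersection of the overlapping fattened peaks; none of these details are stated in the slides, and the names `fatten` and `mcv_zones` are hypothetical. Identifying MCV peaks from the summed density is not covered.

```python
def fatten(position, width=300):
    """Half-open interval of the given width centered on a peak position."""
    return position - width // 2, position + width // 2


def mcv_zones(peaks_a, peaks_b, width=300):
    """Zones where fattened peaks from two cell types overlap.

    Each peak is a (position, z_score) pair. Returns (start, end, score) tuples,
    scored with the maximum z-score of the two overlapping peaks.
    """
    zones = []
    for pos_a, z_a in peaks_a:
        start_a, end_a = fatten(pos_a, width)
        for pos_b, z_b in peaks_b:
            start_b, end_b = fatten(pos_b, width)
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # the fattened peaks overlap
                zones.append((start, end, max(z_a, z_b)))
    return sorted(zones)


# Made-up peaks for two cell types.
gm_peaks = [(1000, 5.0), (5000, 3.0)]
bj_peaks = [(1100, 8.0), (9000, 4.0)]
print(mcv_zones(gm_peaks, bj_peaks))
```

The nested loop is quadratic in the number of peaks; a sweep over sorted intervals would be the natural replacement for genome-scale peak lists.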