Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.

Presentation transcript:

Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

One-minute response: pacing
Good pace, good content.
Algorithms seem like a good pace.
Pacing was good. (x2)
I liked pacing of the theoretical part and good allocation of time between the two parts.
Pacing a bit fast in Python part today.
Description of loops seemed a bit slow, but then practice problems seemed fast.

One-minute response: stuff people liked
Motif part is excellent.
As class gets more difficult, it’s nice to have more time for programming problems. (x2)
Going over the answers stepwise for the practice problems was so helpful!
I feel like I made a lot of progress with for loops.
Exercises required thought but I was ultimately able to get them to work.
Enjoyed class today! Really like the logic and new material.
Good class today. Really interested to learn how motif searching works.
I’m really enjoying the exercises and homework problems. They’re appropriately difficult and fun!

One-minute response
I got confused during the practice problems with looping.
I would appreciate more practice with loops.
I would like to learn more about the theory/logic behind the programming we are learning.
I think I would like the class to be more challenging. It’s a bit slow-paced.
It would be great if the current problem could stay on the screen while other slides are brought up.
Is there a way to get comments on the homework about what we did wrong?
Is it possible to get solutions to the homework?
When homework gets graded, can you go over common errors that you saw?
One thing I am interested in is analyzing a data file in Python that has headings of columns/rows and then numbers.

Motif
Compared to surrounding sequences, motifs experience fewer mutations. Why?
Because a mutation inside a motif reduces the chance that the organism will survive.
What is an example of the function of a motif?
Binding site, phosphorylation site, structural motif.
Why do we want to find motifs?
To understand the function of the sequence, and to identify distant homologs.

Review: Score this motif occurrence
A   1.32   1.32  -0.15  -3.32  -3.32  -0.15
C  -3.32  -3.32  -1.00  -3.32  -3.32  -3.32
G  -3.32  -1.00  -1.00  -3.32   1.89  -3.32
T   0.38  -0.15   1.07   1.89  -3.32   1.54
Sequence: TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG
0.38 - 1.00 - 1.00 - 3.32 - 3.32 + 1.54 = -6.72
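The scoring step in this review can be sketched in Python. This is a minimal illustration, not the course's own code: the names `MOTIF`, `SEQUENCE` and `score_site` are made up here. The transcript does not record which window of the sequence was highlighted on the slide, but the 6-mer TGGCAT, which does occur in the sequence, reproduces the sum shown.

```python
# Motif matrix from the slide: one list per nucleotide, one entry per motif position.
MOTIF = {
    "A": [1.32, 1.32, -0.15, -3.32, -3.32, -0.15],
    "C": [-3.32, -3.32, -1.00, -3.32, -3.32, -3.32],
    "G": [-3.32, -1.00, -1.00, -3.32, 1.89, -3.32],
    "T": [0.38, -0.15, 1.07, 1.89, -3.32, 1.54],
}

SEQUENCE = "TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG"


def score_site(site, motif=MOTIF):
    """Add up the matrix entry for the letter seen at each motif position."""
    width = len(motif["A"])
    if len(site) != width:
        raise ValueError("site must be exactly as long as the motif")
    return sum(motif[letter][pos] for pos, letter in enumerate(site))


# TGGCAT is an assumed choice of window; it matches the terms in the slide's sum.
print(round(score_site("TGGCAT"), 2))
```

Sliding `score_site` over every 6-letter window of `SEQUENCE` gives one score per start position, which is what a motif scan does.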

CTCF
One of the most important transcription factors in human cells.
Responsible both for turning genes on and for maintaining 3D structure of the DNA.

CTCF binds to a long sequence motif

Searching human chromosome 21 with the CTCF motif

Significance of scores
Motif matrix (only the T row survives in this transcript): T 0.38 -0.15 1.07 1.89 -3.32 1.54
Sequence: TTGACCAGCAGGGGGCGCCG
Motif scanning algorithm
Score: 6.30
Low score = not a motif occurrence
High score = motif occurrence
How high is high enough?

Two ways to assess significance
Empirical: Randomly generate data according to the null hypothesis. Use the resulting score distribution to estimate p-values.
Exact: Mathematically calculate all possible scores.

CTCF empirical null distribution

Computing a p-value
The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value.
p-value = Pr(data|null)
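The empirical approach can be sketched as follows. This is a rough illustration under stated assumptions: the null model here is independent, uniformly distributed nucleotides (the slides do not say which null was used for CTCF), the matrix is the 6-position one from the review slide, and `empirical_p_value` is a hypothetical name.

```python
import random

# 6-position motif matrix from the review slide.
MOTIF = {
    "A": [1.32, 1.32, -0.15, -3.32, -3.32, -0.15],
    "C": [-3.32, -3.32, -1.00, -3.32, -3.32, -3.32],
    "G": [-3.32, -1.00, -1.00, -3.32, 1.89, -3.32],
    "T": [0.38, -0.15, 1.07, 1.89, -3.32, 1.54],
}


def empirical_p_value(observed, motif, n_samples=10000, seed=0):
    """Estimate Pr(score >= observed) by scoring random sites.

    Assumption: the null model is uniform, independent A/C/G/T.
    """
    rng = random.Random(seed)
    width = len(motif["A"])
    hits = 0
    for _ in range(n_samples):
        site = [rng.choice("ACGT") for _ in range(width)]
        score = sum(motif[letter][pos] for pos, letter in enumerate(site))
        if score >= observed:
            hits += 1
    # Fraction of random sites scoring at least as high as the observed score.
    return hits / n_samples


print(empirical_p_value(4.0, MOTIF))
```

With `n_samples` random sites, the smallest nonzero estimate is 1 / `n_samples`, so very high scores get p-values that are either zero or very noisy. That is the precision problem in the tail.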

Poor precision in the tail

Converting scores to p-values
Linearly rescale the matrix values to the range [0,100] and integerize.

Converting scores to p-values
Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative.

Converting scores to p-values
Find the largest value. Divide 100 by that value (in the slide's example, 100 / 7 = 14.2857). Multiply through by the result. All entries are now between 0 and 100.

Converting scores to p-values
Round to the nearest integer.
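The three rescaling steps can be sketched in a few lines of Python. The function name `integerize` and the toy matrix are made up for illustration. The toy values were chosen so that the largest shifted entry is 7, matching the 100 / 7 factor on the slide, but they are not the slide's actual matrix, which is not in this transcript.

```python
def integerize(motif, top=100):
    """Shift, scale and round a motif matrix so every entry is an integer in [0, top]."""
    # Step 1: subtract the smallest value so all entries are non-negative.
    smallest = min(min(row) for row in motif.values())
    shifted = {letter: [v - smallest for v in row] for letter, row in motif.items()}
    # Step 2: scale so the largest entry becomes `top`.
    largest = max(max(row) for row in shifted.values())
    if largest == 0:
        raise ValueError("all matrix entries are equal; nothing to rescale")
    scale = top / largest
    # Step 3: round to the nearest integer.
    return {letter: [round(v * scale) for v in row] for letter, row in shifted.items()}


# Hypothetical two-position matrix with values between -3 and 4.
TOY = {"A": [-3, 1], "C": [4, 0], "G": [0, -1], "T": [1, 2]}
print(integerize(TOY))
```

Rounding changes the scores slightly, so p-values computed from the integer matrix are approximations to those of the original matrix. A larger `top` gives a finer grid at the cost of a wider table.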

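Converting scores to p-values
Integerized motif (one column per motif position):
A   10  67  59  44
C   60  39  49  29
G    0  71  50  54
T  100  43  13  64
Count matrix columns (scores): 0 1 2 3 4 … 400
Say that your motif has N positions (the columns of the motif above). Create a matrix that has N rows and 100N + 1 columns, one for each score from 0 to 100N. The entry in row i, column j is the number of different sequences of length i that can have a score of j.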

Converting scores to p-values
Integerized motif (one column per motif position):
A   10  67  59  44
C   60  39  49  29
G    0  71  50  54
T  100  43  13  64
Count matrix columns (scores): 0 1 2 3 4 … 10 … 60 … 100 … 400
Row 1: a 1 in columns 0, 10, 60 and 100.
For each value in the first column of your motif, put a 1 in the corresponding entry in the first row of the matrix. There are only 4 possible sequences of length 1.

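Converting scores to p-values
Integerized motif (one column per motif position):
A   10  67  59  44
C   60  39  49  29
G    0  71  50  54
T  100  43  13  64
Count matrix columns (scores): 0 1 2 3 4 … 10 … 60 … 77 … 100 … 400
Row 1: a 1 in columns 0, 10, 60 and 100.
Row 2 (so far): a 1 in column 77.
For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to column x+z of the second row of the matrix.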

Converting scores to p-values
Integerized motif (one column per motif position):
A   10  67  59  44
C   60  39  49  29
G    0  71  50  54
T  100  43  13  64
Count matrix columns (scores): 0 1 2 3 4 … 10 … 60 … 77 … 100 … 400
Row 1: a 1 in columns 0, 10, 60 and 100.
Row 2 (so far): a 1 in column 77.
For each value x in the second column of your motif, consider each value y in the zth column of the first row of the matrix. Add y to column x+z of the second row of the matrix.
What values will go in row 2?
10+67, 10+39, 10+71, 10+43, 60+67, …, 100+43
These 16 values correspond to all 16 strings of length 2.

Converting scores to p-values
Integerized motif (one column per motif position):
A   10  67  59  44
C   60  39  49  29
G    0  71  50  54
T  100  43  13  64
In the end, the bottom row contains, for every possible score, the number of sequences of length N that have that score. Use these counts to compute a p-value.
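The table-filling procedure can be sketched in Python using the integerized matrix from these slides. The names `INT_MOTIF`, `score_counts` and `exact_p_value` are illustrative. Two choices here are assumptions rather than something the slides state: only the current row of the table is kept in memory, and the final p-value treats all 4^N sequences as equally likely. The sketch also starts from an empty sequence with score 0, which yields the same first row as the slide's rule and still counts correctly if two letters share a value.

```python
# Integerized motif from the slides: one list per nucleotide, one entry per position.
INT_MOTIF = {
    "A": [10, 67, 59, 44],
    "C": [60, 39, 49, 29],
    "G": [0, 71, 50, 54],
    "T": [100, 43, 13, 64],
}


def score_counts(int_motif, top=100):
    """Return a list where entry j is the number of length-N sequences scoring j."""
    width = len(int_motif["A"])
    n_scores = top * width + 1          # scores run from 0 to top * N
    counts = [0] * n_scores
    counts[0] = 1                       # the empty sequence has score 0
    for pos in range(width):
        next_counts = [0] * n_scores
        for z, y in enumerate(counts):  # y prefixes currently have score z
            if y == 0:
                continue
            for row in int_motif.values():
                x = row[pos]            # value added by this letter at this position
                next_counts[z + x] += y
        counts = next_counts
    return counts


def exact_p_value(score, counts):
    """Fraction of all sequences scoring at least `score` (uniform null assumed)."""
    return sum(counts[score:]) / sum(counts)


counts = score_counts(INT_MOTIF)
print(exact_p_value(250, counts))
```

The work grows with N times the number of score columns, not with 4^N, which is why the scores are integerized first.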