Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.

1 Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

2 One-minute response: pacing
Good pace, good content. Algorithms seem like a good pace. Pacing was good. X2 I liked pacing of the theoretical part and good allocation of time between the two parts. Pacing a bit fast in Python part today. Description of loops seemed a bit slow, but then practice problems seemed fast.

3 One-minute response: stuff people liked
Motif part is excellent. As class gets more difficult, it’s nice to have more time for programming problems. x2 Going over the answers stepwise for the practice problems was so helpful! I feel like I made a lot of progress with for loops. Exercises required thought but I was ultimately able to get them to work. Enjoyed class today! Really like the logic and new material. Good class today. Really interested to learn how motif searching works. I’m really enjoying the exercises and homework problems. They’re appropriately difficult and fun!

4 One-minute response
I got confused during the practice problems with looping. I would appreciate more practice with loops. I would like to learn more about the theory/logic behind the programming we are learning. I think I would like the class to be more challenging. It’s a bit slow-paced. It would be great if the current problem could stay on the screen while other slides are brought up. Is there a way to get comments on the homework about what we did wrong? Is it possible to get solutions to the homework? When homework gets graded, can you go over common errors that you saw? One thing I am interested in is analyzing a data file in Python that has headings of columns/rows and then numbers.

5 Motif
Compared to surrounding sequences, motifs experience fewer mutations. Why? Because a mutation inside a motif reduces the chance that the organism will survive.
What is an example of the function of a motif? Binding site, phosphorylation site, structural motif.
Why do we want to find motifs? To understand the function of the sequence, and to identify distant homologs.

6 Review: Score this motif occurrence
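[Figure: a score matrix with rows A, C, G, T applied to one site in the sequence below. The visible entries include -1.00 and -3.32, and the total score is -6.72.]
TAATGTTTGTGCTGGTTTTTGTGGCATCGGGCGAGAATAGCGCGTGGTGTGAAAG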
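Scoring a motif occurrence amounts to looking up one matrix entry per position and summing them. The following is a minimal sketch of that idea, not the matrix from the slide: the `PSSM` values and the function name `score_site` are invented for illustration.

```python
# Hypothetical score matrix: one dict per motif position.
# All values are invented; they are NOT the matrix shown on the slide.
PSSM = [
    {"A": 1.2, "C": -1.0, "G": -3.32, "T": 0.4},
    {"A": -2.0, "C": 1.5, "G": -1.0, "T": -0.4},
    {"A": 0.3, "C": -3.32, "G": 1.1, "T": -1.0},
]


def score_site(pssm, site):
    """Sum the matrix entry selected by each letter of `site`."""
    if len(site) != len(pssm):
        raise ValueError("site length must equal the motif width")
    return sum(column[base] for column, base in zip(pssm, site))


# Score one candidate site: -1.0 (C) + -1.0 (G) + -3.32 (C).
print(score_site(PSSM, "CGC"))
```

A low (here negative) total means the site matches the motif poorly at most positions.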

7 CTCF
One of the most important transcription factors in human cells.
Responsible both for turning genes on and for maintaining the 3D structure of the DNA.

8 CTCF binds to a long sequence motif

9 Searching human chromosome 21 with the CTCF motif

10 Significance of scores
[Figure: the motif scanning algorithm applied to the sequence TTGACCAGCAGGGGGCGCCG, producing a score of 6.30.]
Low score = not a motif occurrence. High score = motif occurrence. How high is high enough?
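Scanning can be sketched as sliding a window of the motif's width along the sequence and scoring every window. The example below reuses the sequence from the slide, but the three-position matrix `PSSM` and the function name `scan` are hypothetical and much shorter than the real CTCF motif.

```python
# Hypothetical 3-position score matrix (invented values) that favors "CAG".
PSSM = [
    {"A": -1.0, "C": 1.0, "G": -0.5, "T": -1.0},
    {"A": 1.0, "C": -1.0, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -0.5},
]


def scan(pssm, sequence):
    """Return a (start, score) pair for every window of motif width."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        hits.append((start, score))
    return hits


hits = scan(PSSM, "TTGACCAGCAGGGGGCGCCG")
best_start, best_score = max(hits, key=lambda hit: hit[1])
print(best_start, best_score)
```

The scan yields one score per window; deciding which of those scores count as occurrences is the question the following slides address.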

11 Two ways to assess significance
Empirical: Randomly generate data according to the null hypothesis. Use the resulting score distribution to estimate p-values.
Exact: Mathematically calculate all possible scores.

12 CTCF empirical null distribution

13 Computing a p-value
The probability of observing a score >4 is the area under the null distribution curve to the right of 4. This probability is called a p-value. p-value = Pr(data at least this extreme | null)
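The empirical approach can be sketched as follows: draw random sites under the null, score them, and report the fraction of null scores at least as large as the observed one. This sketch assumes a null model of independent, equally likely bases; the matrix `PSSM` and the names `null_scores` and `empirical_pvalue` are hypothetical.

```python
import random

# Hypothetical 3-position score matrix (invented values).
PSSM = [
    {"A": -1.0, "C": 1.0, "G": -0.5, "T": -1.0},
    {"A": 1.0, "C": -1.0, "G": -0.5, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -0.5},
]


def score_site(pssm, site):
    """Sum the matrix entry selected by each letter of `site`."""
    return sum(column[base] for column, base in zip(pssm, site))


def null_scores(pssm, n_samples, seed=0):
    """Score random sites drawn with uniform base frequencies (an assumption)."""
    rng = random.Random(seed)
    width = len(pssm)
    scores = []
    for _ in range(n_samples):
        site = "".join(rng.choice("ACGT") for _ in range(width))
        scores.append(score_site(pssm, site))
    return scores


def empirical_pvalue(observed, scores):
    """Fraction of null scores that are at least as large as `observed`."""
    return sum(1 for s in scores if s >= observed) / len(scores)


scores = null_scores(PSSM, 10_000)
print(empirical_pvalue(3.0, scores))
```

With n random samples, the smallest nonzero p-value this estimate can report is 1/n, which is one way to see why the empirical approach has poor precision in the tail.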

14 Poor precision in the tail

15 Converting scores to p-values
Linearly rescale the matrix values to the range [0,100] and integerize.

16 Converting scores to p-values
Find the smallest value. Subtract that value from every entry in the matrix. All entries are now non-negative.

17 Converting scores to p-values
Find the largest value. Divide 100 by that value. Multiply through by the result. All entries are now between 0 and 100. In the slide's example the scale factor is 100 / 7 ≈ 14.29.

18 Converting scores to p-values
Round to the nearest integer.
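The three steps above (shift, scale, round) can be sketched in a few lines. The matrix values and the function name `rescale_to_integers` are hypothetical; the values were chosen so that the range after shifting is 7, matching the 100 / 7 factor in the slides.

```python
def rescale_to_integers(pssm, top=100):
    """Shift so the minimum is 0, scale so the maximum is `top`, then round."""
    values = [v for column in pssm for v in column.values()]
    lowest = min(values)
    span = max(values) - lowest  # largest value after the shift
    if span == 0:
        raise ValueError("all matrix entries are identical")
    factor = top / span
    # Note: Python's round() sends exact .5 ties to the nearest even integer.
    return [
        {base: round((v - lowest) * factor) for base, v in column.items()}
        for column in pssm
    ]


# Hypothetical matrix: minimum -3.0, maximum 4.0, so the factor is 100 / 7.
raw = [
    {"A": 4.0, "C": -3.0, "G": 0.5, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -3.0, "T": 0.0},
]
print(rescale_to_integers(raw))
```

Integer scores matter because they make the set of possible total scores finite and small, which the counting procedure on the next slides relies on.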

19 Converting scores to p-values
Say that your motif has N positions. Create a matrix that has N rows and 100N + 1 columns, one column for each possible integer score from 0 to 100N. The entry in row i, column j is the number of different sequences of length i that have a score of j.

20 Converting scores to p-values
For each value in the first position of your motif, add 1 to the entry in the first row of the matrix whose column index equals that value. There are only 4 possible sequences of length 1, so the first row sums to 4.

21 Converting scores to p-values
For each value x in the second position of your motif, consider each nonzero count y in column z of the first row of the matrix. Add y to column x+z of the second row.

22 Converting scores to p-values
For each value x in the second position of your motif, consider each nonzero count y in column z of the first row of the matrix. Add y to column x+z of the second row. What values will go in row 2? 10+67, 10+39, 10+71, 10+43, 60+67, … These 16 sums correspond to all 16 strings of length 2.

24 Converting scores to p-values
In the end, the bottom row contains, for every possible score, the number of sequences of length N that achieve it. Use these counts to compute a p-value.
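The counting procedure can be sketched as a small dynamic program. This version makes a few choices that are assumptions rather than the slides' exact layout: it keeps only the current row instead of the full N-row matrix, it starts from the empty sequence (score 0) rather than filling row 1 by hand, and it treats every sequence as equally likely when converting counts to a p-value. The names `score_counts` and `exact_pvalue` are hypothetical. In `MOTIF`, the values 10, 60, 67, 39, 71 and 43 echo the sums quoted on the slide; 0 and 100 are invented to complete the first position.

```python
def score_counts(int_pssm):
    """counts[s] = number of full-width sequences whose total score is s."""
    max_total = sum(max(column.values()) for column in int_pssm)
    counts = [0] * (max_total + 1)
    counts[0] = 1  # one empty sequence, with score 0
    for column in int_pssm:
        new_counts = [0] * (max_total + 1)
        for z, y in enumerate(counts):
            if y == 0:
                continue
            for x in column.values():
                # Each of the y sequences scoring z extends to one scoring z + x.
                new_counts[z + x] += y
        counts = new_counts
    return counts


def exact_pvalue(counts, score):
    """Fraction of sequences scoring >= `score`, assuming a uniform null."""
    start = max(score, 0)
    return sum(counts[start:]) / sum(counts)


# Non-negative integer matrix, as produced by the rescaling step.
MOTIF = [
    {"A": 10, "C": 60, "G": 0, "T": 100},
    {"A": 67, "C": 39, "G": 71, "T": 43},
]
counts = score_counts(MOTIF)
print(sum(counts), exact_pvalue(counts, 140))
```

Unlike the empirical estimate, the work here grows with the motif width times the number of possible scores rather than with the 4^N sequences themselves, and the resulting p-values are exact for the rounded integer matrix even far out in the tail.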
