S. Maarschalkerweerd & A. Tjhang: Probability Theory and Basic Alignment of String Sequences, Chapter 1.1-2.3.

1 S. Maarschalkerweerd & A. Tjhang: Probability Theory and Basic Alignment of String Sequences, Chapter 1.1-2.3

2 Overview
Probability Theory
- Maximum Likelihood
- Bayes' Theorem
Pairwise Alignment
- The Scoring Model
- Alignment Algorithms

3 Probability Theory

4 Probability Theory
What is a probabilistic model?
Simple example: what is the probability of a base sequence $x_1 x_2 \ldots x_n$?
$$P(x) = \prod_{i=1}^{n} p(x_i)$$
where $p(x_1), p(x_2), \ldots, p(x_n)$ are independent of each other.
If $p_C = 0.3$, $p_T = 0.2$ and the sequence is CTC: $P(\mathrm{CTC}) = 0.3 \times 0.2 \times 0.3 = 0.018$
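
A minimal Python sketch of the independent-residue model on this slide. The function name `sequence_probability` is made up for illustration; the probabilities are the ones given above.

```python
from math import prod

def sequence_probability(seq, p):
    """Probability of seq when every residue is drawn independently from p."""
    return prod(p[base] for base in seq)

# Per-base probabilities from the slide.
p = {"C": 0.3, "T": 0.2}
print(sequence_probability("CTC", p))  # approximately 0.018
```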

5 Maximum Likelihood Estimation
Estimate the parameters of the model from a large set of examples (the training set).
- For example: P(T) and P(C) are estimated from their frequencies in a database of residues.
Avoid overfitting.
- If the database is too small, the model also fits the noise in the training set.
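
As an illustration, the frequency-based estimate can be sketched as below. The training sequences are a made-up toy set, and `estimate_frequencies` is a hypothetical helper name; a set this small is exactly the situation in which overfitting is a risk.

```python
from collections import Counter

def estimate_frequencies(sequences):
    """Maximum likelihood estimate: relative frequency of each residue."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)  # count every residue in the sequence
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# Hypothetical toy training set (5 C's and 7 T's in total).
training = ["CTCT", "CCTT", "TTTC"]
freqs = estimate_frequencies(training)
print(freqs)  # P(C) = 5/12, P(T) = 7/12
```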

6 Probability Theory
Conditional probability
- Joint probability: $P(X,Y) = P(X \mid Y)\,P(Y)$
- Marginal probability: $P(X) = \sum_Y P(X,Y) = \sum_Y P(X \mid Y)\,P(Y)$

7 Bayes' Theorem
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
- $P(X \mid Y)$ is the posterior probability.
Example:
- $P(X)$ = probability that a tumor is visible on the X-ray
- $P(C)$ = probability of breast cancer = 0.01
- $P(X \mid C) = 0.9$; $P(X \mid \neg C) = 0.05$
- A tumor is seen on the X-ray. What is the probability that the woman has breast cancer?
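
One way to work the example numerically, as a sketch: the evidence P(X) is obtained by marginalising over C and ¬C, then Bayes' theorem gives P(C|X). The function and parameter names are illustrative, not from the slides.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(C|X) from P(C), P(X|C) and P(X|not C) via Bayes' theorem."""
    # Marginal probability: P(X) = P(X|C) P(C) + P(X|not C) P(not C)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(p, 3))  # 0.154
```

With these numbers, a visible tumor raises the probability of cancer from 1% to roughly 15%, because false positives among the 99% of healthy cases outnumber the true positives.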

8 Pairwise Alignment

9 Pairwise Alignment
Goal: determine whether two sequences are related (homologous).
Issues regarding pairwise alignment:
1. What sorts of alignment should be considered?
2. The scoring system used to rank alignments.
3. The algorithm used to find optimal (or good) scoring alignments.
4. The statistical methods used to evaluate the significance of an alignment score.

10 Example
You need a 'smart' scoring model to distinguish b from c.

11 The Scoring Model

12 The Scoring Model
If two sequences are related, both descend from a common ancestor.
- Sequences change through mutation:
  - Substitutions
  - Gaps (insertions or deletions)
- Natural selection ensures that some mutations are seen more often than others (survival of the fittest).

13 The Scoring Model
The total score of an alignment is the sum of:
- a term for each aligned pair of residues
- a term for each gap

14 Substitution Matrices
We need a matrix with scores for every possible pair of residues (e.g. bases or amino acids).
We can compute these scores by:
$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$
- $p_{ab}$ = probability that residues a and b have each been derived independently from some unknown original residue c
- $q_a$ = frequency of a
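
A small sketch of the log-odds score. The probabilities passed in below are invented purely to show the sign behaviour; they are not taken from any real substitution matrix.

```python
from math import log

def log_odds_score(p_ab, q_a, q_b):
    """s(a,b) = log(p_ab / (q_a * q_b))."""
    return log(p_ab / (q_a * q_b))

# Hypothetical values: the pair is seen twice as often as chance predicts.
print(log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1))   # positive, about log 2
# Hypothetical values: the pair is seen half as often as chance predicts.
print(log_odds_score(p_ab=0.005, q_a=0.1, q_b=0.1))  # negative, about -log 2
```

Pairs that occur together more often than the product of their background frequencies get a positive score; pairs that occur less often get a negative one.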

15 BLOSUM50

16 Gap Penalties
- Linear score: $\gamma(g) = -gd$
- Affine score: $\gamma(g) = -d - (g-1)e$
  - d = gap-open penalty
  - e = gap-extension penalty
  - g = gap length
$$P(\text{gap}) = f(g) \prod_{i \in \text{gap}} q_{x_i}$$
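
The two gap scores written as Python functions, as a sketch. The penalty values in the example calls are arbitrary choices for illustration.

```python
def linear_gap(g, d):
    """Linear gap score: every gap position costs d."""
    return -g * d

def affine_gap(g, d, e):
    """Affine gap score: d to open the gap, e for each further position."""
    return -d - (g - 1) * e

# Hypothetical penalties for a gap of length 3.
print(linear_gap(3, d=8))        # -24
print(affine_gap(3, d=12, e=2))  # -16
```

With e smaller than d, the affine score penalises one long gap less than several short ones of the same total length.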

17 Alignment Algorithms

18 Alignment Algorithms
- Needleman-Wunsch (global alignment)
- Smith-Waterman (local alignment)
- Repeated matches
- Overlap matches
- Hybrid match conditions

19 Dynamic Programming
- The number of possible alignments is enormous.
- Algorithm for finding the optimal alignment: use dynamic programming.
- Save sub-results for later reuse, so the same subproblem is never calculated twice.

20 Needleman-Wunsch Algorithm
Global alignment
- For sequences of length n and m, make an (n+1) x (m+1) matrix.
- Fill in from top left to bottom right:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
- Keep a pointer to the cell that was used to derive F(i,j).
- Takes O(nm) time and memory.
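
A compact sketch of the fill and traceback steps. Two things here are assumptions rather than content of the slide: the border initialisation F(i,0) = -id and F(0,j) = -jd (the usual choice for global alignment with a linear gap penalty), and the match/mismatch scoring function `simple_score`, which stands in for a real substitution matrix.

```python
def needleman_wunsch(x, y, score, d):
    """Global alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # Assumed border values: a leading gap of length k costs k * d.
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = -i * d, "up"
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = -j * d, "left"
    # Fill from top left to bottom right, keeping a pointer per cell.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (F[i - 1][j - 1] + score(x[i - 1], y[j - 1]), "diag"),
                (F[i - 1][j] - d, "up"),
                (F[i][j - 1] - d, "left"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # Traceback from the bottom-right cell to the top-left cell.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(needleman_wunsch("ACGT", "AGT", simple_score, d=2))  # (1, 'ACGT', 'A-GT')
```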

21 Matrix [figure: example dynamic programming matrix, not reproduced]

22 Matrix Traceback

23 Smith-Waterman Algorithm
Local alignment
Two differences with Needleman-Wunsch:
1. The recurrence has an extra option, 0:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
2. A local alignment can end anywhere, so the traceback starts from the highest value in the matrix (not necessarily the bottom-right cell).
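
Both differences can be seen in the following sketch. It assumes integer scores, so the traceback can recompute which option produced each cell instead of storing pointers; `simple_score` is a hypothetical match/mismatch scoring, not a real substitution matrix.

```python
def smith_waterman(x, y, score, d):
    """Local alignment with linear gap penalty d; returns (score, aligned x, aligned y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,  # difference 1: an alignment may restart at any cell
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: trace back from the highest cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

print(smith_waterman("TTACGTTT", "CCACGCC", simple_score, d=2))  # (3, 'ACG', 'ACG')
```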

24 Matrix

25 Smith-Waterman Algorithm
- The expected score s(a,b) for a random match must be negative.
- There must be some s(a,b) greater than 0, or no alignment is found.
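
Both conditions can be checked for a given scoring scheme, as sketched below. The uniform base frequencies and the +1/-1 scoring are assumptions chosen for illustration; the expected score is taken as the sum of q_a q_b s(a,b) over all residue pairs.

```python
def expected_score(q, score):
    """Expected score of a random pair: sum over a, b of q_a * q_b * s(a, b)."""
    return sum(q[a] * q[b] * score(a, b) for a in q for b in q)

def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

# Assumed uniform background frequencies over the four bases.
q = {base: 0.25 for base in "ACGT"}
print(expected_score(q, simple_score))                      # -0.5
print(any(simple_score(a, b) > 0 for a in q for b in q))    # True
```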

26 Repeated Matches
- Many local alignments are possible if one or both sequences are long; Smith-Waterman finds only one of them.
- Find parts of one sequence in the other sequence.
- Not every alignment is useful: use a threshold.

27 Repeated Matches
$$F(i,0) = \max \begin{cases} F(i-1,0) \\ F(i-1,j) - T, & j = 1,\ldots,m \end{cases}$$
$$F(i,j) = \max \begin{cases} F(i,0) \\ F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
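
A sketch of the two recurrences that computes only the total score, without traceback. Several details are assumptions not stated on the slide: F(0,j) is initialised to 0, an extra cell F(n+1,0) is computed with the first recurrence to collect the final total, and the scoring values (+2/-2, d = 3, T = 5) are arbitrary.

```python
def repeated_matches_score(x, y, score, d, T):
    """Total score of all matches of y in x, each counted as (match score - T)."""
    n, m = len(x), len(y)
    # Rows 0..n follow x; row n+1 only uses column 0 (assumed final step).
    F = [[0] * (m + 1) for _ in range(n + 2)]
    for i in range(1, n + 2):
        # F(i,0): carry on unmatched, or close a match that ended in row i-1.
        F[i][0] = max([F[i - 1][0]] + [F[i - 1][j] - T for j in range(1, m + 1)])
        if i == n + 1:
            break
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i][0],  # start a new match from the running total
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    return F[n + 1][0]

def simple_score(a, b):
    """Hypothetical scoring: +2 for a match, -2 for a mismatch."""
    return 2 if a == b else -2

# Two copies of ACGT, each scoring 8; with T = 5 each contributes 3.
print(repeated_matches_score("ACGTTTTACGT", "ACGT", simple_score, d=3, T=5))  # 6
```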

28 Matrix (threshold T = 20)

29 Overlap Matches
- Find a match between the start of a sequence and the end of a sequence (can be the same sequence).
- The alignment begins on the left-hand or top border of the matrix and ends on the right-hand or bottom border.

30 Overlap Matches
$F(0,j) = 0$ for $j = 1,\ldots,m$
$F(i,0) = 0$ for $i = 1,\ldots,n$
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
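
A sketch of the overlap score under these boundary conditions. It assumes the best overlap is read off as the maximum over the last row and last column of the matrix, in line with an alignment that ends on the bottom or right-hand border; `simple_score` is again a hypothetical scoring function.

```python
def overlap_score(x, y, score, d):
    """Best score of an alignment that starts on the top/left border
    and ends on the bottom/right border of the matrix."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # F(i,0) = F(0,j) = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Assumed end condition: best cell in the last row or the last column.
    return max(max(F[n]), max(row[m] for row in F))

def simple_score(a, b):
    """Hypothetical scoring: +1 for a match, -1 for a mismatch."""
    return 1 if a == b else -1

# The end of x (ACGT) overlaps the start of y (ACGT).
print(overlap_score("GGGACGT", "ACGTCCC", simple_score, d=2))  # 4
```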

31 Matrix

32 Hybrid Match Conditions
Different types of alignment can be created by:
- adjusting the right-hand side of the formula F(i,j) = max{…}
- adjusting the traceback
Example:
- We want to align two sequences from the beginning of both sequences until a local alignment has been found.

33 Summary
- Probability theory is important for sequence analysis.
- Goal: determine whether two sequences are related.
- For that, we need to find an optimal alignment between the sequences using algorithms.
- A scoring model is required to rank different alignments.
- Different algorithms exist for different types of alignment; they use dynamic programming.

