Presentation is loading. Please wait.

Presentation is loading. Please wait.

Sequence Alignment.

Similar presentations


Presentation on theme: "Sequence Alignment."— Presentation transcript:

1 Sequence Alignment

2 OUTLINE Sequence Alignment
Scoring Alignments and Substitution Matrices Inserting Gaps Dynamic Programming Database Searches

3 Sequence Alignment Comparing sequences for
Similarity Homology Prediction of function of genes and proteins Construction of phylogeny Finding motifs

4 Sequence Alignment - HOMOLOGY
Orthologues : any gene pairwise relation where the ancestor node is a speciation event. Often have similar function Paralogues : any gene pairwise relation where the ancestor node is a duplication event. Paralogs tend to have different functions

5 Sequence Alignment - HOMOLOGY

6 Sequence Alignment - HOMOLOGY

7 Sequence Alignment - PHYLOGENY

8 Sequence Alignment – PROTEIN FUNCTIONS

9 Sequence Alignment – FINDING MOTIFS

10 Scoring Alignments and Substitution Matrices
The quality of an alignment is measured by giving it a quantitative score The simplest way of quatifying similarity between two sequences is percentage identity. Simply measured by counting the number of identical bases or amino acids matched between the aligned sequences.

11 Scoring Alignments and Substitution Matrices
The dot-plot gives a visual assesment of similarity based on identity. [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

12 Scoring Alignments and Substitution Matrices
Percentage identity is a relatively crude measure and does bot give a complete picture of the degree of similarity of two sequences. Scoring identical matches 1 and mismatches as 0 ignores the fact that the type of amino acids involved is highly significant.

13 Scoring Alignments and Substitution Matrices
Genuine matches may not be identical: Seq1: T H I S I S A S E Q U E N C E Seq1: T H A T _ _ _ S E Q U E N C E Isoleucine – Alanine: both hydrophobic Serine – Threonine : both polar

14 Scoring Alignments and Substitution Matrices
Scoring pairs of amino acids: with similar properties  higher scores With different properties  lower scores

15 Scoring Alignments and Substitution Matrices
To assign scores for alignmens use SUBSTITUTION MATRICES [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

16 Scoring Alignments and Substitution Matrices
Different types of substitution matrices are being used based on: The number of mutations required for convertion of one amino acid to the other Similarities in physicochemical properties.

17 Scoring Alignments and Substitution Matrices
PAM substitution matrices: Use closely related protein sequences to derive substitution frequencies Accepted Point Mutations per 100 residues 250 PAM  250 mutation on 100 residues

18 Scoring Alignments and Substitution Matrices
BLOSUM substitution matrices: BLOcks of Amino Acid SUbstitution Matrix Use mutation data from highly conserved local regions BLOSUM 62  62% identity

19 Scoring Alignments and Substitution Matrices
Which matrix to use ? Depends on the problem properties, Distantly related sequences : PAM 250 – BLOSUM 50 Closely related sequences: PAM 120, BLOSUM 80

20 Scoring Alignments and Substitution Matrices
Which matrix to use ? Some special purpose matrices (SLIM and PHAT are designed for membrane proteins) The length of the sequende is important Short sequences  PAM 40 or BLOSUM 80 Long sequences  PAM 250 or BLOSUM 50

21 Scoring Alignments and Substitution Matrices
BLOSUM – and PAM 120 [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

22 Inserting Gaps Gap insertion requires a scoring penalty (gap penalty).
To achieve correct matches gaps are required Alignment programs use gap penalties to limit the introduction of gaps in the alignments

23 Inserting Gaps Insertions tend to be several residues long rather than just a single residue long Fewer insertions and deletions occur in sequences of structural importance Smaller penalty on lengthening an existing gap (gap extension penalty) than introducing a new gap Gap penaly is high  the number of gaps will be decreased Gap penalty is low  more and large gaps will be inserted.

24 Inserting Gaps Choosing gap penalties: Linear Affine Gap open penalty
Gap extension penlty

25 Dynamic Programming Global and Local alignments
Pairwise and Multiple alignments [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

26 Dynamic Programming For a pair of sequences there is a large number of possible alignments. 2 sequences of length 1000 have appriximately different alignments.

27 Dynamic Programming Dynamic Programming:
Problem can be divided into many smaller parts. Optimal alignment will not contain parts that are not themselves optimal. Start from sufficiently short sub-sequences. Alignement is additive:

28 Dynamic Programming Needleman and Wunsch were the first to propose this method. Find optimal global alignments. Align sequences: Seq1: x (x1x2x3…xm) Seq1: y (y1y2y3…yn)

29 Dynamic Programming s(a,b) = score of aligning a and b
F(i,j) = optimal similarity of X(1:i) and Y(1:j) Recurrence relation: – F(i,0) = Σ s(X(k), gap), 0 <= k <= i – F(0,j) =Σ s(gap, B(k)), 0 <= k <= j – F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] – Assume linear gap penalty

30 Dynamic Programming Sequences: Gaps: Scoring matrix (BLOSUM-62) 
X = THISLINE, Y = ISALIGNED Gaps: Linear gap penalty (E=8) Scoring matrix (BLOSUM-62) 

31 Dynamic Programming Matrix S of optimal scores of sub-sequence alignments. [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

32 Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

33 Dynamic Programming S(I, T) = -1,

34 Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

35 Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

36 Dynamic Programming I -- H I -- gap gap -- H
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

37 Dynamic Programming S(I, H) = -3, S(I, gap) = -8, S(gap, H) = -8
Recurrence relation: F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

38 Dynamic Programming S(I, H) = -3, S(I, gap) = -8, S(gap, H) = -8
Recurrence relation: F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)] [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

39 Dynamic Programming [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

40 Dynamic Programming Linear gap penalty (E=4)
[“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

41 Dynamic Programming Example:

42 Dynamic Programming Example (cont.):

43 Dynamic Programming Example (cont.):

44 Dynamic Programming Semi – global alignment:
When we treat terminal gaps differently than internal gaps How to modify dynamic programming to be able to make semi – global alignment ?

45 Dynamic Programming Local alignment:
If we compare a sequence to whole genome Find sub-strings whose optimal global alignment value is maximum

46 Dynamic Programming

47 Dynamic Programming What is the difference between global and local alignment ? Can we define the recuernce relation of local alignment similar to global alignment ?

48 Dynamic Programming Recurrence relation of GLOBAL ALIGNMENT:
(Needleman & Wunsch) – F(i,0) = Σ s(X(k), gap), 0 <= k <= i – F(0,j) =Σ s(gap, B(k)), 0 <= k <= j – F(i,j) = max [ F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)]

49 Dynamic Programming Recurrence relation of LOCAL ALIGNMENT:
(Smith-Waterman) – F(i,0) = 0 – F(0,j) = 0 – F(i,j) = max [ 0, F(i,j-1) + s(gap,Y(j), F(i-1,j) + s(X(i),gap), F(i-1, j-1) + s(X(i), Y(j)]

50 Dynamic Programming

51 Dynamic Programming

52 Database Searches FASTA and BLAST Use some heuristics
Dynamic Programming Complexity Time O(n*m) Space O(n*m)

53 Database Searches FASTA
Good local alignment should have some exact match subsequence. Find all k-tuples. (k=1-2 for proteins, 3-6 for DNA sequences) Protein k – tuples  nc, sp, … (k = 2) Nucleotide k – tuples  TAAA, CTCC,…(k = 4)

54 Database Searches FASTA
If k = 3 for nucleotide sequences. There will be 64 possible k – tuples Assign a number e( ): e(A) = 0, e(C) = 1, e(G) = 2, e(T) = 3 Each 3 – tuples are represented as  xi xi+1xi+2 Assign a number to each 3 – tuple Ci = e(xi)42 + e(xi+1)41 + e(xi+2)40 For example: AAA  AAA  = 0 CAA  = 16

55 Database Searches FASTA
Find each occurance of k – tuples in the sequences. Chaining  Look – Up Tables Consider TAAAACTCTAAC (if k = 3): 3 - tuples Position AAA (0) 2, 3 AAC (1) 4, 10 AAG (2) AAT (3)

56 Database Searches FASTA
Find hot – spots on same diagonal [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

57 Database Searches FASTA
Consider the highes scoring alignments [“Understanding Bioinformatics”, M. Zvelebil, J. O. Baum]

58 Database Searches FASTA
Join the best ones. You can use DP

59 Database Searches BLAST
Use short words to search the database sequence. Searches for k – mers that will score above a threshold (T) value when aligned with query k - mer (Remember FASTA looks for k – tuples which are identical). Use a scheme based on finite state automata (Remember FASTA use hashing and chaining fot rapid identification of k - tuples)

60 Database Searches BLAST
From Query Sequence, create query words (for protein sequences word size is 3)

61 Database Searches BLAST
Blast uses a list of high scoring words created from words similar to query words. Considers the words with a score bigger than a threshold value.

62 Database Searches BLAST
Scan each database sequence for an exact match to the list of words. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S".

63 Database Searches BLAST
Keep only the extended matches that have a score at least S. Determine statistical significance of each remaining match.

64 Database Searches BLAST
Database Searches BLAST

65 Database Searches BLAST

66 Database Searches BLAST

67 Database Searches BLAST

68 Database Searches BLAST

69 Database Searches BLAST
[Loxodonta cyclotis] [Elephas maximus maximus] [Elephas maximus indicus]

70 Database Searches HISTORY
1970: NW 1980: SW 1985: FASTA 1989: BLAST

71 References M. Zvelebil, J. O. Baum, “Understanding Bioinformatics”, 2008, Garland Science Andreas D. Baxevanis, B.F. Francis Ouellette, “Bioinformatics: A practical guide to the analysis of genes and proteins”, 2001, Wiley. NCBI HelpDesk - Field Guide Course Materials


Download ppt "Sequence Alignment."

Similar presentations


Ads by Google