Sequences and sequence alignment
Two main kinds of sequences
–Sequence of base pairs in a DNA molecule: (A+T+C+G)*
–Sequence of amino acids in a protein molecule: A(C+D+E+F+G+H+I+K+L+M+N+P+Q+R+S+T+V+W+X+Y)*Z
Two main kinds of sequence alignment
–Global alignment
  LGPSSKQTGKGS-SRIWDN
  |    |  |||   |  |
  LN-ITKSAGKGAIMRLGDA
–Local alignment
  ----------TGKG------------------
             |||
  ----------AGKG------------------
Importance of sequence alignment
Useful for discovering functional, structural and evolutionary information.
Functional
–DNA molecules that are very much alike, or 'similar' in sequence analysis parlance, probably have the same regulatory role.
–Protein molecules that are very much alike probably have the same biochemical function.
Structural
–Protein molecules that are very much alike probably have the same 3-D structure.
Evolutionary
–If two sequences from different organisms are similar, then there may have been a common ancestor sequence, and the sequences are then defined as being homologous.
–The alignment indicates the changes that could have occurred between the two homologous sequences and a common ancestor sequence during evolution.
Some terminology
Homologous: genes that descended from a common ancestor are called homologs.
–Sequence homology is different from sequence similarity.
–The latter is a measure of the matching characters in an alignment.
–Statements such as 'the sequences show 50% homology' or 'the sequences are highly homologous' are meaningless: homology is a yes-or-no property, not a quantity.
–Orthologous: homologs separated when a lineage splits into two species.
–Paralogous: homologs produced when a gene is duplicated within a genome.
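To make the idea of similarity as "matching characters in an alignment" concrete, here is a minimal sketch that counts identical columns in a gapped pairwise alignment. The function name percent_identity is hypothetical, and dividing by the full alignment length (gap columns included) is an assumption; other conventions for the denominator exist.

```python
def percent_identity(a, b):
    """Count identical, non-gap columns of a gapped pairwise alignment.

    Assumption: the percentage is taken over the full alignment length,
    gap columns included. Other denominators are also in use.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # A column is a match when both rows hold the same residue (not a gap).
    matches = sum(1 for p, q in zip(a, b) if p == q and p != "-")
    return matches, 100.0 * matches / len(a)


# The two rows of a short global alignment, gaps written as "-".
matches, pct = percent_identity("LGPSSKQTGKGS-SRIWDN", "LN-ITKSAGKGAIMRLGDA")
print(matches, round(pct, 1))  # 7 matching columns out of 19, i.e. 36.8
```

A number like this describes the alignment, not the evolutionary relationship: it can be reported as percent identity or similarity, but not as "percent homology".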
Global alignment: Needleman-Wunsch algorithm
A dynamic programming algorithm
Input
–Two strings x and y, of length n and m respectively.
–A scoring table s(a, b) over the sequence alphabet, and a gap penalty d.
Output: the alignment with the best score
Algorithm terminology
–F(i, j): the score of the best alignment between the initial segments x_1…x_i and y_1…y_j.
–Boundary values: F(0, 0) = 0; F(i, 0) = -i·d; F(0, j) = -j·d, where d is the gap penalty.
–F(i, j) is the maximum of
  F(i-1, j-1) + s(x_i, y_j), the matching score between x_i and y_j
  F(i-1, j) - d
  F(i, j-1) - d
Algorithm steps
–Fill the table in an order that makes F(i-1, j-1), F(i-1, j) and F(i, j-1) available before F(i, j), for example row by row.
–While filling F(i, j), keep an arrow to the cell used in deriving F(i, j).
–After F(n, m) is determined, trace back along the arrows and construct the alignment.
Complexity of the algorithm: O(nm). If n = m, then O(n^2).
Note: with biological sequences and standard computers, O(n^2) algorithms are feasible but a little slow, while O(n^3) algorithms are only feasible for very short sequences.
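The steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the names needleman_wunsch, score and PAIR_SCORES are hypothetical, the score table holds only the BLOSUM50 entries for the six residues of the example sequences HEAGAWGHEE and PAWHEAE, and the traceback recomputes which move produced each cell instead of storing arrows. Ties are broken in favour of the diagonal move, which is an arbitrary choice.

```python
# BLOSUM50 entries for residues A, E, G, H, P, W only (symmetric, stored once).
PAIR_SCORES = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(a, b):
    """Substitution score s(a, b), looked up symmetrically."""
    return PAIR_SCORES[a + b] if a + b in PAIR_SCORES else PAIR_SCORES[b + a]


def needleman_wunsch(x, y, d=8):
    """Global alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    # F[i][j] = best score of aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d          # boundary: x[:i] against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * d          # boundary: y[:j] against gaps only
    for i in range(1, n + 1):     # row by row, so all three neighbours exist
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # x_i with y_j
                F[i - 1][j] - d,                              # x_i with a gap
                F[i][j - 1] - d,                              # y_j with a gap
            )
    # Traceback from F[n][m] to F[0][0], building the alignment in reverse.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


best, row_x, row_y = needleman_wunsch("HEAGAWGHEE", "PAWHEAE", d=8)
print(best)   # 1
print(row_x)  # HEAGAWGHE-E
print(row_y)  # --P-AW-HEAE
```

Storing the whole table costs O(nm) memory as well as O(nm) time, which is usually acceptable for a pair of protein sequences.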
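Part of the BLOSUM50 scoring matrix
      H   E   A   G   A   W   G   H   E   E
P    -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A    -2  -1   5   0   5  -3   0  -2  -1  -1
W    -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H    10   0  -2  -2  -2  -3  -2  10   0   0
E     0   6  -1  -3  -1  -3  -3   0   6   6
A    -2  -1   5   0   5  -3   0  -2  -1  -1
E     0   6  -1  -3  -1  -3  -3   0   6   6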
Illustration of Needleman-Wunsch
Initial table: boundary values for gap penalty d = 8
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8
A   -16
W   -24
H   -32
E   -40
A   -48
E   -56
Local alignment: Smith-Waterman algorithm
Closely related to the global alignment algorithm, with a few differences.
–The top row and left column are now filled with 0s.
–F(i, j) is the maximum of
  0 (which means starting a new alignment)
  F(i-1, j-1) + s(x_i, y_j)
  F(i-1, j) - d
  F(i, j-1) - d
–Instead of taking the value in the bottom-right corner, F(n, m), as the best score, we look for the highest value of F(i, j) over the whole matrix and start the traceback from there.
–Traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment.
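The differences from the global algorithm are small enough to show in code. The sketch below is illustrative only: smith_waterman, score and PAIR_SCORES are hypothetical names, the score table holds just the BLOSUM50 entries for the residues in HEAGAWGHEE and PAWHEAE, and the gap penalty d = 8 is an assumed value. If several cells share the highest score, the first one found in row order is used, which is an arbitrary choice.

```python
# BLOSUM50 entries for residues A, E, G, H, P, W only (symmetric, stored once).
PAIR_SCORES = {
    "AA": 5, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 6, "EG": -3, "EH": 0, "EP": -1, "EW": -3,
    "GG": 8, "GH": -2, "GP": -2, "GW": -3,
    "HH": 10, "HP": -2, "HW": -3,
    "PP": 10, "PW": -4,
    "WW": 15,
}


def score(a, b):
    """Substitution score s(a, b), looked up symmetrically."""
    return PAIR_SCORES[a + b] if a + b in PAIR_SCORES else PAIR_SCORES[b + a]


def smith_waterman(x, y, d=8):
    """Local alignment with linear gap penalty d; returns (score, row_x, row_y)."""
    n, m = len(x), len(y)
    # Top row and left column stay 0: an alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,                                            # start a new alignment
                F[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # x_i with y_j
                F[i - 1][j] - d,                              # x_i with a gap
                F[i][j - 1] - d,                              # y_j with a gap
            )
            if F[i][j] > best:                                # track the highest cell
                best, best_i, best_j = F[i][j], i, j
    # Traceback from the highest cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_i, best_j
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


best, row_x, row_y = smith_waterman("HEAGAWGHEE", "PAWHEAE", d=8)
print(best)   # 28
print(row_x)  # AWGHE
print(row_y)  # AW-HE
```

Because every cell is clamped at 0, the traceback loop never reaches the top row or left column with a positive value, so no separate boundary check is needed.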
Illustration of Smith-Waterman
Initial table: top row and left column filled with 0s
         H   E   A   G   A   W   G   H   E   E
     0   0   0   0   0   0   0   0   0   0   0
P    0
A    0
W    0
H    0
E    0
A    0
E    0