1 1 Pairwise Sequence Alignment

2 2 Biological motivation Main algorithms for pairwise sequence alignment ATTGCGTCGATCGCAC-GCACGCT ATTGCAGTG-TCGAGCGTCAGGCT CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT Global alignment

3 3 Biological motivation Main algorithms for pairwise sequence alignment ATTGCGTCGATCGCAC-GCACGCT ATTGCAGTG-TCGAGCGTCAGGCT CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT Local alignment

4 4 Discover function Sequences that are similar probably have the same function

5 5 Study evolution If two sequences from different organisms are similar, they may have had a common ancestor

6 6 Find crucial features –Regions that are strongly conserved between different sequences can indicate functional importance. Example: conservation of IGFALS (insulin-like growth factor binding protein, acid-labile subunit) between human and mouse.

7 7 Identify cause of disease –Comparison of sequences between individuals can detect changes that are related to diseases

8 8 Sickle Cell Anemia Due to a single base substitution, an A swapped for a T, which causes the amino acid incorporated into hemoglobin to be valine instead of glutamic acid Image source: http://www.cc.nih.gov/ccc/ccnews/nov99/

9 9 Healthy Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GG A GAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTP E EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH

10 10 Diseased Individual >gi|28302128|ref|NM_000518.4| Homo sapiens hemoglobin, beta (HBB), mRNA ACATTTGCTTCTGACACAACTGTGTTCACTAGCAACCTCAAACAGACACCATGGTGCATCTGACTCCTGA GG T GAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTGAACGTGGATGAAGTTGGTGGTGAGGCCCTGGGC AGGCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCCTGATG CTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCTTTAGTGATGGCCTGGC TCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGAT CCTGAGAACTTCAGGCTCCTGGGCAACGTGCTGGTCTGTGTGCTGGCCCATCACTTTGGCAAAGAATTCA CCCCACCAGTGCAGGCTGCCTATCAGAAAGTGGTGGCTGGTGTGGCTAATGCCCTGGCCCACAAGTATCA CTAAGCTCGCTTTCTTGCTGTCCAATTTCTATTAAAGGTTCCTTTGTTCCCTAAGTCCAACTACTAAACT GGGGGATATTATGAAGGGCCTTGAGCATCTGGATTCTGCCTAATAAAAAACATTTATTTTCATTGC >gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens] MVHLTP V EKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN ALAHKYH

11 11 Sequence Modifications Three types of mutation –Substitution (point mutation) –Insertion –Deletion TCAGTTCGAGT TCCGT TCGT TCAGT Indel (replication slippage)

12 12 How do we quantitate similarity?

13 13 Scoring Similarity Assume independent mutation model –Each site considered separately Score at each site –Positive if the same –Negative if different Sum to make final score –Can be positive or negative –Significance depends on sequence length GTAGTC CTAGCG

14 14 Substitutions Only Pretend there are no indels –Sequences compared base-by-base –Count the number of matches and mismatches –Matches score +2, mismatches score -1 TTCGTCGTAGTCGGCTCGACCTG GTACGTCTAGCGAGCGTGATCCT 9 matches: +18; 14 mismatches: -14; total score: +4. A weak match
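The substitution-only score on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the +2 / -1 scheme from the slide; the function name `score_ungapped` is illustrative and not part of the presentation.

```python
def score_ungapped(a, b, match=2, mismatch=-1):
    """Score two equal-length sequences site by site, ignoring indels."""
    if len(a) != len(b):
        raise ValueError("substitution-only scoring needs equal-length sequences")
    # Each site contributes independently: +2 for a match, -1 for a mismatch.
    return sum(match if x == y else mismatch for x, y in zip(a, b))


# The two 23-base sequences from the slide: 9 matches and 14 mismatches.
print(score_ungapped("TTCGTCGTAGTCGGCTCGACCTG", "GTACGTCTAGCGAGCGTGATCCT"))  # 4
```

Because every site is scored separately, the total is just a sum, which is what makes the later dynamic-programming approach possible.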

15 15 Including Indels Create an ‘alignment’ –Count matches within alignment –Required if sequences are different lengths TT-CGTCGTAGTCG-GC-TCGACC-TG GTACGTC-TAG-CGAGCGT-GATCCT- 17 matches: +34; 2 mismatches: -2; 8 indels: -8; total score: +24. A strong match
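Scoring a given gapped alignment is as simple as the substitution-only case. The sketch below assumes the scheme implied by the slide totals (match +2, mismatch -1, indel -1); `score_alignment` is a hypothetical helper name.

```python
def score_alignment(top, bottom, match=2, mismatch=-1, indel=-1):
    """Score a gapped alignment given as two equal-length rows ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            total += indel      # one letter is aligned against a gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("TT-CGTCGTAGTCG-GC-TCGACC-TG",
                      "GTACGTC-TAG-CGAGCGT-GATCCT-"))  # 24
```

The same function gives 0 for the alternative alignment -TTCGT-CGTAGTC-GGCTCG-ACCTG over GTAC-GTCTA-GCGAGCGT-GATCC-T, so the score depends on where the gaps are placed, not only on the sequences.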

16 16 Choosing an Alignment Many different alignments are possible –Should consider all possible alignments –Take the best score found –There may be more than one best alignment TT-CGTCGTAGTCG-GC-TCGACC-TG GTACGTC-TAG-CGAGCGT-GATCCT- (score +24) -TTCGT-CGTAGTC-GGCTCG-ACCTG GTAC-GTCTA-GCGAGCGT-GATCC-T (score 0)

17 17 Why is it hard? Alignment (without gaps) requires an algorithm that performs a number of comparisons roughly proportional to the square of the average sequence length. If we include gaps, the number of possible alignments to compare becomes astronomical.
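To get a feel for "astronomical", the sketch below counts gapped alignments of two sequences. It rests on two assumptions not stated in the slides: no column may hold two gaps, and alignments that differ only in how adjacent gaps are ordered are counted separately. Under those assumptions the count follows a three-way recurrence (the Delannoy numbers); `count_alignments` is an illustrative name.

```python
def count_alignments(m, n):
    """Count gapped alignments of sequences of lengths m and n."""
    # counts[i][j] = number of ways to align the first i letters of one
    # sequence with the first j letters of the other.
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (counts[i - 1][j - 1]   # last column pairs two letters
                            + counts[i - 1][j]     # last column: letter over gap
                            + counts[i][j - 1])    # last column: gap over letter
    return counts[m][n]


for length in (1, 2, 3, 23):
    print(length, count_alignments(length, length))
```

Two sequences of length 3 already have 63 alignments, and the count grows exponentially with length, so enumerating every alignment of the 23-base example is not practical.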

18 18 Algorithms for pairwise alignments Dot Plots – Gibbs and McIntyre 1970 Dynamic Programming: Local alignment: Smith-Waterman; Global alignment: Needleman-Wunsch

19 19 Dot Plots Early method Sequences at top and left Dots indicate matched bases Diagonal series show matched regions [Figure: dot plot with GTAGTCGG across the top and TAGCGAGC down the left; the diagonal runs correspond to TAGTCG aligned with TAG-CG]
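A dot plot is easy to generate as text. The sketch below is a minimal illustration using the two sequences from the figure; `dot_plot` is a hypothetical helper, and '*' stands in for a dot.

```python
def dot_plot(top, left):
    """Return one text row per letter of `left`; '*' marks a matching base."""
    return ["".join("*" if t == s else "." for t in top) for s in left]


top, left = "GTAGTCGG", "TAGCGAGC"
print("  " + top)
for letter, row in zip(left, dot_plot(top, left)):
    print(letter, row)
```

Runs of '*' along a diagonal correspond to stretches where the two sequences match base for base, such as the shared TAG here.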

20 20 Dynamic Programming A method for reducing a complex problem to a set of identical sub-problems The best solution to one sub-problem is independent of the best solution to the other sub-problems

22 22 What does it mean? If a path from X→Z passes through Y, the best path from X→Y is independent of the best path from Y→Z

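23 23 Example Sequences: A = ACGCTG, B = CATGT [Figure: empty table with A C G C T G across the top and C A T G T down the left; the score of the best alignment is still unknown (?)]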

24 24 Example Sequences: A = ACGCTG, B = CATGT Match: +2, Other: -1 Score of best alignment between AC and CATG: -1 …between AC and CATGT: -2 …between ACG and CATG: 2 Calculate score between ACG and CATGT: ?

25 25 Needleman-Wunsch Example Three ways to reach a cell: –Insertion in the first sequence –Align the next letter in sequence 1 and 2 –Insertion in the second sequence

26 26 Needleman-Wunsch Example Sequences: A = ACGCTG, B = CATGT Three candidate scores for the cell ACG against CATGT: –-1 from before plus -1 for mismatch of G against T gives -2 –2 from before plus -1 for mismatch of - against T gives 1 –-2 from before plus -1 for mismatch of G against - gives -3 Cell gets the highest score of -2, 1, -3, which is 1
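The single-cell update on this slide can be written as one `max` over three candidates. This is a sketch under the slide's scheme (match +2, everything else -1), with A across the top and B down the left so that "above" is the cell for ACG against CATG and "left" is the cell for AC against CATGT; `cell_score` and its parameter names are illustrative.

```python
def cell_score(diagonal, above, left, a, b, match=2, other=-1):
    """Best score for a cell, given its three already-computed neighbours."""
    pair = match if a == b else other
    return max(diagonal + pair,   # align letter a with letter b
               above + other,     # align a gap in A with letter b
               left + other)      # align letter a with a gap in B


# Neighbours from the slide: diagonal -1, above 2, left -2; letters G and T.
print(cell_score(diagonal=-1, above=2, left=-2, a="G", b="T"))  # 1
```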

27 27 Needleman-Wunsch Example Sequences: A = ACGCTG, B = CATGT The cell for ACG against CATGT is filled in with 1, next to its neighbours -1, -2 and 2

28 28 Empty scoring table. Sequence A (A1 C2 G3 C4 T5 G6) runs across the top and sequence B (C1 A2 T3 G4 T5) down the left. The top-left cell scores 0.

29 29 First cell of the top row: A over - scores -1.

30 30 Top row filled in: 0 -1 -2 -3 -4 -5 -6. The last cell corresponds to ACGCTG over ------.

31 31 Left column filled in: 0 -1 -2 -3 -4 -5. The last cell corresponds to ----- over CATGT.

32 32 Cell (A1, C1): A over C is a mismatch and scores -1.

33 33 Cell (C2, C1): AC over -C scores 1.

34 34 Cell (G3, C1): ACG over -C- scores 0.

35 35 Cell (C4, C1): two alignments tie at -1, ACGC over -C-- and ACGC over ---C.

36 36 Row C1 complete (-1 -1 1 0 -1 -2 -3); row A2 filled as far as G3 (-2 1 0 0). Cell (G3, A2): ACG over -CA scores 0.

37 37 Completed table (match +2, other -1):

            A1  C2  G3  C4  T5  G6
         0  -1  -2  -3  -4  -5  -6
    C1  -1  -1   1   0  -1  -2  -3
    A2  -2   1   0   0  -1  -2  -3
    T3  -3   0   0  -1  -1   1   0
    G4  -4  -1  -1   2   1   0   3
    T5  -5  -2  -2   1   1   3   2

38 38 The completed table again. The bottom-right cell, 2, is the score of the best global alignment.

39 39 Traceback: only the cells that lie on a best-scoring path from the bottom-right cell back to the top-left cell are kept (other cells shown as .):

            A1  C2  G3  C4  T5  G6
         0  -1   .   .   .   .   .
    C1  -1   .   1   0   .   .   .
    A2   .   1   .   0  -1   .   .
    T3   .   .   0   .   .   1   .
    G4   .   .   .   2   1   .   3
    T5   .   .   .   .   .   3   2

40 40 First traceback path: ACGCTG- over -C-ATGT (score 2).

41 41 Second traceback path: ACGCTG- over -CA-TGT (score 2).

42 42 Third traceback path: -ACGCTG over CATG-T- (score 2).

43 43 Summary: Needleman-Wunsch Alignment Global alignment between sequences –Compare entire sequence against another Create scoring table –Sequence A across top, B down left Cell at column i and row j contains the score of the best alignment between the first i elements of A and the first j elements of B –Global alignment score is the bottom right cell
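The summary above can be turned into a short program. The following is a minimal sketch rather than a reference implementation: it assumes the example's linear scheme (match +2, mismatch and gap both -1), and when several moves tie during traceback it prefers the diagonal, then the move from above, so it reports only one of the equally good alignments. The function name `needleman_wunsch` and its return shape are choices made for this illustration.

```python
def needleman_wunsch(a, b, match=2, other=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    cols, rows = len(a), len(b)
    # table[j][i]: best score for the first i letters of a (across the top)
    # against the first j letters of b (down the left).
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, cols + 1):
        table[0][i] = table[0][i - 1] + other
    for j in range(1, rows + 1):
        table[j][0] = table[j - 1][0] + other
    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            pair = match if a[i - 1] == b[j - 1] else other
            table[j][i] = max(table[j - 1][i - 1] + pair,   # align two letters
                              table[j - 1][i] + other,      # gap in a
                              table[j][i - 1] + other)      # gap in b

    # Traceback from the bottom-right cell to the top-left cell.
    out_a, out_b = [], []
    i, j = cols, rows
    while i > 0 or j > 0:
        pair = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else other
        if i > 0 and j > 0 and table[j][i] == table[j - 1][i - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and table[j][i] == table[j - 1][i] + other:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
        else:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
    return table[rows][cols], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("ACGCTG", "CATGT")
print(score)
print(top)
print(bottom)
```

For A = ACGCTG and B = CATGT this reports a score of 2, the bottom-right cell of the example table. Filling the table takes time proportional to the product of the two sequence lengths, which is what makes the search over all gapped alignments tractable.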

44 44 Global vs. Local alignment DOROTHY HODGKIN Global alignment: DOROTHY--------HODGKIN DOROTHYCROWFOOTHODGKIN Local alignment:

45 45 Global Alignment versus Local Alignment ATTGCAGTG-TCGAGCGTCAGGCT ATTGCGTCGATCGCAC-GCACGCT Global Alignment Local Alignment CATATTGCAGTGGTCCCGCGTCAGGCT TAAATTGCGT-GGTCGCACTGCACGCT

46 46 Local Alignment Best score for aligning part of sequences –Often beats global alignment score Similar algorithm: Smith-Waterman –Table cells never score below zero

47 47 Local Alignment How do we do it? 1. We can start a new match instead of extending a previous alignment. –This means that at each cell we can start calculating the score from 0 (even if this means ignoring the prefix). –We do this only if it is better than the alternative (that is, only if the alternative is negative). 2. Instead of looking only at the far corner, we look anywhere in the table for the best score (even if this means ignoring the suffix).
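The two changes described above (restart from 0, and take the best cell anywhere in the table) can be sketched as follows. This is an illustrative Smith-Waterman variant under the same assumed scheme as the global example (match +2, mismatch and gap both -1); the function name and the test strings, borrowed from the DOROTHY HODGKIN slide, are choices made for this example.

```python
def smith_waterman(a, b, match=2, other=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    cols, rows = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]  # borders stay at 0
    best, best_pos = 0, (0, 0)
    for j in range(1, rows + 1):
        for i in range(1, cols + 1):
            pair = match if a[i - 1] == b[j - 1] else other
            table[j][i] = max(0,                            # start a new match
                              table[j - 1][i - 1] + pair,
                              table[j - 1][i] + other,
                              table[j][i - 1] + other)
            if table[j][i] > best:                          # best cell anywhere
                best, best_pos = table[j][i], (i, j)

    # Traceback from the best cell until the score drops to 0.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and table[j][i] > 0:
        pair = match if a[i - 1] == b[j - 1] else other
        if table[j][i] == table[j - 1][i - 1] + pair:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif table[j][i] == table[j - 1][i] + other:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
        else:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("DOROTHYCROWFOOTHODGKIN", "HODGKIN"))
```

Here the local alignment picks out the shared HODGKIN stretch and ignores the unmatched prefix, which a global alignment would have to pay for with gaps.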

