The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs.

1 The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs

2 The Basic Local Alignment Search Tool (BLAST) A Y W T Y I V A L T – Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y Most local alignments contain highly conserved sections without gaps

3 The Basic Local Alignment Search Tool (BLAST) A Y W T Y I V A L T – Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y -> search for high scoring segment pairs (HSP), i.e. gap-free local alignments

4 The Basic Local Alignment Search Tool (BLAST)

5 A Y W T Y I V A L T – Q V R Q Y E A T S I L C I V M I Y S R A - Q Y R Y W R Y Advantages: (a) speed (b) statistical theory about HSP exists.

6 The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs (2) Use word pairs as seeds

7 Pair-wise sequence alignment T W L M H C A Q Y I C I M X H X C X T H Y (1) Search word pairs of length 3 with score > T, Use them as seeds.

8 Pair-wise sequence alignment Naïve algorithm would have a complexity of O(l1 * l2). Solution: preprocess the query sequence. Compile a list of all words that have a score > T when aligned to a word in the query.

9 Pair-wise sequence alignment Naïve algorithm would have a complexity of O(l1 * l2). Solution: preprocess the query sequence. Compile a list of all words that have a score > T when aligned to a word in the query. Complexity: O(l1). Organize words in an efficient data structure (tree) for fast look-up.
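
The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the BLAST implementation: the scoring function (+5 match, -2 mismatch), the threshold value 7 and the function names are assumptions made for the example, and a Python dict stands in for the tree mentioned on the slide.

```python
from itertools import product

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def pair_score(a, b):
    # Toy substitution score (assumption): +5 for a match, -2 for a mismatch.
    return 5 if a == b else -2


def word_score(u, v):
    # Gap-free score of two words of equal length.
    return sum(pair_score(a, b) for a, b in zip(u, v))


def build_neighborhood(query, w=3, threshold=7):
    """Map every word that scores > threshold against a query word
    to the list of query positions where that happens."""
    table = {}
    candidates = ["".join(p) for p in product(ALPHABET, repeat=w)]
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in candidates:
            if word_score(qword, cand) > threshold:
                table.setdefault(cand, []).append(i)
    return table


def find_seeds(query, subject, w=3, threshold=7):
    """Scan the subject once; each word is looked up in the table."""
    table = build_neighborhood(query, w, threshold)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], []):
            seeds.append((i, j))  # (query position, subject position)
    return seeds


print(find_seeds("TWLMHCAQY", "ICIMHCTHY"))
```

With these toy scores a word pair passes the threshold only if it has at most one mismatch, so the scan reports the three overlapping seeds around the shared M-H-C.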

10 The Basic Local Alignment Search Tool (BLAST) Rapid data base search tool (1990) Idea: (1) Search for high scoring segment pairs (2) Use word pairs as seeds (3) Extend seed alignments until score drops below threshold value

11 Pair-wise sequence alignment T W L M H C A Q Y I C I M X H X C X T H Y Extend seeds until score drops by X.

12 Pair-wise sequence alignment T W L M H C A Q Y I C I X M X H X C X T X H X Y Extend seeds until score drops by X.
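
A gap-free extension with an X-drop rule might look like the sketch below. The scores (+5 / -2), the value X = 6 and the function names are illustrative assumptions; the extension stops on each side once the running score has fallen more than X below the best score seen so far, and the segment is trimmed back to that best point.

```python
def pair_score(a, b):
    # Toy substitution score (assumption): +5 match, -2 mismatch.
    return 5 if a == b else -2


def extend_one_side(query, subject, i, j, step, x_drop):
    """Walk along the diagonal from (i, j) in direction `step` (+1 or -1).
    Returns (best score gain, number of positions kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(query) and 0 <= j < len(subject):
        score += pair_score(query[i], subject[j])
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score dropped by more than X: give up
        i += step
        j += step
    return best, best_len


def extend_seed(query, subject, qi, sj, w=3, x_drop=6):
    """Extend a seed of length w starting at query[qi], subject[sj].
    Returns (query start, subject start, length, score) of the segment pair."""
    seed_score = sum(pair_score(query[qi + k], subject[sj + k]) for k in range(w))
    right_gain, right_len = extend_one_side(query, subject, qi + w, sj + w, +1, x_drop)
    left_gain, left_len = extend_one_side(query, subject, qi - 1, sj - 1, -1, x_drop)
    return (qi - left_len, sj - left_len,
            w + left_len + right_len,
            seed_score + left_gain + right_gain)


print(extend_seed("TWLMHCAQY", "ICIMHCTHY", 3, 3))
```

Because the extension never looks beyond the drop-off point, a better segment further along the diagonal can be missed, which is why the result is heuristic.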

13 Pair-wise sequence alignment Algorithm not guaranteed to find best segment pair (Heuristic) But works well in practice!

14 The Basic Local Alignment Search Tool (BLAST) New BLAST version (1997) Two-hit strategy

15 Pair-wise sequence alignment W L M H C A Q Y A R V I M X H X C X T H W A X R X v X Search for two word pairs on the same diagonal; use a lower threshold T
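
The two-hit condition can be illustrated with a small sketch. It assumes that word hits are already available as (query position, subject position) pairs; the window size of 40 and the function name are assumptions for the example, not values taken from the slides.

```python
from collections import defaultdict


def two_hit_pairs(hits, w=3, window=40):
    """Return pairs of non-overlapping word hits that lie on the same
    diagonal (subject_pos - query_pos) and at most `window` apart."""
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)

    pairs = []
    for diag, qs in sorted(by_diagonal.items()):
        last = None
        for q in sorted(qs):
            if last is not None and q - last < w:
                continue  # overlaps the previous hit, ignore it
            if last is not None and q - last <= window:
                pairs.append(((last, last + diag), (q, q + diag)))
            last = q
    return pairs


hits = [(0, 5), (1, 6), (10, 15), (3, 20), (60, 65)]
print(two_hit_pairs(hits))
```

Only the pair of hits at query positions 0 and 10 qualifies here: the hit at position 1 overlaps the first one, the hit at 60 is too far away, and the hit on the other diagonal has no partner.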

16 The Basic Local Alignment Search Tool (BLAST) New BLAST version (1997) Two-hit strategy Gapped BLAST Position-Specific Iterative BLAST (PSI BLAST)

17 The Basic Local Alignment Search Tool (BLAST)

18 Multiple sequence alignment 1aboA 1.NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE 1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE 1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG 1ihvA 1.NFRVYYRDsrd......pvwkGPAKLLWkg.................eG 1vie 1.drvrkksga.........awqGQIVGWYctnlt.............peG 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......

19 Multiple sequence alignment First question: how to score multiple alignments? Possible scoring scheme: Sum-of-pairs score

20 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......

21 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......

22 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP......

23 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQtkngqGWVPSNYITPVN 1ycsB 39 WWWARlndkeGYVPRNLLGLYP

24 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......

25 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......

26 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

27 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp

28 Multiple sequence alignment Multiple alignment implies pairwise alignments: 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......

29 Multiple sequence alignment Multiple alignment implies pairwise alignments: Use sum of scores of these p.a. 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN......
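
A sum-of-pairs score can be computed as in the following sketch. The scoring values (match +1, mismatch -1, gap -2 per position) are placeholder assumptions; columns in which both sequences have a gap are ignored, which corresponds to removing them from the implied pairwise alignment.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    # Score one column of an implied pairwise alignment.
    if a == "-" and b == "-":
        return 0  # column vanishes in the pairwise projection
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum the scores of all pairwise alignments implied by a multiple alignment."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for r1, r2 in combinations(alignment, 2):
        total += sum(pair_score(a, b) for a, b in zip(r1, r2))
    return total


print(sum_of_pairs(["AC-T", "ACGT", "A-GT"]))
```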

30 Multiple sequence alignment Goal: Find multi-alignment with maximum score !

31 Multiple sequence alignment Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment. Multidimensional search space instead of two-dimensional matrix!

32 Multiple sequence alignment

33 Complexity: For three sequences of lengths l1, l2, l3: O(l1 * l2 * l3). For n sequences (average length l): O(l^n). Exponential complexity!
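
To get a feeling for the growth, the number of cells in the dynamic-programming lattice can be counted directly. The sketch assumes one axis per sequence with (length + 1) positions, as in the usual Needleman-Wunsch matrix; the function name is made up for the example.

```python
def dp_cells(lengths):
    """Number of cells in the multidimensional dynamic-programming lattice."""
    cells = 1
    for length in lengths:
        cells *= length + 1
    return cells


for n in (2, 3, 5):
    print(n, "sequences of length 100:", dp_cells([100] * n), "cells")
```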

34 Multiple sequence alignment Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment. Optimal solution not feasible:

35 Multiple sequence alignment Needleman-Wunsch scoring scheme can be generalized from pair-wise to multiple alignment. Optimal solution not feasible: -> Heuristics necessary

36 Multiple sequence alignment (A) Carrillo and Lipman (MSA) Find sub-space in dynamic-programming matrix where optimal path can be found

37 Multiple sequence alignment (B) Stoye, Dress (DCA) Divide search space into small sub-spaces. Calculate optimal alignment for sub-spaces. Concatenate sub-alignments.

38 Multiple sequence alignment (B) Stoye, Dress (DCA)

39 Multiple sequence alignment (B) Stoye, Dress (DCA)

40 Multiple sequence alignment Progressive alignment. Carry out a series of pair-wise alignments

41 Most popular way of constructing multiple alignments: Progressive alignment. Carry out a series of pair-wise alignments Multiple sequence alignment

42 WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP Multiple sequence alignment

43 WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP Align most similar sequences Multiple sequence alignment

44 WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASFQPVAALERIN WLNYNEERGDFPGTYVEYIGRKKISP

45 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASVQ--PVAALERIN------ WLN-YNEERGDFPGTYVEYIGRKKISP

46 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASVQ--PVAALERIN------ WLN-YNEERGDFPGTYVEYIGRKKISP Align sequence to alignment

47 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN- WW--RLNDKEGYVPRNLLGLYP- AVVIQDNSDIKVVP--KAKIIRD YAVESEASVQ--PVAALERIN------ WLN-YNEERGDFPGTYVEYIGRKKISP Align alignment to alignment

48 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN-------- WW--RLNDKEGYVPRNLLGLYP-------- AVVIQDNSDIKVVP--KAKIIRD------- YAVESEA---SVQ--PVAALERIN------ WLN-YNE---ERGDFPGTYVEYIGRKKISP

49 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN-------- WW--RLNDKEGYVPRNLLGLYP-------- AVVIQDNSDIKVVP--KAKIIRD------- YAVESEA---SVQ--PVAALERIN------ WLN-YNE---ERGDFPGTYVEYIGRKKISP Rule: “once a gap - always a gap”

50 Multiple sequence alignment Order of pair-wise profile alignments determined by phylogenetic tree based on pair-wise similarity values (guide tree)
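
How a guide tree fixes the order of the profile alignments can be sketched with a simple average-linkage (UPGMA-style) clustering. This is an assumption chosen for illustration, not necessarily the tree-building method of any particular program; the distance values and function name are invented.

```python
def guide_tree_order(names, dist):
    """Average-linkage clustering on a symmetric distance matrix.
    Returns the merges in the order in which profiles would be aligned."""
    clusters = {i: (names[i],) for i in range(len(names))}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # closest pair of clusters
        merges.append((clusters[a], clusters[b]))
        size_a, size_b = len(clusters[a]), len(clusters[b])
        for c in clusters:
            if c in (a, b):
                continue
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # size-weighted average distance to the merged cluster
            d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        merged = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        clusters[next_id] = merged
        next_id += 1
    return merges


names = ["A", "B", "C", "D"]
dist = [[0, 2, 8, 9],
        [2, 0, 7, 9],
        [8, 7, 0, 3],
        [9, 9, 3, 0]]
for left, right in guide_tree_order(names, dist):
    print("align", left, "with", right)
```

In this toy example A and B are aligned first, then C and D, and finally the two resulting alignments are aligned to each other.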

51 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP

52 Multiple sequence alignment WCEAQTKNGQGWVPSNYITPVN WWRLNDKEGYVPRNLLGLYP AVVIQDNSDIKVVPKAKIIRD YAVESEAHPGSFQPVAALERIN WLNYNETTGERGDFPGTYVEYIGRKKISP

53 Multiple sequence alignment Problem: simple guide tree determines multiple alignment; multiple alignment determines phylogenetic analysis

54 Multiple sequence alignment Implementations: Clustal W, PileUp, MultAlin

55 Local multiple alignment M M

56 M M M

57 M M M M´

58 Local multiple alignment Find motifs contained in all sequences in data set Problem: motifs often present in only sub-families

59 Neither local nor global methods applicable

60 Alignment possible if order conserved

61 The DIALIGN approach

62 Combination of local and global methods.

63 The DIALIGN approach Combination of local and global methods. Find local pair-wise similarities between input sequences (fragments)

64 The DIALIGN approach Combination of local and global methods. Find local pair-wise similarities between input sequences (fragments) Compose alignments from fragments

65 The DIALIGN approach Combination of local and global methods. Find local pair-wise similarities between input sequences (fragments) Compose alignments from fragments Ignore non-related parts of the sequences

66 The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc

67 The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc

68 The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc

69 The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc

70 The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc ------atctaatagttaaaccccctcgtgcttag-------agatccaaac cagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc--

71 The DIALIGN approach atctaatagttaaactcccccgtgcttagagatccaaac cagtgcgtgtattactaacggttcaatcgcgcacatccgc ------atctaatagttaaaccccctcgtgcttag-------agatccaaac cagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc-- ------atcTAATAGTTAaaccccctcgtGCTTag-------AGATCCaaac cagtgcgtgTATTACTAAc----------GGTTcaatcgcgcACATCCgc--

72 The DIALIGN approach Score of an alignment: Define score of fragment f: l(f) = length of f s(f) = sum of matches (similarity values) P(f) = probability to find a fragment with length l(f) and at least s(f) matches in random sequences that have the same length as the input sequences. Score w(f) = -ln P(f)
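
The slide does not say how P(f) is computed. The sketch below uses one plausible approximation, stated here as an assumption: the probability that a single gap-free segment of length l(f) has at least s(f) matches is a binomial tail with match probability p (0.25 for DNA), and it is multiplied by the number of possible start positions len1 * len2 and capped at 1. Function names are hypothetical.

```python
import math


def binomial_tail(l, s, p):
    """Probability of at least s matches among l independent positions."""
    return sum(math.comb(l, k) * p**k * (1 - p)**(l - k) for k in range(s, l + 1))


def fragment_weight(l, s, len1, len2, p=0.25):
    """w(f) = -ln P(f), with P(f) approximated as described above (assumption)."""
    prob = min(1.0, len1 * len2 * binomial_tail(l, s, p))
    return -math.log(prob)


# A perfect 10-mer match between two sequences of length 100:
print(round(fragment_weight(10, 10, 100, 100), 3))
# A short, weak fragment is expected by chance and gets weight 0:
print(fragment_weight(3, 1, 100, 100) == 0)
```

Long fragments with many matches are very unlikely in random sequences and therefore receive a high weight, while short or weak fragments contribute little or nothing.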

73 The DIALIGN approach Score of an alignment: Define score of alignment as sum of scores w(f) of its fragments No gap penalty is used! Optimization problem for pair-wise alignment: Find chain of fragments with maximal total score

74 The DIALIGN approach ------atctaatagttaaaccccctcgtgcttag-------agatccaaac cagtgcgtgtattactaac----------ggttcaatcgcgcacatccgc-- Fragment-chaining algorithm finds optimal chain of fragments.
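
Chaining can be illustrated with a simple quadratic-time dynamic program. This is a sketch of the optimization problem only; it is not claimed to be the algorithm used in DIALIGN, and the fragment tuples (start in sequence 1, start in sequence 2, length, weight) are an assumed representation.

```python
def best_chain(fragments):
    """Find the chain of fragments with maximal total weight.
    A fragment may follow another only if it starts after the other
    one ends in both sequences."""
    frags = sorted(fragments)
    best = []  # best[i] = (score of best chain ending in frags[i], predecessor)
    for i, (s1, s2, length, weight) in enumerate(frags):
        score, prev = weight, None
        for j in range(i):
            p1, p2, plen, _ = frags[j]
            if p1 + plen <= s1 and p2 + plen <= s2 and best[j][0] + weight > score:
                score, prev = best[j][0] + weight, j
        best.append((score, prev))
    if not best:
        return 0.0, []
    i = max(range(len(best)), key=lambda k: best[k][0])
    total = best[i][0]
    chain = []
    while i is not None:  # follow predecessor links back to the chain start
        chain.append(frags[i])
        i = best[i][1]
    return total, chain[::-1]


fragments = [(0, 6, 5, 3.0), (2, 0, 4, 2.0), (7, 12, 4, 2.5), (6, 5, 6, 4.0)]
print(best_chain(fragments))
```

No gap penalty appears anywhere: the regions between chained fragments are simply left unaligned.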

75 The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

76 The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

77 The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

78 The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

79 The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

80 The DIALIGN approach Multiple fragment alignment atctaatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

81 The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaagagtatcacccctgaattgaataa

82 The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaacggttcaatcgcg caaa--gagtatcacccctgaattgaataa

83 The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgcttag cagtgcgtgtattactaac----------ggttcaatcgcg caaa--gagtatcacc----------cctgaattgaataa

84 The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac----------gg-ttcaatcgcg caaa--gagtatcacc----------cctgaattgaataa

85 The DIALIGN approach Multiple fragment alignment atc------taatagttaaactcccccgtgc-ttag cagtgcgtgtattactaac----------gg-ttcaatcgcg caaa--gagtatcacc----------cctgaattgaataa Consistency: it is possible to introduce gaps such that all segment pairs are aligned.

86 The DIALIGN approach Multiple fragment alignment atc------TAATAGTTAaactccccCGTGC-TTag cagtgcGTGTATTACTAAc----------GG-TTCAATcgcg caaa--GAGTATCAcc----------CCTGaaTTGAATaa

87 Program evaluation Use biologically verified alignments (known 3D structure of proteins) Compare alignments produced by computer programs to “biologically correct” alignments.

88 Program evaluation (1) First evaluation of multiple alignment programs (McClure, Vasi, Fitch, 1994). 4 protein families used: globin, kinase, protease, ribonuclease H, all globally related -> global programs performed best

89 Program evaluation (2) The BAliBASE (Thompson et al., 1999) ~ 100 protein families with known 3D structure, some with large insertions/deletions.

90 Program evaluation 1aboA 1.NLFVALYDfvasgdntlsitkGEKLRVLgynhn..............gE 1ycsB 1 kGVIYALWDyepqnddelpmkeGDCMTIIhrede............deiE 1pht 1 gYQYRALYDykkereedidlhlGDILTVNkgslvalgfsdgqearpeeiG 1ihvA 1.NFRVYYRDsrd......pvwkGPAKLLWkg.................eG 1vie 1.drvrkksga.........awqGQIVGWYctnlt.............peG 1aboA 36 WCEAQt..kngqGWVPSNYITPVN...... 1ycsB 39 WWWARl..ndkeGYVPRNLLGLYP...... 1pht 51 WLNGYnettgerGDFPGTYVEYIGrkkisp 1ihvA 27 AVVIQd..nsdiKVVPRRKAKIIRd..... 1vie 28 YAVESeahpgsvQIYPVAALERIN...... Key alpha helix RED beta strand GREEN core blocks UNDERSCORE

91 Program evaluation Results: Four programs performed best, but no method was best in all test examples. ClustalW, SAGA and PRRP best for global alignment, DIALIGN best for sequences with large insertions or deletions.

92 Program evaluation (3) Lassmann and Sonnhammer (2002) Used BAliBASE plus artificial sequences for local alignment. Results: T-COFFEE best for closely related sequences, DIALIGN best for distantly related sequences.

93 Program evaluation

94 Alignment of large genomic sequences Important tool for identifying functional sites (e.g. genes or regulatory elements)

95 Alignment of large genomic sequences Phylogenetic Footprinting: Functional sites more conserved during evolution => Sequence similarity indicates biological function

96 Alignment of large genomic sequences DIALIGN performs well in identifying local homologies, but is slow

97 Quadratic program running time

104 Solution: Anchored alignments

111 Find anchor points to reduce search space

112 Solution: Anchored alignments Use fast heuristic method to find anchor points: CHAOS developed together with Mike Brudno Brudno et al. (2003), BMC Bioinformatics 4:66
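
The effect of anchor points on the search space can be estimated with a few lines of code. The sketch assumes that the anchors are collinear (increasing in both sequences) and that only the rectangles between consecutive anchors have to be searched; the function name is hypothetical and nothing here reflects how CHAOS itself finds anchors.

```python
def anchored_search_space(len1, len2, anchors):
    """Number of matrix cells left when the alignment is forced
    through the given anchor points (pos1, pos2)."""
    points = [(0, 0)] + sorted(anchors) + [(len1, len2)]
    area = 0
    for (a1, a2), (b1, b2) in zip(points, points[1:]):
        if b1 < a1 or b2 < a2:
            raise ValueError("anchors must be increasing in both sequences")
        area += (b1 - a1) * (b2 - a2)
    return area


print(anchored_search_space(1000, 1000, []))
print(anchored_search_space(1000, 1000, [(250, 250), (500, 500), (750, 750)]))
```

Three evenly spaced anchors already cut the quadratic search space to a quarter in this toy case.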

113 Solution: Anchored alignments

114 (3) Anchored alignments

116 First step to gene prediction: Exon discovery by genomic alignment

117 Evaluation of different alignment programs: Compare local sequence similarity identified by alignment programs to known exons Morgenstern et al. (2002), Bioinformatics 18:777-787

118 DIALIGN alignment of human and murine genomic sequences

119 DIALIGN alignment of tomato and Arabidopsis thaliana genomic sequences

120 Evaluation of DIALIGN, PipMaker, WABA, BLASTN and TBLASTX on a set of 42 human and murine genomic sequences. Compare similarities to annotated exons Apply cut-off parameter to resulting alignments Measure sensitivity and specificity
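
Sensitivity and specificity at the nucleotide level can be computed as in this sketch. It assumes the definitions common in the gene-prediction literature (sensitivity = TP / annotated positions, specificity = TP / predicted positions); the slides do not state which definitions were used, and the interval representation (half-open start, end) is an assumption.

```python
def positions(intervals):
    """Set of sequence positions covered by half-open (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end))
    return covered


def sensitivity_specificity(predicted, annotated):
    pred, true = positions(predicted), positions(annotated)
    tp = len(pred & true)  # positions both predicted and annotated as exon
    sensitivity = tp / len(true) if true else 0.0
    specificity = tp / len(pred) if pred else 0.0
    return sensitivity, specificity


aligned_regions = [(0, 10), (20, 30)]   # local similarities above the cut-off
annotated_exons = [(5, 15), (20, 25)]
print(sensitivity_specificity(aligned_regions, annotated_exons))
```

Raising the cut-off typically removes weak similarities, which tends to increase specificity at the cost of sensitivity.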

121 Performance of long-range alignment programs for exon discovery (human - mouse comparison)

122 Performance of long-range alignment programs for exon discovery (Arabidopsis thaliana - tomato comparison)

123 AGenDA: Alignment-based Gene Detection Algorithm. Bridge small gaps between DIALIGN fragments -> cluster of fragments. Search conserved splice sites and start/stop codons at cluster boundaries to identify candidate exons. Recursive algorithm finds biologically consistent chain of potential exons.

124 Identification of candidate exons Fragments in DIALIGN alignment

125 Identification of candidate exons Build cluster of fragments
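
Bridging small gaps between fragments amounts to merging nearby intervals. The sketch below works on the coordinates of one sequence only and uses a made-up maximum gap of 10 positions; both simplifications are assumptions for illustration.

```python
def cluster_fragments(fragments, max_gap=10):
    """Merge fragments (half-open start, end) that are separated
    by at most max_gap positions into clusters."""
    clusters = []
    for start, end in sorted(fragments):
        if clusters and start - clusters[-1][1] <= max_gap:
            clusters[-1][1] = max(clusters[-1][1], end)  # extend current cluster
        else:
            clusters.append([start, end])  # open a new cluster
    return [tuple(c) for c in clusters]


print(cluster_fragments([(100, 130), (135, 160), (300, 340), (158, 170)]))
```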

126 Identification of candidate exons Identify conserved splice sites

127 Identification of candidate exons Candidate exons bounded by conserved splice sites

128 Construct gene models using candidate exons Score of candidate exon (E) based on DIALIGN scores for fragments, score of splice junctions and penalty for shortening / extending Find biologically consistent chain of candidate exons (starting with start codon, ending with stop codon, no internal stop codons …) with maximal total score

129 Find optimal consistent chain of candidate exons

133 atggtaggtagtgaatgtga

134 Find optimal consistent chain of candidate exons atggtaggtagtgaatgtga G1G2

135 Find optimal consistent chain of candidate exons Recursive algorithm calculates optimal chain of candidate exons in N log N time
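
An N log N bound of this kind is typical for chaining scored intervals. The following sketch shows a standard weighted-interval-scheduling solution with binary search; it is a stand-in for illustration, not the AGenDA algorithm, and it ignores the biological consistency conditions (start/stop codons, reading frame) mentioned on the slides. Exons are assumed to be (start, end, score) tuples with half-open coordinates.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """Maximal-score chain of non-overlapping candidate exons."""
    exons = sorted(exons, key=lambda e: e[1])  # sort by end position
    ends = [e[1] for e in exons]
    n = len(exons)
    total = [0.0] * (n + 1)   # total[k]: best score using the first k exons
    take = [False] * (n + 1)  # whether exon k-1 is part of that optimum
    pred = [0] * (n + 1)      # how many earlier exons are compatible
    for k in range(1, n + 1):
        start, _, score = exons[k - 1]
        p = bisect_right(ends, start, 0, k - 1)  # exons ending at or before start
        pred[k] = p
        if total[p] + score > total[k - 1]:
            total[k], take[k] = total[p] + score, True
        else:
            total[k] = total[k - 1]
    chain, k = [], n
    while k > 0:  # trace back the chosen exons
        if take[k]:
            chain.append(exons[k - 1])
            k = pred[k]
        else:
            k -= 1
    return total[n], chain[::-1]


candidates = [(0, 10, 3.0), (5, 20, 5.0), (20, 30, 2.0), (12, 18, 1.5)]
print(best_exon_chain(candidates))
```

Sorting costs N log N and each exon needs one binary search, which gives the overall N log N running time.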

136 DIALIGN fragments

137 Candidate exons

138 Complete model

139 Results: 105 pairs of genomic sequences from human and mouse (Batzoglou et al., 2000)

140 AGenDA GenScan 64 % 12 % 17 %

141 Results: Quality of AGenDA-based gene models comparable to results from GenScan. Exons identified that have not been identified by GenScan. No statistical models derived from known genes (no training data necessary!). Method generally applicable.

142 AGenDA: Alignment-based Gene Detection Algorithm WWW server: http://bibiserv/TechFak.Uni-Bielefeld.DE/agenda Rinner, Taher, Goel, Sczyrba, Brudno, Batzoglou, Morgenstern, submitted
