Presentation is loading. Please wait.

Presentation is loading. Please wait.

Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching.

Similar presentations


Presentation on theme: "Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching."— Presentation transcript:

1 Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching

2 Aidan Budd, EMBL Heidelberg Residues: Monomers within a polymer (polypeptide or polynucleotide) chain residues "Anatomy" of a Sequence Alignment WKKLGSNVG WGKVKNVD sequence s Sequences: List of residues in a polymer chain......listed in the same order they occur within the polymer

3 Aidan Budd, EMBL Heidelberg 1:1 residue correspondences/ relationships sequence s residues "Anatomy" of a Sequence Alignment WKKLGSNVG WGKVKNVD 1:1 residue correspondences/relationships Correspondences between a single residue in one sequence and a single residue in another sequence

4 Aidan Budd, EMBL Heidelberg Could perhaps say there is a "1:2" relationship between this residue and these residues Residue has no equivalent in the top sequence i.e. no residue in the top sequence has a 1:1 relationship with this residue 1:1 residue correspondences/ relationships "Anatomy" of a Sequence Alignment WKKLGSNVG WGKVKNVD However, alignments focus on 1:1 relationships

5 Aidan Budd, EMBL Heidelberg Often represented using a grid/matrix: WKKLGSNVG WGKVKNVD Sequence Alignment Within a Grid WKKLGSNVG WGKVKNVD One sequence per row - - Gap characters (usually " - ") indicate that the sequence contains no residues 'equivalent' to other residues in that column - Residues in the same column are 'equivalent'

6 Aidan Budd, EMBL Heidelberg Multiple Sequence Alignment WKKLGSNVG WGKV--NVD -AKV---VD

7 Aidan Budd, EMBL Heidelberg Pairwise Sequence Alignment Raise your hands: Who has ever seen a "pairwise alignment" before? Who has ever used/encountered one in their research? Pairwise sequence alignments are a crucial component of many bioinformatic analyses and tools In particular they lie at the heart of the tools most commonly used to predict function of protein/DNA/RNA molecules i.e. to generate hypotheses for the function of key biological entities that can be tested by wet-lab experiments

8 Aidan Budd, EMBL Heidelberg Building Sequence Alignments

9 Aidan Budd, EMBL Heidelberg How They Used to Align Sequences... Courtesy of Geoff Barton, Dundee

10 Aidan Budd, EMBL Heidelberg Pairwise Alignments How we can build them today "by hand" using JalView Why it's useful to look at this, despite the many automatic methods Sometimes we're right and the automatic methods are wrong and we can spot this Useful to get you thinking about how we choose a good from a bad alingment, and to identify sequences that are easier of more difficult to align (as these are the same kinds of sequences that automatic tools find easy/difficult to align, so we know better what to expect from such tools

11 Aidan Budd, EMBL Heidelberg Pairwise Alignments How we can build them today "by hand" using JalView http://tinyurl.com/protBioinf2011 During exercises, write down: 1.Features that describe a relatively "good" alignment, thinking in terms of sizes of gaps numbers of gaps properties of residues in the same column (the same as each other? different?) 1.Instructions on how to change a "bad" alignment into a better one 2.Characteristics of sequences that are more difficult/take more time to align than others Describe what you've written to your neighbour - have you come up with the same answers?

12 Aidan Budd, EMBL Heidelberg Pairwise Alignments 1.Features that describe a relatively "good" alignment: small, few gaps residues in the same column either the same amino acid, or ones with similar physical/chemical property 1.We build a "good" alignment by inserting as few, as short as possible gaps into the alignment, so that as many columns as possible contain the same/similar residues 2.Sequences that are more difficult/take more time to align than others when: They are more divergent (e.g. when the best alignment contains fewer "identical" residues) They are longer They have very different lengths They are DNA rather than the corresponding coding sequences They contain "repeated" regions Part of the reason why these situations cause problems is that, rather than there clearly being one best alignment, we find several that are "equally" good (or bad)

13 Aidan Budd, EMBL Heidelberg 1:1 Relationships Between Residues in the Same Column Thinking in terms of (i) protein structure and (ii) evolution, what defines the relationship that you expect residues in the same column of a good/correct alignment to share with each other? Think about this on your own for a minute. Perhaps there's just a single word you might want to use to describe this sought-after relationship? Try and explain your answers to your neighbours Try and write down a definition of this relationship

14 Aidan Budd, EMBL Heidelberg Alternative Interpretations of MSAs (Evolutionary and Structural)

15 Aidan Budd, EMBL Heidelberg Structurally equivalent/similar Evolutionary equivalent/related/homologous Residues in the same column either: Different applications assume different types of equivalence Different types of similarity not necessarily equivalent “Equivalence”/similarity of residues

16 Aidan Budd, EMBL Heidelberg Structural Similarity Bacterial toxins 1ji6 and 1i5p1ji6 UnalignedStructurally Aligned

17 Aidan Budd, EMBL Heidelberg Structural Similarity Chain1: 68 ELIGLQANIREFNQQVDNF 1111111111111111111 Chain2: 70 ELQGLQNNFEDYVNALNSW Residues with a similar structural context may lie almost on top of each other within a structural alignment. Clearly, the dark green and red side chains have more similar structural contexts than they do with the adjacent light-coloured side chains

18 Aidan Budd, EMBL Heidelberg Chain 1: 16 KVGSLIGKR---ILSELWGIIFPSGST 111111111 11111111111 111 Chain 2: 16 VVGVPFAGALTSFYQSFLNTIWP-SDA Structural Similarity

19 Aidan Budd, EMBL Heidelberg Structural equivalence Some regions of the structures do not have structurally equivalent residues in the other structure 1i5p: DNFLNPTQN----PVPLSITSSVN 111111 111111111111ji6: NSWKKTPLSLRSKRSQDRIRELFS Alignment gaps are a sure sign of such residues Placing such residues in the same column as residues from other sequences is a misalignment - to be avoided!

20 Aidan Budd, EMBL Heidelberg “Evolutionary Equivalence” AGWYTI AGWWTI AGWYTI AGWWTI AAWYTI AGWWTI AAQQQWYTI Mutation / Substitution Y-W Substitution G-A QQQ Insertion AGWYTI Two copies of gene generated AGWYTI AGWWTI AGWYTI AGWWTI AAWYTI AG---WWTI AAQQQWYTI Residues in the same alignment column should trace their history back to the same residue in the ancestral sequence with any changes due only to point substitutions

21 Aidan Budd, EMBL Heidelberg Quiz - Evolutionary Interpretation of Alignments Which alignment of the final sequences (X, Y or Z) only places residues in the same column if they are related by substitution events? KGE--------PGIGL------PG KGIPG-----------DPAFGDPG RGIPGEVLGAQ-----------PG Z KGEPG------IGL------PG KGIPG---------DPAFGDPG RGIPGEVLGAQ---------PG Y KGEPG---IGLPG KGIPGDPAFGDPG RGIPGEVLGAQPG X

22 Aidan Budd, EMBL Heidelberg Quiz - Evolutionary Interpretation of Alignments KGE--------PGIGL------PG KGIPG-----------DPAFGDPG RGIPGEVLGAQ-----------PG "True" alignment given history described above RGIPGEVLGAQPG KGIPGDPAFGDP G ---KGEPGIGLPG PRANK

23 Aidan Budd, EMBL Heidelberg RGIPGEVLGAQPG KGIPGDPAFGDP G ---KGEPGIGLPG PRANK KGEPG---IGLPG KGIPGDPAFGDPG RGIPGEVLGAQPG MAFFT K---GEPGIGLPG KGIPGDPAFGDPG RGIPGEVLGAQPG CLUSTALX Different automatic MSA software gives different results All are different from the "true" alignment (assuming the scenario of transformation on the previous slide is true)...... because that scenario is very unlikely under the models of evolutionary transformation incorporated within these tools KGE--------PGIGL------PG KGIPG-----------DPAFGDPG RGIPGEVLGAQ-----------PG Z KGEPG------IGL------PG KGIPG---------DPAFGDPG RGIPGEVLGAQ---------PG Y KGEPG---IGLPG KGIPGDPAFGDPG RGIPGEVLGAQPG X Quiz - Evolutionary Interpretation of Alignments

24 Aidan Budd, EMBL Heidelberg Interpreting Alignments Special 1:1 relationship between residues in the same column Structural: very similar structural context Evolutionary: any difference between residues in the same column due to point substitution (not to any other kind of mutation e.g. deletion followed by insertion) Structural and Evolutionary equivalence need not necessarily be the same Not all residues have 1:1 equivalents in other sequences

25 Aidan Budd, EMBL Heidelberg nef1/fyn1 PDB:1efn Non-Equivalence of Evolutionary and Structural Alignments Demonstration 1: Structural equivalence without evolutionary equivalence Structural alignment of SH3-interaction motifs from nef and ncf1 ncf1/ncf4 PDB:1w70 aligned ncf1/nef1 SH3 interaction motifs

26 Aidan Budd, EMBL Heidelberg Non-Equivalence of Evolutionary and Structural Alignments Demonstration 2b: Sequences differ by ONE amino acid residue and have different folds Proc Natl Acad Sci U S A. 2009 Dec 15;106(50):21149-54. A minimal sequence code for switching protein structure and function.Alexander PA, He Y, Chen Y, Orban J, Bryan PN.PMID: 19923431Alexander PA, Or Bryan: 19923 G A 95G B 95

27 Aidan Budd, EMBL Heidelberg Quiz - Numbers of Insertions (a) 2(b) 1(c) 0(d) 3 The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?

28 Aidan Budd, EMBL Heidelberg Quiz - Numbers of Insertions If all sequences are the same length, we can explain their diversity without inferring ANY insertions or deletions If and alignment contains sequences that are all either length x or y, then we can explain their diversity by inferring just one insertion or deletion The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is?

29 Aidan Budd, EMBL Heidelberg Quiz - Numbers of Insertions The minimum number of insertion events required to account for the section of haemoglobin alignment shown above is? We can ALWAYS explain observed sequence length diversity with: 0 insertions (all length variation due to deletion) 0 deletions (all length variation due to insertion) a combination of insertions and deletions Perhaps we should instead focus on inferring the most likely scenario? (Although if this is not particularly relevant for our analysis, perhaps we should focus instead on something completely different!)

30 Aidan Budd, EMBL Heidelberg Identifying Good Alignments

31 Aidan Budd, EMBL Heidelberg Distinguishing Better from Worse Alignments An "objective" way of choosing which of two alignments (between the same pair of sequences) is "better" (more likely to be correct) would help us identify "good" alignments Scoring schemes/rules have been developed for this purpose These aim to assign higher scores to alignments that are more similar to true alignments the more similar an alignment is to that are more similar to "true" alignments should get a higher score than those similar means - more columns in which residues are found int eh asme column as each other in true alignments, more gaps of similar size and number as in true alignments NOTE - we are optimising against some "average" protein - thse may be different for different proteins, howver it's still demonstrably useful higher score lower score

32 Aidan Budd, EMBL Heidelberg Scoring Schemes for Aligned Residues SeqANAN SeqBN-G 3.80.4 SeqANAN SeqBNG- 3.80.5 score = 3.8 + 0.5 = 4.3 score = 3.8 + 0.4 = 4.2 Commonly, scoring schemes (for columns containing no gaps) assign a score for each column......the score for the full alignment is then the sum of the individual column scores Below we calculate, in this way, a score for two alignments of the same sequences For each column, the score assigned depends on which residues are found in the column Aligned residue scores are taken from an amino acid substitution matrix - see next slide...

33 Aidan Budd, EMBL Heidelberg Scoring Schemes for Aligned Residues SeqANAN SeqBN-G 3.80.4 SeqANAN SeqBNG- 3.80.5 score = 3.8 + 0.5 = 4.3 score = 3.8 + 0.4 = 4.2 Commonly, scoring schemes (for columns containing no gaps) assign a score for each column......the score for the full alignment is then the sum of the individual column scores Below we calculate, in this way, a score for two alignments of the same sequences For each column, the score assigned depends on which residues are found in the column Aligned residue scores are taken from an amino acid substitution matrix - see next slide...

34 Aidan Budd, EMBL Heidelberg Amino Acid Substitution Matrices Gonnet PAM250 SeqANAN SeqBN-G 3.80.4 SeqANAN SeqBNG- 3.80.5 score = 3.8 + 0.5 = 4.3 score = 3.8 + 0.4 = 4.2 Matrix provides a score for alignment of each different pair of amino acids

35 Aidan Budd, EMBL Heidelberg Amino Acid Substitution Matrices Gonnet PAM250 Values in some cells in this matrix are positive Although more cells contain negative values

36 Aidan Budd, EMBL Heidelberg Amino Acid Substitution Matrices Gonnet PAM250 Values are obtained analysing alignments between sequences that we are very confident: have similar 3D structures evolutionarily related by processes of: point substitution insertion deletion

37 Aidan Budd, EMBL Heidelberg Amino Acid Substitution Matrices One set of matrices (the BLOSUM series) are built by analysing many ungapped regions ("blocks") of alignments of fairly similar sequences count how often we see each pair of residues in the same column count how often each different aa is found in the datasets calculate which is more likely: residues are found in the same column by chance residues are found in the same column in alignments analysed numbers range between 0 and infinity < 1 - more likely by chance > 1 more likely that it's beca Block IPB000108A ID P67PHOX; BLOCK AC IPB000108A; distance from previous block=(7,386) DE Neutrophil cytosol factor 2 signature BL PR00499; width=18; seqs=27; 99.5%=1042; strength=1146 NCF2_BOVIN|O77775NCF2_BOVIN|O77775 ( 218) APLQPQAAEPPPRPKTPE 9 NCF2_HUMAN|P19878 ( 218) APLQPQAAEPPPRPKTPE 9 ( 218) APLQPQ O70145|O70145_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10SAEPPPRPKTPE 10 YKYKA7_CAEEL|P34258 ( 117) IPLKEAFTALPPRPAAPS 40 Q8NFC7 ( 218) APLQPQAAEPPPRPKTPE 98NFC7 Q9BV51 ( 218) APLQPQAAEPPPRPKTPE 9 ( 21 Q95MN2|Q95MN2_RABIT ( 218) APLQPQAAEPPPRPKTPE 98) APL Q95L70|Q95L70_BISBI ( 218) APLQPQAAEPPPRPKTPE 9QPQAAEPPPRPKTPE 9 Q9N0E9|Q9N0E9_TURTR ( 218) APLQPQAAEPPPRPKTPE 9 Q6GMC8|Q6GMC8_XENQ6GMC8|Q6GMC8_XENLA ( 219) APLQPQANNPPSRPKTPE 22 Q59F14|Q59F14_HUMAN ( 246) APLQPQVRQSDLLGAQAG 95MAN ( 246) APLQPQV Q61QT5|Q61QT5_CAEBR ( 237) IPLKEAFSAPPPRPAAPS 37APPPRPAAPS 37 Q5R5Q5R5J0|Q5R5J0_PONPY ( 218) APLQPQAAEPPPRPKTPE 9 O08635|O08635_MOUSE ( 20) QIFKNQDPVLPPRPKPGH 205|O08635_MOUSE ( Q3TC92|Q3TC92_MOUSE ( 232) APLQPQSAEPPPRPKTPE 1032) APLQPQSAEPPPRPK Q3U5S4|Q3U5S4_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10TPE 10 Q6DFH8|Q6DFQ6DFH8|Q6DFH8_XENLA ( 52) YVIKRQQPDLPPRPKPGH 43 Q499C5|Q499C5_XENTR ( 219) APLQPQASNPPPRPKTPE 13C5_XENTR ( 219) AP Q60FB5|Q60FB5_ONCMY ( 218) APLQPQVEEVPTRPKVPE 25LQPQVEEVPTRPKVPE 2 Q32N10|Q32N10_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 1616 Q5HYK7|Q5HYK7_HUQ5HYK7|Q5HYK7_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16 Q5U3B8|Q5U3B8_HUMAN ( 20) QVFKNQDPVLPPRPKPGH 16MAN ( 20) QVFKNQD Q9UFC8|Q9UFC8_HUMAN ( 8) QVFKNQDPVLPPRPKPGH 16PVLPPRPKPGH 16 Q4RQ4R8R2|Q4R8R2_MACFA ( 20) QVFKNQDPVLPPRPKPGH 16 O35146|O35146_MOUSE ( 10) KIVTPLDERPRGRPNDSG 100146|O35146_MOUSE ( Q1PCS1|Q1PCS1_MUSMM ( 218) APLQPQSAEPPPRPKTPE 10218) APLQPQSAEPPPRP Q1A3S2|Q1A3S2_9PERO ( 218) APLQPQVEDGPTRPKEPE 27KEPE 27 // http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A Values are obtained analysing alignments between sequences that we are very confident: have similar 3D structures evolutionarily related by processes of: point substitution insertion deletion

38 Aidan Budd, EMBL Heidelberg Amino Acid Substitution Matrices count how often we see each pair of residues in the same column count how often each different aa is found in the datasets calculate which is more likely: residues are found in the same column by chance residues are found in the same column in alignments analysed numbers range between 0 and infinity < 1 - more likely by chance > 1 more likely that it's beca Block IPB000108A ID P67PHOX; BLOCK AC IPB000108A; distance from previous block=(7,386) DE Neutrophil cytosol factor 2 signature BL PR00499; width=18; seqs=27; 99.5%=1042; strength=1146 NCF2_BOVIN|O77775NCF2_BOVIN|O77775 ( 218) APLQPQAAEPPPRPKTPE 9 NCF2_HUMAN|P19878 ( 218) APLQPQAAEPPPRPKTPE 9 ( 218) APLQPQ O70145|O70145_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10SAEPPPRPKTPE 10 YKYKA7_CAEEL|P34258 ( 117) IPLKEAFTALPPRPAAPS 40 Q8NFC7 ( 218) APLQPQAAEPPPRPKTPE 98NFC7 Q9BV51 ( 218) APLQPQAAEPPPRPKTPE 9 ( 21 Q95MN2|Q95MN2_RABIT ( 218) APLQPQAAEPPPRPKTPE 98) APL Q95L70|Q95L70_BISBI ( 218) APLQPQAAEPPPRPKTPE 9QPQAAEPPPRPKTPE 9 Q9N0E9|Q9N0E9_TURTR ( 218) APLQPQAAEPPPRPKTPE 9 Q6GMC8|Q6GMC8_XENQ6GMC8|Q6GMC8_XENLA ( 219) APLQPQANNPPSRPKTPE 22 Q59F14|Q59F14_HUMAN ( 246) APLQPQVRQSDLLGAQAG 95MAN ( 246) APLQPQV Q61QT5|Q61QT5_CAEBR ( 237) IPLKEAFSAPPPRPAAPS 37APPPRPAAPS 37 Q5R5Q5R5J0|Q5R5J0_PONPY ( 218) APLQPQAAEPPPRPKTPE 9 O08635|O08635_MOUSE ( 20) QIFKNQDPVLPPRPKPGH 205|O08635_MOUSE ( Q3TC92|Q3TC92_MOUSE ( 232) APLQPQSAEPPPRPKTPE 1032) APLQPQSAEPPPRPK Q3U5S4|Q3U5S4_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10TPE 10 Q6DFH8|Q6DFQ6DFH8|Q6DFH8_XENLA ( 52) YVIKRQQPDLPPRPKPGH 43 Q499C5|Q499C5_XENTR ( 219) APLQPQASNPPPRPKTPE 13C5_XENTR ( 219) AP Q60FB5|Q60FB5_ONCMY ( 218) APLQPQVEEVPTRPKVPE 25LQPQVEEVPTRPKVPE 2 Q32N10|Q32N10_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 1616 Q5HYK7|Q5HYK7_HUQ5HYK7|Q5HYK7_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16 Q5U3B8|Q5U3B8_HUMAN ( 20) QVFKNQDPVLPPRPKPGH 16MAN ( 20) QVFKNQD Q9UFC8|Q9UFC8_HUMAN ( 8) QVFKNQDPVLPPRPKPGH 16PVLPPRPKPGH 16 Q4RQ4R8R2|Q4R8R2_MACFA ( 20) QVFKNQDPVLPPRPKPGH 16 O35146|O35146_MOUSE ( 10) KIVTPLDERPRGRPNDSG 100146|O35146_MOUSE ( Q1PCS1|Q1PCS1_MUSMM ( 218) APLQPQSAEPPPRPKTPE 10218) APLQPQSAEPPPRP Q1A3S2|Q1A3S2_9PERO ( 218) APLQPQVEDGPTRPKEPE 27KEPE 27 // http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A Key parameters estimated in analysis: frequency (q ij ) with which residues i and j are found in the same column (averaged over all analysed blocks) frequency with which each pair of residues are found in same column if all sequences randomised where p i is the frequency with which residue i is present in the complete dataset, this is: p i * p j If two residues i and j are more often found in the same column in blocks compared to randomised sequences then q ij / p i * p j > 1 and log(q ij / p i * p j ) > 0

39 Aidan Budd, EMBL Heidelberg Amino Acid Substitution Matrices count how often we see each pair of residues in the same column count how often each different aa is found in the datasets calculate which is more likely: residues are found in the same column by chance residues are found in the same column in alignments analysed numbers range between 0 and infinity < 1 - more likely by chance > 1 more likely that it's beca Block IPB000108A ID P67PHOX; BLOCK AC IPB000108A; distance from previous block=(7,386) DE Neutrophil cytosol factor 2 signature BL PR00499; width=18; seqs=27; 99.5%=1042; strength=1146 NCF2_BOVIN|O77775NCF2_BOVIN|O77775 ( 218) APLQPQAAEPPPRPKTPE 9 NCF2_HUMAN|P19878 ( 218) APLQPQAAEPPPRPKTPE 9 ( 218) APLQPQ O70145|O70145_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10SAEPPPRPKTPE 10 YKYKA7_CAEEL|P34258 ( 117) IPLKEAFTALPPRPAAPS 40 Q8NFC7 ( 218) APLQPQAAEPPPRPKTPE 98NFC7 Q9BV51 ( 218) APLQPQAAEPPPRPKTPE 9 ( 21 Q95MN2|Q95MN2_RABIT ( 218) APLQPQAAEPPPRPKTPE 98) APL Q95L70|Q95L70_BISBI ( 218) APLQPQAAEPPPRPKTPE 9QPQAAEPPPRPKTPE 9 Q9N0E9|Q9N0E9_TURTR ( 218) APLQPQAAEPPPRPKTPE 9 Q6GMC8|Q6GMC8_XENQ6GMC8|Q6GMC8_XENLA ( 219) APLQPQANNPPSRPKTPE 22 Q59F14|Q59F14_HUMAN ( 246) APLQPQVRQSDLLGAQAG 95MAN ( 246) APLQPQV Q61QT5|Q61QT5_CAEBR ( 237) IPLKEAFSAPPPRPAAPS 37APPPRPAAPS 37 Q5R5Q5R5J0|Q5R5J0_PONPY ( 218) APLQPQAAEPPPRPKTPE 9 O08635|O08635_MOUSE ( 20) QIFKNQDPVLPPRPKPGH 205|O08635_MOUSE ( Q3TC92|Q3TC92_MOUSE ( 232) APLQPQSAEPPPRPKTPE 1032) APLQPQSAEPPPRPK Q3U5S4|Q3U5S4_MOUSE ( 218) APLQPQSAEPPPRPKTPE 10TPE 10 Q6DFH8|Q6DFQ6DFH8|Q6DFH8_XENLA ( 52) YVIKRQQPDLPPRPKPGH 43 Q499C5|Q499C5_XENTR ( 219) APLQPQASNPPPRPKTPE 13C5_XENTR ( 219) AP Q60FB5|Q60FB5_ONCMY ( 218) APLQPQVEEVPTRPKVPE 25LQPQVEEVPTRPKVPE 2 Q32N10|Q32N10_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 1616 Q5HYK7|Q5HYK7_HUQ5HYK7|Q5HYK7_HUMAN ( 387) QVFKNQDPVLPPRPKPGH 16 Q5U3B8|Q5U3B8_HUMAN ( 20) QVFKNQDPVLPPRPKPGH 16MAN ( 20) QVFKNQD Q9UFC8|Q9UFC8_HUMAN ( 8) QVFKNQDPVLPPRPKPGH 16PVLPPRPKPGH 16 Q4RQ4R8R2|Q4R8R2_MACFA ( 20) QVFKNQDPVLPPRPKPGH 16 O35146|O35146_MOUSE ( 10) KIVTPLDERPRGRPNDSG 100146|O35146_MOUSE ( Q1PCS1|Q1PCS1_MUSMM ( 218) APLQPQSAEPPPRPKTPE 10218) APLQPQSAEPPPRP Q1A3S2|Q1A3S2_9PERO ( 218) APLQPQVEDGPTRPKEPE 27KEPE 27 // http://blocks.fhcrc.org/blocks-bin/getblock.pl#IPB000108A Key parameters estimated in analysis: frequency (q ij ) with which residues i and j are found in the same column (averaged over all analysed blocks) frequency with which each pair of residues are found in same column if all sequences randomised where p i is the frequency with which residue i is present in the complete dataset, this is: p i * p j q ij / p i * p j < 1 and log(q ij / p i * p j ) < 0 If two residues i and j are less often found in the same column in blocks compared to randomised sequences then

40 Aidan Budd, EMBL Heidelberg aa Substitution Matrix Values What range of scores would you expect alignments that are more similar to alignments found in blocks, compared to alignments between random sequences, to have: A.scores < 0 B.0 < scores < 1 C.0 > scores Values in matrix cells are proportional to log(q ij / p i * p j ) SeqANAN SeqBNG- 3.80.5 score = 3.8 + 0.5 = 4.3

41 Aidan Budd, EMBL Heidelberg Sequence Similarity Searching Query sequence Database of sequences... Build optimal alignment between query sequence and each database sequence Calculate score for each optimal alignment For each alignment, calculate how many alignments between randomised (structurally dissimilar/evolutionarily unrelated) sequences you would expect with a score the same or greater than the score for this alignment This number is the E-value

42 Aidan Budd, EMBL Heidelberg Interpreting E-values E-value of 0.001 means that you would expect this score to be found for an alignment between randomised sequences once in every 1000 similar searches For each of values below, decide whether it it impossible that it be reported as the result of such a search: A.1 B.0 C.0.00001 D.1000000000 E.-1

43 Aidan Budd, EMBL Heidelberg Interpreting E-values You have carried out a BLAST search with a query sequence seqA against a database dbB, such that there are no sequences in dbB that are "related" to seqA What is the most likely E-value for the highest-scoring alignment?

44 Aidan Budd, EMBL Heidelberg Interpreting E-values You have carried out a BLAST search with a query sequence seqA against a database dbB, such that there are no sequences in dbB that are "related" to seqA You carry out several variants of this search, as described below. For each search, decide: the most likely value of the E-value for the highest-scoring alignment the most likely fraction of the score of highest-scoring alignment compared to that of the initial search (5 times larger? 5 times smaller etc.) delete the second half of seqA, search against dbB query seqA against a database (dbC) which contains two copies of every sequence in dbB query seqA against a database (dbD) from which every second sequence in dbB has been removed from


Download ppt "Aidan Budd, EMBL Heidelberg Pairwise Alignments and Sequence Similarity-Based Searching."

Similar presentations


Ads by Google