Lecture invitation 8.4. 2015 AI, 14:30 (we will finish earlier this day, at 14:20) How Lab IT Accelerates Pharmaceutical Research For more information,

1 Lecture invitation 8.4. 2015 AI, 14:30 (we will finish earlier this day, at 14:20). How Lab IT Accelerates Pharmaceutical Research. For more information, see http://ich.vscht.cz/~svozil/teaching.html

2 Last lecture summary

3 Flavors of sequence alignment; homology; scoring DNA alignment, gaps; substitution matrix; scoring protein alignment; PAM matrices, PAM1, higher PAM

4 New stuff

5 Protein substitution matrices – BLOSUM

6 BLOSUM matrices I BLOck SUbstitution Matrix by Henikoff and Henikoff, 1992. They used the BLOCKS database containing multiple alignments of ungapped segments (blocks). These alignments correspond to the most highly conserved regions of proteins. Blocks are ungapped sequence motifs. A sequence motif is a conserved stretch of amino acids conferring a specific function to a protein. Any given protein can contain one or more blocks corresponding to its structural/functional motifs.

7 Blocks......

8 BLOSUM matrices II Thus the Henikoffs focused on substitution patterns only in the most conserved regions of a protein. These regions are (presumably) least prone to change. The substitution patterns of 2000 blocks (a block is the whole alignment, not the individual sequences within it) representing more than 500 groups were examined, and BLOSUM matrices were generated. Sequences sharing no more than 62% identity were used to calculate the BLOSUM62 matrix. Short and clear explanation of the BLOSUM62 derivation: Eddy SR. Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol. 2004;22(8):1035-6. PMID: 15286655.
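The slides do not spell out the arithmetic, so the following is only a minimal sketch of the log-odds idea behind BLOSUM-style scores, assuming scores in half-bit units, s = round(2 * log2(observed / expected)). The function name `log_odds_scores` and the two-column toy block are hypothetical. The sketch also skips the clustering of sequences above the identity threshold (62% for BLOSUM62), and it scores only residue pairs that actually occur in the block.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(block):
    """Half-bit log-odds scores from the columns of one ungapped block.

    `block` is a list of equal-length sequences (the rows of the alignment).
    Only residue pairs observed in at least one column receive a score.
    """
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total_pairs = sum(pair_counts.values())
    # q[(a, b)]: observed frequency of the unordered pair a/b.
    q = {pair: n / total_pairs for pair, n in pair_counts.items()}

    # p[a]: background frequency of residue a, derived from the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected pair frequency if residues paired purely by chance.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical toy block: four sequences, two columns.
toy_block = ["AI", "AI", "AL", "SV"]
toy_scores = log_odds_scores(toy_block)
```

A positive score means the pair is seen more often in the block than chance predicts; a negative score means it is seen less often. With a block this small the numbers are not biologically meaningful.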

9 BLOSUM matrices III BLOSUM matrices are based on an entirely different type of sequence analysis (local ungapped alignment vs. global gapped alignment in PAM) and on a much larger data set than PAM. All BLOSUM matrices are based on observed alignments. They are not based on extrapolations like PAM. The BLOSUM numbering system runs in the reverse order to the PAM numbering system. The lower the BLOSUM number, the more divergent the sequences it represents.

10 PAM vs. BLOSUM I You may ask which particular matrix should be used. Dayhoff et al. (1978) defined the terms protein family and superfamily. A protein family is formed by sequences 85% (or more) identical to each other. A protein superfamily is defined as sequences related at 30% or greater. A superfamily may clearly contain many families. These terms are widely used in contemporary literature, however with different meanings (we’ll come to that later). Guidance in the choice of scoring matrix: Wheeler D. Selecting the right protein-scoring matrix. Curr Protoc Bioinformatics. 2002;Chapter 3:Unit 3.5. www.nshtvn.org/ebook/molbio/Current%20Protocols/CPB/bi0305.pdf

11 PAM vs. BLOSUM II – PAM At the time of deriving PAM matrices, most known proteins were small, globular and hydrophilic. If a researcher believes the protein contains substantial hydrophobic regions, PAM matrices are not that useful. The most widely used is PAM250. It is capable of detecting similarities in the 30% range (i.e. superfamilies). Another point of view: PAM250 provides the best look-back in evolutionary time. PAM250 is most effective if the goal is to find the widest possible range of proteins similar to the given protein.

12 PAM vs. BLOSUM III – PAM Assume a protein is a known member of the serine protease family. Using the protein as a query against protein databases with PAM250 will detect virtually all serine proteases, but also a considerable number of irrelevant hits. In this case, the PAM160 matrix should be used. It detects similarities in the 50% to 60% range (Altschul, 1991). And to find only those proteins most similar (70% - 90%) to the query protein, use PAM40. Let’s summarize: locate all potential similarities – PAM250; determine if the protein belongs to the protein family – PAM160; determine the most similar proteins – PAM40.

13 PAM vs. BLOSUM IV – BLOSUM The most widely used is BLOSUM62. BLOSUM62 appears to be superior to PAM250 in detecting distant relationships even if the PAM method is updated with current data sets. BLOSUM62 is capable of accurately detecting similarities down to the 30% range (superfamilies). Determine if the protein belongs to a protein family – BLOSUM80 (detects identities at the 50% level); determine the most similar proteins – BLOSUM90.

14 Selecting an Appropriate Matrix
Matrix     Best use                                               Similarity (%)
PAM40      Short highly similar alignments                        70-90
PAM160     Detecting members of a protein family                  50-60
PAM250     Longer alignments of more divergent sequences          ~30
BLOSUM90   Short highly similar alignments                        70-90
BLOSUM80   Detecting members of a protein family                  50-60
BLOSUM62   Most effective in finding all potential similarities   30-40
BLOSUM30   Longer alignments of more divergent sequences          <30
The Similarity column gives the range of similarities that the matrix is best able to detect.
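The table can be read as a simple lookup. The helper below is a hypothetical illustration, not part of the lecture. The table leaves the 40-50% and 60-70% ranges unassigned, so the sketch assumes, for illustration only, that such values fall through to the next lower row.

```python
def suggest_matrices(expected_similarity):
    """Return a (PAM, BLOSUM) suggestion for an expected percent similarity.

    Thresholds follow the table; values in the ranges the table leaves
    unassigned (40-50, 60-70) fall through to the next lower row, which is
    an assumption of this sketch.
    """
    if expected_similarity >= 70:
        return ("PAM40", "BLOSUM90")
    if expected_similarity >= 50:
        return ("PAM160", "BLOSUM80")
    if expected_similarity >= 30:
        return ("PAM250", "BLOSUM62")
    # Below 30% the table lists only a BLOSUM matrix.
    return (None, "BLOSUM30")


suggestion = suggest_matrices(55)
```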

15 PAM vs. BLOSUM V – comparison Careful information theory analysis showed that the following matrices are equivalent: PAM250 is equivalent to BLOSUM45; PAM160 is equivalent to BLOSUM62; PAM120 is equivalent to BLOSUM80. Compared to the PAM160 matrix, BLOSUM62 is less tolerant to substitutions involving hydrophilic amino acids, and more tolerant to substitutions involving hydrophobic amino acids. Although both PAM250 and BLOSUM62 detect similarities at the 30% level, BLOSUM uses a much wider range of proteins, so PAM250 is actually equivalent to BLOSUM45 when considering all proteins, not just those that are hydrophilic.

16 Sequence alignment algorithms

17 Pairwise alignment algorithms Dot plot (dot matrix): graphical way of comparing two sequences. Dynamic programming: slow, but formally optimizing. Heuristic methods: efficient, but not as thorough. Word (also k-tuple) methods: used in database searches.

18 Dot plot

19 A graphical method that allows the comparison of two biological sequences and identifies regions of close similarity between them. Also used for finding direct or inverted repeats in sequences, or for predicting regions in RNA that are self-complementary and therefore have the potential to form secondary structures.
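As a minimal sketch of the idea, the simplest dot plot marks every position where a residue of one sequence equals a residue of the other. The function names `dot_plot` and `render` and the text rendering with `*` and `.` are choices of this illustration, not something the lecture prescribes.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix of booleans: True where seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(matrix):
    """Draw the matrix as text, one row per residue of the first sequence."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Comparing a sequence against itself always gives an unbroken main diagonal.
print(render(dot_plot("GATTACA", "GATTACA")))
```

Off-diagonal marks in such an unfiltered plot are mostly chance matches, which is why real tools filter the matrix before display.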

20

21 Self-similarity dot plot I The DNA sequence EU127468.1 compared against itself. Introduction to dot-plots, Jan Schulz http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

22 Figure labels: runs of matched residues, gap, background noise

23 Self-similarity dot plot II Introduction to dot-plots, Jan Schulz http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76 The DNA sequence EU127468.1 compared against itself. Window size = 16. Linear color mapping

24 Improving dot plot Sliding window: window size (let's say 11). Stringency (let's say 7): a dot is printed only if at least 7 out of the next 11 positions in the sequence are identical. Color mapping: scoring matrices can be used to assign a score to each substitution. These numbers can then be converted to gray levels or colors.
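The window and stringency filter can be sketched as follows. This is an illustrative implementation under stated assumptions: the function name `windowed_dot_plot` is hypothetical, a dot is set when at least `stringency` of the `window` positions starting at (i, j) are identical, and positions too close to the sequence ends to hold a full window are dropped.

```python
def windowed_dot_plot(seq_a, seq_b, window=11, stringency=7):
    """Dot plot filtered with a sliding window.

    Cell (i, j) is True when at least `stringency` of the `window`
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identities along the diagonal inside the window.
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(window)
            )
            row.append(matches >= stringency)
        plot.append(row)
    return plot


# Small window so the effect is visible on a short, repetitive sequence.
seq = "ACGTACGTAC"
plot = windowed_dot_plot(seq, seq, window=4, stringency=3)
```

For color mapping, the boolean `matches >= stringency` could be replaced by the match count itself (or by a sum of substitution-matrix scores) and mapped to a gray level.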

25 Interpretation of dot plot I 1. Plot two homologous sequences of interest. If they are similar, a diagonal line will occur (matches). 2. Frame shifts: a) mutations: gaps in the diagonal; b) insertions: shift of the main diagonal; c) deletions: shift of the main diagonal. http://ugene.unipro.ru/documentation/manual/plugins/dotplot/interpret_a_dotplot.html

26 Interpretation of dot plot II Identify repeat regions (direct repeats, inverted repeats): lines parallel to the diagonal line in a self-similarity plot. Microsatellites and minisatellites (these are also called low-complexity regions) can be identified as “squares”. Palindromic sequences are shown as lines perpendicular to the main diagonal. Palindromic sequence: V ELIPSE SPI LEV (a Czech palindrome, "a lion sleeps in an ellipse"). Bioinformatics explained: Dot plots, http://www.clcbio.com/index.php?id=1330&manual=BE_Dot_plots.html

27 Repeats in dot plot Self-similarity dot plot of the NA sequence of human LDL receptor, window 23, stringency 7. Figure labels: direct repeats, minisatellites, inverted repeats. From the book Bioinformatics, David W. Mount.

28 Interpretation of dot plot – summary http://www.code10.info/index.php?option=com_content&view=article&id=64:inroduction-to-dot-plots&catid=52:cat_coding_algorithms_dot-plots&Itemid=76

29 Dot plot of the human genome A. M. Campbell, L. J. Heyer, Discovering genomics, proteomics and bioinformatics

30 Dot plot rules A larger window size is used for DNA sequences because the number of random matches is much greater due to the presence of only four characters in the alphabet. A typical window size for DNA is 15, with stringency 10. For proteins the matrix need not be filtered at all, or a window of 2 or 3 with stringency 2 can be used. If two proteins are expected to be related but to have long regions of dissimilar sequence with only a small proportion of identities, such as similar active sites, a large window, e.g., 20, and small stringency, e.g., 5, should be useful for seeing any similarity.

31 Dot plot advantages/disadvantages Advantages: All possible matches of residues between two sequences are found. It’s just up to you to choose the most significant ones. Readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods. Disadvantages: Most dot matrix computer programs do not show an actual alignment. Does not return a score to indicate how ‘optimal’ a given alignment is (no statistical significance that could be tested).

32 Dynamic programming

33 Dynamic programming (DP) General class of algorithms typically applied to optimization problems. Recursive approach. The original problem is broken into smaller subproblems and then solved. Pieces of the larger problem have a sequential dependency: the 4th piece can be solved using the solution of the 3rd piece, the 3rd piece using the solution of the 2nd piece, and so on.

34 We want to align the following two sequences:
ABCDE
PQRST
If you already have the optimal solution to:
A…D
P…R
then you know the next pair of characters will be one of:
A…DE      A…D-      A…DE
P…RS  or  P…RS  or  P…R-
You can extend the match by determining which of these has the highest score.
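The three-way choice above is the core recurrence of global alignment by dynamic programming. The sketch below fills the score table and returns only the optimal score. The function name `global_alignment_score` is hypothetical, and the scoring values (match +1, mismatch -1, linear gap penalty -2) are arbitrary assumptions chosen for illustration; a real protein alignment would take its pair scores from a substitution matrix.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (score only, no traceback)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # a[i-1] against a gap
                score[i][j - 1] + gap,       # b[j-1] against a gap
            )
    return score[-1][-1]


print(global_alignment_score("ABCDE", "ABDE"))
```

Each cell depends only on its three already-computed neighbors, which is the sequential dependency that makes the problem suitable for dynamic programming.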

35 Figure labels: Sequence A, Sequence B, best previous alignment. New best alignment = previous best + local best.

36 DP algorithms Global alignment: Needleman-Wunsch. Local alignment: Smith-Waterman. Guaranteed to provide the optimal alignment. Disadvantages: Slow due to the very large number of computational steps, O(n²) for two sequences of length n. Computer memory requirement also increases with the square of the sequence lengths. Therefore, it is difficult to use the method for very long sequences. Many alignments may give the same optimal score, and none of them is guaranteed to correspond to the biologically correct alignment.
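For the local variant, a minimal Smith-Waterman sketch is given below. It differs from global alignment in that cell scores are never allowed to drop below zero and the traceback starts at the highest-scoring cell. The function name `smith_waterman` and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions of this illustration. When several alignments share the optimal score, the sketch returns just one of them.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Full table: memory grows with len(a) * len(b), as noted on the slide.
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                       # start a new local alignment here
                h[i - 1][j - 1] + pair,  # extend diagonally
                h[i - 1][j] + gap,       # gap in b
                h[i][j - 1] + gap,       # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


result = smith_waterman("XXACGTYY", "ZZACGTWW")
```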

