Introduction to Bioinformatics From Pairwise to Multiple Alignment.


1 Introduction to Bioinformatics From Pairwise to Multiple Alignment

2 Outline Advances in BLAST Multiple Sequence Alignment- CLUSTAL

3 Scoring system for BLAST Substitution Matrix + Gap Penalty

4 Substitution Matrix BLOSUM matrices are based on the replacement patterns found in the more highly conserved, ungapped regions of the sequences. PAM matrices are based on mutations observed throughout a global alignment, which includes both highly conserved and highly mutable regions.

5 Gap penalty Example showed -1 score per indel –So gap cost is proportional to its length Biologically, indels occur in groups –We want our gap score to reflect this Standard solution: affine gap model –Once-off cost for opening a gap –Lower cost for extending the gap –Changes required to algorithm
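The difference between the linear and the affine gap model can be sketched in a few lines. This is a minimal illustration, not the scoring used by any particular tool: the function names and the penalty values (`per_indel=1`, `gap_open=10`, `gap_extend=1`) are assumptions chosen for the example, and the convention "opening cost covers the first gap position" is only one of several in use.

```python
def linear_gap_cost(length, per_indel=1):
    """Linear model: every gap position costs the same (hypothetical value)."""
    return per_indel * length


def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Affine model: a once-off opening cost plus a lower cost per extension.

    Assumed convention: gap_open pays for the first gap position,
    gap_extend for each additional one.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


# Under the linear model, one gap of length 3 and three gaps of length 1
# cost the same; under the affine model the single grouped gap is cheaper.
one_long_gap = affine_gap_cost(3)
three_short_gaps = 3 * affine_gap_cost(1)
```

With these assumed values a single gap of length 3 is penalized far less than three separate one-position gaps, which is how the affine model reflects indels occurring in groups.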

6 Statistical significance

7 E-value The number of hits (with a similarity score at least as high) one can "expect" to see just by chance when searching for the given sequence in a database of a particular size. The higher the E-value, the less significant the similarity –"sequences with E-value of less than 0.01 are almost always found to be homologous" The lower bound is 0 (we want the hits with the lowest E-values)

8 Expectation Values Increases linearly with length of query sequence Increases linearly with length of database Decreases exponentially with score of alignment
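These three dependencies are what the Karlin-Altschul formula E = K * m * n * exp(-lambda * S) expresses, where m is the query length, n the database length and S the alignment score. The sketch below only illustrates the shape of the formula: the function name is hypothetical, and the default values of `k` and `lam` are placeholders, since the real parameters depend on the scoring system in use.

```python
import math


def expect_value(score, query_len, db_len, k=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S).

    k and lam are illustrative placeholder values, not parameters
    taken from the slides or from a specific BLAST configuration.
    """
    return k * query_len * db_len * math.exp(-lam * score)


base = expect_value(score=50, query_len=300, db_len=1_000_000)
longer_query = expect_value(score=50, query_len=600, db_len=1_000_000)
higher_score = expect_value(score=60, query_len=300, db_len=1_000_000)
```

Doubling the query length doubles the E-value, while raising the score by a fixed amount divides it by a constant factor.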

9 Remote homologues Sometimes BLAST isn't enough. In a large protein family, BLAST may return only the close members. We want more distant members: PSI-BLAST

10 Position Specific Iterated BLAST (1) Regular BLAST search (2) Construct profile from BLAST results (3) BLAST profile search (4) Final results

11 PSI-BLAST Advantage: PSI-BLAST looks for sequences that are close to ours, and learns from them to extend the circle of friends. Disadvantage: if a WRONG (unrelated) sequence is picked up, the search will drift to further unrelated sequences, and this gets worse with each iteration

12 Multiple Sequence Alignment MSA

13
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
Like pairwise alignment BUT compare n sequences instead of 2. Rows represent individual sequences. Columns represent the 'same' position. There may be gaps in some sequences.

14 Why multiple alignments? BLAST usually obtains many sequences that are significantly similar to the query sequence. Practically, comparing each and every sequence to every other may be impractical when the number of sequences is large. Solution: generating a profile
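One simple way to picture a profile is as a table of per-column symbol frequencies computed from an existing alignment. The sketch below assumes that representation; the function name `build_profile` is hypothetical, gaps are counted as an ordinary symbol for simplicity, and real profile methods typically add sequence weighting and pseudocounts that are omitted here.

```python
from collections import Counter


def build_profile(rows):
    """Return one {symbol: frequency} dict per alignment column.

    Assumes all rows have the same length; '-' is counted like any
    other symbol (a simplification made for this example).
    """
    profile = []
    for column in zip(*rows):
        counts = Counter(column)
        profile.append({sym: n / len(column) for sym, n in counts.items()})
    return profile


# Four aligned rows used as example input.
rows = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGSSSNIGS--ITVNWYQQLPG",
    "LRLSCTGSGFIFSS--YAMYWYQQAPG",
    "LSLTCTGSGTSFDD-QYYSTWYQQPPG",
]
profile = build_profile(rows)
```

A new sequence can then be compared against the profile column by column instead of against every member of the family.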

15 MSA MSA can give you a better picture of functional sites on proteins and nucleic acids as well as the forces that shape evolution!
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
Important amino acids or nucleotides are not allowed to mutate. Less important positions change more easily.

16 Alignment Example
Column score: 4/4 = 1, 3/4 = 0.75, 2/4 = 0.5, 1/4 = 0
With gaps:
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
4*1 + 11*0.75 + 2*0.5, Score = 13.25
Without gaps:
GTCGTAGTCGGCTCGAC
GTCTAGCGAGCGTGAT
GCGAAGAGGCGAGC
GCCGTCGCGTCGTAAC
1*1 + 2*0.75 + 11*0.5, Score = 8
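The two scores on this slide can be reproduced with a short script. The scoring rule is inferred from the slide's numbers rather than stated in full there, so treat it as an assumption: each column scores the fraction of rows sharing its most common nucleotide, gaps are not counted as matches, and a column where no nucleotide occurs more than once scores 0. The function names are hypothetical.

```python
from collections import Counter


def column_score(column):
    """Fraction of rows sharing the most common residue; 1/4 counts as 0."""
    residues = [c for c in column if c != "-"]  # gaps never count as matches
    if not residues:
        return 0.0
    top = Counter(residues).most_common(1)[0][1]
    return 0.0 if top <= 1 else top / len(column)


def alignment_score(rows):
    # zip stops at the shortest row, so unequal ungapped sequences are
    # compared only over their common leading columns.
    return sum(column_score(col) for col in zip(*rows))


gapped = [
    "GTCGTAGTCG-GC-TCGAC",
    "GTC-TAG-CGAGCGT-GAT",
    "GC-GAAG-AG-GCG-AG-C",
    "GCCGTCG-CG-TCGTA-AC",
]
ungapped = [
    "GTCGTAGTCGGCTCGAC",
    "GTCTAGCGAGCGTGAT",
    "GCGAAGAGGCGAGC",
    "GCCGTCGCGTCGTAAC",
]
gapped_score = alignment_score(gapped)      # 13.25
ungapped_score = alignment_score(ungapped)  # 8.0
```

The gapped alignment scores higher because inserting gaps brings matching nucleotides into the same columns.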

17 Example of 3 sequences:

18 Dynamic Programming Pairwise A–B alignment table –Cell (i,j) = score of best alignment between first i elements of A and first j elements of B –Complexity: length of A × length of B 3-way A–B–C alignment table –Cell (i,j,k) = score of best alignment between first i elements of A, first j of B, first k of C –Complexity: length A × length B × length C Example: protein family alignment –100 proteins, 1000 amino acids each –Complexity: 10^300 table cells –Calculation time: longer than the time since the big bang!
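The pairwise table described above can be filled in as follows. This is a minimal sketch of global alignment scoring with a linear gap penalty; the function name and the scoring values (`match=1`, `mismatch=-1`, `gap=-1`) are assumptions for illustration, not values taken from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (linear gap cost).

    table[i][j] holds the best score for aligning the first i elements
    of a with the first j elements of b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap  # first i elements of a against nothing
    for j in range(1, cols):
        table[0][j] = j * gap  # first j elements of b against nothing
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            table[i][j] = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
    return table[-1][-1]


# The table has (len(a) + 1) * (len(b) + 1) cells. Extending the same idea
# to 100 sequences of length 1000 needs on the order of 1000 ** 100 cells,
# which equals 10 ** 300.
cells_for_family = 1000 ** 100
```

The two nested loops make the cost proportional to the product of the sequence lengths, which is why adding one table dimension per sequence quickly becomes infeasible.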

19 Feasible Approach Based on pairwise alignment scores –Build n by n table of pairwise scores Align similar sequences first –After alignment, consider as single sequence –Continue aligning with further sequences

20 –For n sequences, there are n × (n-1)/2 pairs
GTCGTAGTCG-GC-TCGAC
GTC-TAG-CGAGCGT-GAT
GC-GAAG-AG-GCG-AG-C
GCCGTCG-CG-TCGTA-AC
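The all-against-all comparison can be sketched with the standard library. Note the stand-in: instead of a real pairwise alignment score, this example uses `difflib.SequenceMatcher.ratio()` as a rough similarity measure, purely to keep the code short. The function name `pairwise_table` and the sequence labels are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_table(seqs):
    """Map each unordered pair of sequence names to a similarity in [0, 1].

    SequenceMatcher.ratio() is a stand-in for a proper alignment score.
    """
    table = {}
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        table[(name_a, name_b)] = SequenceMatcher(None, seq_a, seq_b).ratio()
    return table


seqs = {
    "s1": "GTCGTAGTCGGCTCGAC",
    "s2": "GTCTAGCGAGCGTGAT",
    "s3": "GCGAAGAGGCGAGC",
    "s4": "GCCGTCGCGTCGTAAC",
}
table = pairwise_table(seqs)
n = len(seqs)
expected_pairs = n * (n - 1) // 2  # 6 pairs for 4 sequences
```

Only the pairs above the diagonal are stored, since the similarity of A to B is the same as that of B to A.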

21
1 GTCGTAGTCG-GC-TCGAC
2 GTC-TAG-CGAGCGT-GAT
3 GC-GAAGAGGCG-AGC
4 GCCGTCGCGTCGTAAC

1 GTCGTA-GTCG-GC-TCGAC
2 GTC-TA-G-CGAGCGT-GAT
3 G-C-GAAGA-G-GCG-AG-C
4 G-CCGTCGC-G-TCGTAA-C

22 CLUSTAL method Higgins and Sharp 1988 –ref: CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 73, 237–244. An approximation strategy (heuristic algorithm) yields a possible alignment, but not necessarily the best one. Progressive Sequence Alignment

23 First step: Compute the pairwise alignments for all against all; the similarities are stored in a table. [Table: pairwise similarity scores for sequences A, B, C, D]

24 Second step: cluster the sequences to create a tree. [Figure: tree with leaves A, D, C, B] The tree represents the order in which pairs of sequences are to be aligned. Similar sequences are neighbors in the tree. Distant sequences are distant from each other in the tree.
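The clustering step can be illustrated with a small average-linkage procedure that repeatedly joins the two most similar clusters. This is a sketch of the general idea, not the tree-building method CLUSTAL itself uses; the function name `guide_tree` and the similarity values are made up for the example.

```python
from itertools import combinations


def guide_tree(names, sim):
    """Build a nested-tuple tree by joining the most similar clusters first.

    sim maps frozenset({a, b}) to a similarity score (higher = closer).
    Cluster similarity is the average over all member pairs.
    """
    nodes = [(name, [name]) for name in names]  # (subtree, member leaves)

    def average_similarity(m1, m2):
        total = sum(sim[frozenset((a, b))] for a in m1 for b in m2)
        return total / (len(m1) * len(m2))

    while len(nodes) > 1:
        i, j = max(
            combinations(range(len(nodes)), 2),
            key=lambda p: average_similarity(nodes[p[0]][1], nodes[p[1]][1]),
        )
        (tree_i, members_i), (tree_j, members_j) = nodes[i], nodes[j]
        merged = ((tree_i, tree_j), members_i + members_j)
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0][0]


# Hypothetical similarity scores, not the values from the slide's table.
sim = {
    frozenset(("A", "B")): 9, frozenset(("C", "D")): 8,
    frozenset(("A", "C")): 2, frozenset(("A", "D")): 3,
    frozenset(("B", "C")): 2, frozenset(("B", "D")): 1,
}
tree = guide_tree(["A", "B", "C", "D"], sim)
```

Reading the nested tuples from the inside out gives the alignment order: the closest pairs are aligned first, and the resulting groups are joined last.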

25 Join alignments: NYLS and NKYLS align to N K/- Y L S; NFS and NFLS align to N F L/- S; joining the two alignments gives N K/- Y/F L/- S
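The slash notation used for the joined alignment (N K/- Y/F L/- S) lists the alternatives found in each column. The sketch below produces that notation from a finished alignment of the four sequences; the aligned rows are an assumption about what the joined alignment looks like, and the function name is hypothetical.

```python
def column_summary(rows):
    """Describe each column as its distinct symbols joined by '/'.

    Residues are listed in order of first appearance, with the gap
    symbol placed last.
    """
    summary = []
    for column in zip(*rows):
        distinct = list(dict.fromkeys(column))  # unique, order preserved
        residues = [s for s in distinct if s != "-"]
        gaps = ["-"] if "-" in distinct else []
        summary.append("/".join(residues + gaps))
    return " ".join(summary)


# Assumed final alignment of NYLS, NKYLS, NFS and NFLS.
rows = ["N-YLS", "NKYLS", "N-F-S", "N-FLS"]
joined = column_summary(rows)
```

Here `joined` is "N K/- Y/F L/- S", matching the notation on the slide.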

26 Treating Gaps in ClustalW Penalty for opening gaps and additional penalty for extending the gap. Gaps found in the initial alignment remain fixed. New gaps are introduced as more sequences are added (decreased penalty if a gap already exists). Gap penalties are decreased within stretches of hydrophilic residues.

27 MSA Approaches Progressive approach: CLUSTALW (CLUSTALX) http://www.ebi.ac.uk/clustalw/, PILEUP, T-COFFEE. Iterative approach: repeatedly realign subsets of sequences. MultAlin, DiAlign. Statistical methods: Hidden Markov Models, SAM2K. Genetic algorithm: SAGA.

