Published by Silas Tye. Modified over 9 years ago.
1
Computational Molecular Biology
Biochem 218 – BioMedical Informatics 231
http://biochem218.stanford.edu/
Doug Brutlag, Professor Emeritus, Biochemistry & Medicine (by courtesy)
Multiple Sequence Alignment
2
Homework 4: Sequence Alignment & Search
Part 1) Examining effect of gap penalties on sequence alignment
Please choose two proteins of interest that are only 30 to 50% identical and that have gaps in their alignment. The easiest way to find two such proteins is to choose one protein of interest to you, use it as a query in a BLAST search (either with UniProt or with NCBI BLAST), and then choose a second protein with a score 30 to 50% of the maximal score. Ensure that the alignment of these two proteins contains a few gaps.
Align your two protein sequences using SeqWeb's Bestfit or the SIM Alignment tool. Repeat the alignment with the same two sequences using the gap penalties 4, 8, 16, 32, and 64. Keep the ratio of the gap penalty to the gap extension penalty the same in all cases (a 4 to 1 ratio is fine).
Examine the alignments and describe the effect of raising the gap penalty on the number and arrangement of gaps seen in these alignments. Mention which of the alignments has the highest overall bit score (or quality) and comment on whether the gaps disrupt a biologically important site (i.e., do they interrupt known functional motifs or structural features revealed by InterPro or MyHits?). Which gap penalties give the most biologically meaningful alignment, in your opinion? Please show the alignments to support your conclusions.
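As a rough illustration of the parameter sweep described in Part 1, the sketch below lists the gap penalties from the assignment paired with extension penalties at a fixed 4 to 1 ratio. The function names are hypothetical, and the affine cost convention used here (opening penalty plus extension penalty times gap length) is an assumption: alignment tools differ on whether the first gapped residue is charged the extension penalty, so check the documentation of the tool you use.

```python
def gap_penalty_schedule(open_penalties=(4, 8, 16, 32, 64), ratio=4):
    """Pair each gap (opening) penalty with an extension penalty at a fixed ratio."""
    return [(gap_open, gap_open / ratio) for gap_open in open_penalties]


def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of `length` residues.

    Assumed convention: cost = gap_open + gap_extend * length.
    Some programs instead charge gap_open + gap_extend * (length - 1).
    """
    return gap_open + gap_extend * length


if __name__ == "__main__":
    for gap_open, gap_extend in gap_penalty_schedule():
        cost = affine_gap_cost(3, gap_open, gap_extend)
        print(f"open={gap_open:>2} extend={gap_extend:>4} cost of a 3-residue gap={cost}")
```

Under this convention, doubling both penalties doubles the cost of every gap, which is why higher settings should yield fewer and shorter gaps.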
3
Homework 4: Sequence Alignment & Search
Part 2) Comparing Smith-Waterman, UNGAPPED and GAPPED BLAST searches
Find a protein family containing only 50-100 members by examining different motifs in the PROSITE database. You should print out and keep your "gold standard" family list from UniProt.
Choose one sequence of the family as a query and use the Decypher supercomputer to perform UnGAPPED Tera-BLAST, Banded SW Tera-BLAST, and standard Smith-Waterman algorithm searches of the UniProt/SwissProt database. Make sure that in each case you collect at least 100 sequences in your result set (or twice the expected number of family members). Also be sure to turn query filtering OFF (the default is ON).
Now, compare the three searches using the Receiver-Operator Characteristic (ROC) curve. For each search, draw a line across each output list after every 10th sequence and count the number of true positives above that line and the number of false positives above that line. Remember that the gold standard determines whether a sequence is a true positive. Continue until you have collected at least 50 false positives (ROC50 curve).
Finally, plot the number of true positive sequences versus the number of false positive sequences on a two-dimensional graph for each search. Do the three searches have identically shaped curves? Is one curve higher than the others? If so, which search is the best for your protein family? Please send the actual ROC curves (either a graphics file or an EXCEL spreadsheet with the graphs in place) to support your conclusions. Send the results to homework218@cmgm.Stanford.EDU.
4
Evaluation of Search Algorithms
[Figure: Negatives and Positives, divided into regions labeled TN, FN, FP, and TP]
Sensitivity = TP/(TP+FN)
Specificity = TN/(TN+FP)
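The two formulas above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names, the example identifiers, and the database size are all hypothetical, and every database sequence that is neither a hit nor a family member is assumed to count as a true negative.

```python
def confusion_counts(hits, family, database_size):
    """Count TP, FP, FN, TN for one search result against a gold-standard family."""
    hits, family = set(hits), set(family)
    tp = len(hits & family)      # family members that were found
    fp = len(hits - family)      # hits that are not family members
    fn = len(family - hits)      # family members that were missed
    tn = database_size - tp - fp - fn  # everything else in the database
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    return tn / (tn + fp)


if __name__ == "__main__":
    # Hypothetical identifiers and database size, for illustration only.
    family = {"P1", "P2", "P3", "P4"}
    hits = ["P1", "P2", "P3", "X9"]
    tp, fp, fn, tn = confusion_counts(hits, family, database_size=1000)
    print("sensitivity:", sensitivity(tp, fn))
    print("specificity:", specificity(tn, fp))
```

Because the number of negatives in a sequence database is very large, specificity is usually close to 1, which is one reason to also compare raw true-positive and false-positive counts.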
5
Evaluation of Search Algorithms with Receiver-Operator Characteristic Curve
[Figure: ROC curve plotting Number True Positives against Number False Positives, with the Area Under Curve (AUC) indicated]
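One way to tabulate such a curve from a ranked hit list is sketched below. It follows the counting procedure from Homework 4 (a checkpoint after every 10th sequence, stopping at 50 false positives), but the function names and the trapezoidal area estimate are assumptions made for illustration rather than anything prescribed by the course.

```python
def roc_points(ranked_ids, gold_standard, step=10, max_false_positives=50):
    """Return (false positives, true positives) pairs down a ranked hit list.

    A point is recorded after every `step` hits and when the false-positive
    limit is reached, at which point counting stops.
    """
    gold = set(gold_standard)
    tp = fp = 0
    points = [(0, 0)]
    for rank, seq_id in enumerate(ranked_ids, start=1):
        if seq_id in gold:
            tp += 1
        else:
            fp += 1
        reached_limit = fp >= max_false_positives
        if rank % step == 0 or reached_limit:
            points.append((fp, tp))
        if reached_limit:
            break
    if points[-1] != (fp, tp):
        points.append((fp, tp))  # record the end of a short list
    return points


def area_under_curve(points):
    """Trapezoidal area under the (FP, TP) points, in raw count units."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


if __name__ == "__main__":
    # Tiny hypothetical ranking; real lists come from the search output.
    ranked = ["a", "b", "x", "c", "y", "d"]
    pts = roc_points(ranked, {"a", "b", "c", "d"}, step=2, max_false_positives=2)
    print(pts, area_under_curve(pts))
```

A search whose curve rises faster (more true positives before the same number of false positives) gives a larger area, so the areas from different searches can be compared directly as long as the same false-positive limit is used.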
6
Pyruvate Dehydrogenase E1 Family (EC 1.2.4.1) http://uniprot.org/
7
Decypher Home Page http://decypher.stanford.edu/
8
Decypher Search Input http://decypher.stanford.edu/
9
Vary the Similarity Threshold
10
General DNA Similarity Search Principles
Search both strands
Translate ORFs and cDNAs
Use most sensitive search possible
–UnGapped BLAST for infinite gap penalty (PCR & CHIP oligos)
–Gapped BLAST for most searches
–Smith-Waterman or megaBLAST or discontinuous MegaBLAST for cDNA/genome comparisons
–cDNA => Zero gap-length penalty
–Consider using transition matrices
–Ensure that expected value of score is negative
Examine results with exp. between 0.05 and 10
Reevaluate results of borderline significance using limited query
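The requirement that the expected score be negative can be checked directly: sum the score of every base pair weighted by the probability of seeing that pair by chance. The sketch below does this for a simple match/transition/transversion scheme. The specific score values and the uniform base frequencies are hypothetical choices for illustration, not values taken from the slides.

```python
from itertools import product

PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return a != b and ((a in PURINES) == (b in PURINES))


def make_scorer(match, transition, transversion):
    """Build a nucleotide scoring function from three (hypothetical) values."""
    def score(a, b):
        if a == b:
            return match
        return transition if is_transition(a, b) else transversion
    return score


def expected_score(score, freqs):
    """Expected score per aligned pair for unrelated sequences."""
    return sum(freqs[a] * freqs[b] * score(a, b)
               for a, b in product(freqs, repeat=2))


if __name__ == "__main__":
    uniform = {base: 0.25 for base in "ACGT"}
    print(expected_score(make_scorer(5, -4, -4), uniform))  # plain match/mismatch
    print(expected_score(make_scorer(5, -1, -4), uniform))  # milder transitions
    print(expected_score(make_scorer(5, -1, -1), uniform))  # too permissive
```

With uniform frequencies, a quarter of random pairs match, a quarter are transitions, and half are transversions, so the three schemes give -1.75, -1.0, and +0.5 respectively. The last one has a positive expected score and would let local alignments of unrelated sequences grow without limit.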
11
General Protein Similarity Search Principles
Choose between local or global search algorithm
Use most sensitive search algorithm available
–Original BLAST for no gaps
–Smith-Waterman for most flexibility
–Gapped BLAST for well delimited regions
–PSI-BLAST for families
–Initially BLOSUM62 and default gap penalties
–If no significant results, use BLOSUM30 and lower gap penalties
Examine results between exp. 0.05 and 10 for biological significance
Beware of long hits or those with unusual amino acid composition
Reevaluate results of borderline significance using limited query
12
Goals of Multiple Sequence Alignment
Determine Consensus Sequences
–Prosite Patterns
Building Gene Families
–InterPro, Prints, ProDom, pFAM, DOMO, COGs, KOGs
Develop Relationships & Phylogenies
–Clusters, COGs, KOGs, ClusTR
–Relationships
–Evolutionary Models
–UPGMA, Neighbor Joining, Phylip, GrowTree, PAUP
Model Protein Structures for Threading and Fold Prediction
–Profiles, Templates, HSSP, FSSP, SwissModel
–Hidden Markov Models, pFAM, SAM, SuperFamily
–Network Models, Neural Nets, Bayesian Networks
–Statistical Models, Generalized Linear Models
13
Consensus Sequence From a Multiple Sequence Alignment
14
Block Maker Makes a Multiple Sequence Alignment
http://blocks.fhcrc.org/blocks/blockmkr/make_blocks.html
NLQGYMLGNP
NFMGYMVGNG
NLKGFLVGNA
NLKGILIGNA
NLKGFAIGNG
NFKGYLVGNG
NLKGFIVGNP
NIKGYIQGNA
NLKGFMIGNA
NLQGYILGNP
NFKGFMVGNA
NLQGYVLGNP
10-45
PLLLWLNGGPGCSSIGYGASEEIG
PLVLWFNGGPGCSSVGFGAFEELG
PLMIWLTGGPGCSGLSSFVYEIGP
PLMIWLTGGPGCSGLSTFLYEFGP
PLLLWLSGGPGCSSLTGLLFENGP
PLVLWLNGGPGCSSVAYGAAEEIG
PVVIWLTGGPGCSSELALFYENGP
PLVIWFNGGPGCSSLGGAFKELGP
PLVIWFNGGPACSSLGGAFLELGP
PLVLWLNGGPGCSSLYGAFQELGP
PLVLWLNGGPGCSSIAYGASEEVG
PLTLWLNGGPGCSSVGGGAFTELG
25-55
TVKQWSGYMDYKDS
GVNQYSGYLSVGSN
SFAHYAGYVTVSED
DFAQYAGYVTVDAA
DLGHHAGYYKLPKS
SVESYSGFMTVDAK
GVKSYTGYLLANAT
NFKQYSGYYNVGTK
NFKSYSGYVDANAN
NFKHYSGFFQVSDN
DFFHYSGYLRAWTD
TVKQYTGYLDVEDD
40
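To make the idea of a consensus sequence concrete, the sketch below derives one from the first block on this slide by taking the most common residue in each column. The 50% threshold, the use of "x" for columns without a clear majority, and the function name are assumptions for illustration; PROSITE patterns and Block Maker itself use richer representations.

```python
from collections import Counter

# First ungapped block from the Block Maker slide (12 aligned segments).
BLOCK = [
    "NLQGYMLGNP", "NFMGYMVGNG", "NLKGFLVGNA", "NLKGILIGNA",
    "NLKGFAIGNG", "NFKGYLVGNG", "NLKGFIVGNP", "NIKGYIQGNA",
    "NLKGFMIGNA", "NLQGYILGNP", "NFKGFMVGNA", "NLQGYVLGNP",
]


def consensus(aligned, min_fraction=0.5, unknown="x"):
    """Most common residue per column, or `unknown` below `min_fraction`."""
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all rows must have the same aligned length")
    result = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        result.append(residue if count / len(aligned) >= min_fraction else unknown)
    return "".join(result)


if __name__ == "__main__":
    print(consensus(BLOCK))                    # majority rule (>= 50%)
    print(consensus(BLOCK, min_fraction=0.0))  # plurality rule
```

For this block the majority rule gives NLKGYxxGNx and the plurality rule gives NLKGYMVGNA. The invariant N, G, G, and N columns stand out, while three positions have no residue reaching half of the sequences.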
15
Sequence Profiles http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=hmmerpfam
16
SeqWeb Sequence Profile Search http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=hmmerpfam
17
Hidden Markov Models http://www.cse.ucsc.edu/research/compbio/sam.html
[Figure: hidden Markov model with states AA1 to AA6, I1 to I5, and D2 to D5]
18
Evolutionary Trees
[Figure: trees relating X, Y, and Z, with roots R1 and R2 and numbered branches 1 to 5]
19
Challenges Aligning Multiple Sequences
Computational complexity O(n^k) for k sequences n long
Space requirements O(n^k) for k sequences n long
Sequence clusters require weighting function
Weighted alignments tend to overweight erroneous sequences
Approximations must be used for real world data
–Linked lists used to find exact words shared between k sequences
–BLAST can find inexact shared words between k sequences
–FASTA can be used to do progressive pair-wise alignments
–HMM Pair models find best overall alignment probabilistically
Pairwise comparisons followed by Progressive Alignments
Final alignment is often dependent on order data presented
Gaps make alignment unnaturally long
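A quick calculation shows why the O(n^k) cost forces approximations. The sketch below counts the cells of the k-dimensional dynamic programming table for k sequences of length n. The sequence length of 300 is a hypothetical example, and the function name is an assumption.

```python
def dp_cells(n, k):
    """Number of cells in a k-dimensional alignment table for length-n sequences."""
    return n ** k


if __name__ == "__main__":
    n = 300  # hypothetical protein length
    for k in (2, 3, 5, 10):
        print(f"k={k:>2}: {dp_cells(n, k):.3e} cells")
```

Two sequences need 90,000 cells and three need 27 million, but ten sequences would need about 5.9 x 10^24 cells, far beyond any practical memory or run time. Filling each cell also means examining up to 2^k - 1 neighboring cells, so the true cost grows even faster than the cell count.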
20
Three Protein Alignment (Murata, Richardson & Sussman)
21
One Pairwise Alignment from the Three-Way Alignment
22
All Pairwise Alignments from the Three-Way Alignment
23
Carrillo-Lipman Limits for MSA http://searchlauncher.bcm.tmc.edu/multi-align/multi-align.html
24
Clustal Progressive Alignment (Step 1)
25
Clustal Progressive Alignment (Step 2)
26
Gaps Are Propagated To Make Alignment
[Figure: pairwise alignments involving SEQ1 (MGQF...), SEQ2 (MQQL...), SEQ3 (MKQL...), and SEQ4. Aligned segments shown: DNPYIVRMIGICEAE-SWM with DHPNIIRLEGVVTKSRPVM, and DNPYIVRMIGICEAESWM with QHPRLVRLYAVVTQEPIY.]
SEQ2:MQQL-DNPYIVRMIGICEAE-SWM
SEQ4:MKMIGKHKNIINLLGACTQDGPLY
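Gap propagation can be sketched as follows: when one member of an already aligned group gains new gap columns from a later pairwise alignment, the same columns are inserted into every other member of the group. The function name is hypothetical. The example rows are pieced together from the fragments on this slide, and pairing the MGQF prefix with the DHPNIIRLEGVVTKSRPVM segment as SEQ1 is an assumption about the original figure.

```python
def propagate_gaps(group, old_row, new_row, gap="-"):
    """Insert into every row of `group` the gap columns added to one member.

    `old_row` is a member of `group` as currently aligned; `new_row` is the
    same sequence after a later alignment added extra gap characters.
    """
    rebuilt = [[] for _ in group]
    i = 0  # position in old_row
    for char in new_row:
        if i < len(old_row) and char == old_row[i]:
            # Existing column: copy it from every member of the group.
            for out, member in zip(rebuilt, group):
                out.append(member[i])
            i += 1
        elif char == gap:
            # New gap column: every member of the group receives a gap.
            for out in rebuilt:
                out.append(gap)
        else:
            raise ValueError("new_row must equal old_row plus extra gaps")
    if i != len(old_row):
        raise ValueError("new_row is missing part of old_row")
    return ["".join(out) for out in rebuilt]


if __name__ == "__main__":
    seq1 = "MGQFDHPNIIRLEGVVTKSRPVM"       # assumed SEQ1 row
    seq2_old = "MQQLDNPYIVRMIGICEAE-SWM"   # SEQ2 as aligned with SEQ1
    seq2_new = "MQQL-DNPYIVRMIGICEAE-SWM"  # SEQ2 as aligned with SEQ4
    for row in propagate_gaps([seq1, seq2_old], seq2_old, seq2_new):
        print(row)
```

Here the gap that SEQ4 forces into SEQ2 after MQQL is copied into SEQ1 as well, so both rows grow to the same 24 columns as the SEQ4 row. Gaps are only ever added, which is one reason progressive alignments tend to become long.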
27
Clustal Procedure
28
Clustal Dendrogram
29
Clustal Globin Alignment
30
ClustalW Step 1: BLOSUM Distance Matrix
ClustalW Step 2: Dendrogram
31
ClustalW Sequence Weighting
32
ClustalW Residue Specific Gap Penalties
33
Position Specific Gap Penalties
34
ClustalW Step 3: Progressive Alignment
35
T-Coffee Procedure
36
Regular Progressive Alignment
37
T-Coffee Primary Alignment Library
38
T-Coffee Extended Alignment Library and Progressive Alignment
39
Comparison of T-Coffee to Other MSAs
40
MUSCLE Edgar (2004) NAR 32, 1792-1797
41
MUSCLE Edgar (2004) NAR 32, 1792-1797
42
MUSCLE BaliBase Test Edgar (2004) NAR 32, 1792-1797
43
ProbCons Do, et al. Genome Research 15, 330-340
44
ProbCons Do, et al. Genome Research 15, 330-340
Step 1: Compute posterior probability matrices for each pair of aligned sequences from the pair-HMM model.
Step 2: Compute expected accuracies of pairwise alignments.
Step 3: Apply the probabilistic consistency transformation.
Step 4: Calculate a guide tree using UPGMA from a measure of the similarity of each sequence pair.
Step 5: Perform progressive alignment.
Step 6: Refine by dividing the sequences into two groups and realigning. Repeat multiple times.
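Step 4 relies on UPGMA clustering, which repeatedly merges the two closest clusters and averages their distances to the rest. The sketch below is a generic textbook version, not the ProbCons implementation: it takes distances rather than similarities (for example 1 minus similarity, which is an assumption here), and the sequence names and distance values are hypothetical.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a guide tree as nested tuples by UPGMA clustering.

    `distances` maps frozenset({a, b}) to the distance between sequences a and b.
    """
    trees = {i: name for i, name in enumerate(names)}
    sizes = {i: 1 for i in trees}
    d = {(i, j): distances[frozenset((names[i], names[j]))]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        merged = {}
        for k in trees:
            if k in (i, j):
                continue
            d_ik = d[(min(i, k), max(i, k))]
            d_jk = d[(min(j, k), max(j, k))]
            # Size-weighted average keeps every leaf equally weighted.
            merged[k] = (sizes[i] * d_ik + sizes[j] * d_jk) / (sizes[i] + sizes[j])
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        for k, value in merged.items():
            d[(k, next_id)] = value  # k is always smaller than next_id
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = sizes.pop(i) + sizes.pop(j)
        next_id += 1
    return next(iter(trees.values()))


if __name__ == "__main__":
    # Hypothetical distances between four sequences.
    dist = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("AD"): 10,
            frozenset("BC"): 6, frozenset("BD"): 10, frozenset("CD"): 10}
    print(upgma(["A", "B", "C", "D"], dist))
```

In this example A and B are joined first, then C, then D, and progressive alignment would proceed in the same order from the leaves toward the root of the tree.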
45
ProbCons Do, et al. Genome Research 15, 330-340
46
ProbCons Do, et al. Genome Research 15, 330-340
47
ProbCons Do, et al. Genome Research 15, 330-340
48
SeqWeb ClustalW http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
49
SeqWeb ClustalW MSA Parameters http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
50
SeqWeb ClustalW Alignment http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
51
SeqWeb ClustalW Text Output http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=clustalw-prot
52
SeqWeb Pileup Input http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pileup-prot
53
SeqWeb Pileup Input http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pretty-prot
54
SeqWeb Pileup Dendrogram http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pileup-prot
55
SeqWeb Pretty Input http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pretty-prot
56
SeqWeb Pretty Alignment and Consensus http://seqweb.stanford.edu:81/gcg-bin/analysis.cgi?program=pileup-prot
57
Decypher ClustalW Input http://decypher.stanford.edu/
58
Decypher ClustalW Results http://decypher.stanford.edu/
59
Decypher ClustalW Results http://decypher.stanford.edu/
60
ClustalW @ EBI Input http://www.ebi.ac.uk/clustalw/
61
ClustalW @ EBI Results http://www.ebi.ac.uk/clustalw/
62
ClustalW @ EBI Results http://www.ebi.ac.uk/clustalw/