1
Bioinformatics: sequence alignment and BLAST
From Jonathan Pevsner, Bioinformatics and Functional Genomics, Wiley-Blackwell (2015)
2
Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
3
Outline, part 2
• BLAST
--Practical use
--Algorithm
--Strategies
• Finding distantly related proteins
--PSI-BLAST
--Hidden Markov models (Bao’s lecture)
• BLAST-like tools for genomic DNA
--PatternHunter
--Megablast
--BLAT, BLASTZ
4
Learning objectives
• Define homologs, paralogs, orthologs
• Perform pairwise alignments (NCBI BLAST)
• Understand how scores are assigned to aligned amino acids using Dayhoff’s PAM matrices
• Explain how the Needleman-Wunsch algorithm performs global pairwise alignments
5
Pairwise alignments in the 1950s
β-corticotropin (adrenocorticotropic hormone; sheep)   ala gly glu asp asp glu
Corticotropin A (pig)                                  asp gly ala glu asp glu
Oxytocin      CYIQNCPLG
Vasopressin   CYFQNCPRG
Page 46
6
Early example of sequence alignment: globins (1961)
H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190, 1961.
7
Pairwise sequence alignment is the most fundamental operation of bioinformatics
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
Page 47
8
Pairwise alignment: protein sequences can be more informative than DNA
• Protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
• Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• Protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein, and then used in pairwise alignments
9
Page 54
10
Pairwise alignment: protein sequences can be more informative than DNA
Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA
Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
11
Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
12
Definition: pairwise alignment
The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Page 53
13
Definition: homology
Homology: similarity attributed to descent from a common ancestor. Page 49
14
[Figure: structures of myoglobin (NP_005359; 2MM1) and beta globin (NP_000509; 2HHB)] Page 49
15
Definitions: two types of homology
Orthologs: homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: homologous sequences within a single species that arose by gene duplication.
Page 49
16
Orthologs: members of a gene (protein) family in various organisms
This tree shows globin orthologs. You can view these sequences at www.bioinfbook.org (document 3.1). Page 51
17
Paralogs: members of a gene (protein) family within a species
This tree shows human globin paralogs. Page 52
18
Orthologs and paralogs are often viewed in a single tree
Source: NCBI
19
General approach to pairwise alignment
• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance
20
Calculation of an alignment score
21
Find BLAST from the home page of NCBI and select protein BLAST…
22
Choose align two or more sequences…
Page 52
23
Enter the two sequences (as accession numbers or in the fasta format) and click BLAST.
Optionally select “Algorithm parameters” and note the matrix option.
24
Pairwise alignment result of human beta globin and myoglobin
• Subject: myoglobin RefSeq entry
• Information about this alignment: score, expect value, identities, positives, gaps…
• Query = HBB; Subject = MB
• The middle row displays identities, with a + sign for similar matches
Page 53
25
Pairwise alignment result of human beta globin and myoglobin: the score is a sum of match, mismatch, gap creation, and gap extension scores Page 53
26
Pairwise alignment result of human beta globin and myoglobin
The score is a sum of match, mismatch, gap creation, and gap extension scores. V matching V earns +4; T matching L earns -1. These scores come from a “scoring matrix”! Page 53
27
Definitions: homology and identity
Homology: similarity attributed to descent from a common ancestor.
Identity: the extent to which two (nucleotide or amino acid) sequences are invariant.
Page 50
28
Definition: similarity
Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: the extent to which two sequences are invariant.
Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
Page 51
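The definition of identity can be made concrete with a small sketch. It counts identical columns in the Neanderthal versus modern human DNA alignment shown earlier in this deck. The function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is an assumed convention; other tools may use a different denominator.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-"):
    """Return (identical columns, alignment length, percent identity).

    Assumes both strings are rows of the same alignment, so they must have
    equal length. A column counts as identical only if both residues match
    and neither is a gap character.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap)
    length = len(aligned_a)
    return matches, length, 100.0 * matches / length


# The two aligned rows from the Neanderthal vs modern human example.
QUERY = "catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac"
SBJCT = "catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac"

matches, length, pct = percent_identity(QUERY, SBJCT)
print(f"{matches}/{length} identical columns = {pct:.1f}% identity")
```

Similarity for proteins would additionally count conservative substitutions, which requires a scoring matrix and is not attempted here.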
29
Definition: pairwise alignment
The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Page 53
30
Mind the gaps
First gap position scores -11. Second gap position scores -1.
Gap creation tends to have a large negative score; gap extension involves a small penalty. Page 55
31
Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. Thus there are separate penalties for gap creation and gap extension.
• In BLAST, it is rarely necessary to change gap values from the default.
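The separate creation and extension penalties can be sketched as a small function. This is a minimal illustration that follows the convention stated on the previous slide (first gap position -11, each further position -1); the function names are hypothetical, and other programs may define the opening penalty separately from the first extension.

```python
def affine_gap_score(gap_length: int, first: int = -11, extend: int = -1) -> int:
    """Score a single gap of `gap_length` positions.

    Assumed convention (from the slide): the first gap position scores
    `first`, and every additional position scores `extend`.
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0
    return first + (gap_length - 1) * extend


def linear_gap_score(gap_length: int, per_position: int = -11) -> int:
    """For comparison: every gap position carries the same penalty."""
    return gap_length * per_position


for k in (1, 2, 5):
    print(k, affine_gap_score(k), linear_gap_score(k))
```

With the affine scheme a five-residue gap costs only slightly more than a one-residue gap, which reflects the idea that one mutational event can insert or delete several residues at once.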
32
Pairwise alignment of retinol-binding protein and b-lactoglobulin:
Example of an alignment with internal, terminal gaps 1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP . ||| | | | : .||||.:| : 1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP : | | | | :: | .| . || |: || |. 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV QYSC 136 RBP || || | :.|||| | | 94 IPAVFKIDALNENKVL VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin 137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP . | | | : || | || | 136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin
33
Pairwise alignment of retinol-binding protein
from human (top) and rainbow trout (O. mykiss): Example of an alignment with few gaps 1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48 :: || || || .||.||. .| :|||:.|:.| |||.||||| 1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98 |||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || | 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148 ||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| | 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147 149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199 |||:||| | || || |||| :..|:| .|| : | |:|: 148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS
34
Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Timeline, 4 to 1 BYA: origin of life; earliest fossils; origin of eukaryotes; eukaryote/archaea; fungi/animal; plant/animal; insects]
When you do a pairwise alignment of homologous human and plant proteins, you are studying sequences that last shared a common ancestor 1.5 billion years ago! Page 56
35
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases: example of extremely high conservation
fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 57
36
Multiple sequence alignment of human lipocalin paralogs:
example of extremely low conservation ~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM lipocalin 1 LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF odorant-binding protein 2a TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR progestagen-assoc. endo. VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV apolipoprotein D VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF retinol-binding protein LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF neutrophil gelatinase-ass. VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL prostaglandin D2 synthase VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW alpha-1-microglobulin PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD complement component 8 Page 57
37
Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
38
lys found at 58% of arg sites
Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey).
Black: identity
Gray: very conservative substitutions (>40% occurrence)
White: fairly conservative substitutions (>21% occurrence)
Red: no substitutions observed
Page 93
39
Page 93
40
Where we’re heading: to a PAM250 log odds scoring matrix that assigns scores…
Page 69
41
…and to a whole series of scoring matrices such as PAM10
Page 69 41
42
Dayhoff’s 34 protein superfamilies
Protein                          PAMs per 100 million years
Ig kappa chain                   37
Kappa casein                     33
luteinizing hormone β            30
lactalbumin                      …
complement component             …
epidermal growth factor          26
proopiomelanocortin              21
pancreatic ribonuclease          21
haptoglobin alpha                20
serum albumin                    19
phospholipase A2, group IB       19
prolactin                        …
carbonic anhydrase C             16
Hemoglobin α                     12
Hemoglobin β                     12
Page 59
43
Dayhoff’s 34 protein superfamilies
Protein                          PAMs per 100 million years
Ig kappa chain                   37
Kappa casein                     33
luteinizing hormone β            30
lactalbumin                      …
complement component             …
epidermal growth factor          26
proopiomelanocortin              21
pancreatic ribonuclease          21
haptoglobin alpha                20
serum albumin                    19
phospholipase A2, group IB       19
prolactin                        …
carbonic anhydrase C             16
Hemoglobin α                     12
Hemoglobin β                     12
human (NP_005203) versus mouse (NP_031812)
44
Dayhoff’s 34 protein superfamilies
Protein                          PAMs per 100 million years
apolipoprotein A-II              …
lysozyme                         …
gastrin                          …
myoglobin                        …
nerve growth factor              …
myelin basic protein             …
thyroid stimulating hormone β    7.4
parathyroid hormone              …
parvalbumin                      …
trypsin                          …
insulin                          …
calcitonin                       …
arginine vasopressin             …
adenylate kinase                 …
Page 59
45
Dayhoff’s 34 protein superfamilies
Protein                               PAMs per 100 million years
triosephosphate isomerase             …
vasoactive intestinal peptide         2.6
glyceraldehyde phosph. dehydrogenase  2.2
cytochrome c                          …
collagen                              …
troponin C, skeletal muscle           1.5
alpha crystallin B chain              1.5
glucagon                              …
glutamate dehydrogenase               0.9
histone H2B, member Q                 0.9
ubiquitin                             0
Page 59
46
Pairwise alignment of human (NP_005203)
versus mouse (NP_031812) ubiquitin
47
Dayhoff’s approach to assigning scores for any two aligned amino acid residues
Dayhoff et al. defined the score of two aligned residues i,j as 10 times the base-10 logarithm of a ratio: how likely it is to observe these two residues aligned (based on the empirical observation of how often they are aligned in nature) divided by the background probability of finding these amino acids aligned by chance. This provides a score for each pair of residues. Page 58
48
Dayhoff’s numbers of “accepted point mutations”:
what amino acid substitutions occur in proteins? Dayhoff (1978) p.346. Page 61
49
Multiple sequence alignment of
glyceraldehyde 3-phosphate dehydrogenases: columns of residues may have high or low conservation fly GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA human GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA plant GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA yeast GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA archaeon GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA fly KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST human KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST plant KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST yeast KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST archaeon KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST fly GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK human GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV plant GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA yeast GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV archaeon GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA Page 57
50
The relative mutability of amino acids
Asn  …      His  66
Ser  …      Arg  65
Asp  …      Lys  56
Glu  …      Pro  56
Ala  100    Gly  49
Thr  97     Tyr  41
Ile  96     Phe  41
Met  94     Leu  40
Gln  93     Cys  20
Val  74     Trp  18
Page 63
51
The relative mutability of amino acids
Asn  …      His  66
Ser  …      Arg  65
Asp  …      Lys  56
Glu  …      Pro  56
Ala  100    Gly  49
Thr  97     Tyr  41
Ile  96     Phe  41
Met  94     Leu  40
Gln  93     Cys  20
Val  74     Trp  18
Note that alanine is normalized to a value of 100. Trp and Cys are least mutable. Asn and Ser are most mutable. Page 63
52
Normalized frequencies of amino acids
Gly 8.9%    Arg 4.1%
Ala 8.7%    Asn 4.0%
Leu 8.5%    Phe 4.0%
Lys 8.1%    Gln 3.8%
Ser 7.0%    Ile 3.7%
Val 6.5%    His 3.4%
Thr 5.8%    Cys 3.3%
Pro 5.1%    Tyr 3.0%
Glu 5.0%    Met 1.5%
Asp 4.7%    Trp 1.0%
Color key: blue = 6 codons; red = 1 codon
These frequencies f_i sum to 1. Page 63
53
Page 64
54
Dayhoff’s mutation probability matrix for the evolutionary distance of 1 PAM
We have considered three kinds of information:
• a table of the number of accepted point mutations (PAMs)
• relative mutabilities of the amino acids
• normalized frequencies of the amino acids in PAM data
This information can be combined into a “mutation probability matrix” in which each element M_ij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM). Page 63
55
Dayhoff’s PAM1 mutation probability matrix
Original amino acid Page 66
56
Dayhoff’s PAM1 mutation probability matrix
Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side) Page 66
57
Substitution Matrix
• A substitution matrix contains values proportional to the probability that amino acid i mutates into amino acid j for all pairs of amino acids.
• Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids.
• Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution.
• The two major types of substitution matrices are PAM and BLOSUM.
58
PAM matrices: Point-accepted mutations
• PAM matrices are based on global alignments of closely related proteins.
• The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, an average of one change has occurred over a length of 100 amino acids.
• Other PAM matrices are extrapolated from PAM1. For PAM250, an average of 250 changes have occurred for two proteins over a length of 100 amino acids; a single position can change more than once.
• All the PAM data come from closely related proteins (>85% amino acid identity).
Page 63
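Extrapolating from PAM1 amounts to multiplying the mutation probability matrix by itself. The sketch below illustrates this on a made-up three-letter alphabet rather than the real 20 x 20 Dayhoff matrix: `FREQS`, `toy_pam1`, and the rule used to fill the toy matrix are all assumptions for illustration. Columns are the original residue and rows are the replacement, as in the slides.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power (repeated squaring)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical background frequencies for a toy 3-letter alphabet.
FREQS = [0.5, 0.3, 0.2]


def toy_pam1(freqs, change_per_site=0.01):
    """Build a toy matrix with about 1 expected change per 100 positions.

    Assumption: a residue that changes is replaced in proportion to the
    background frequency of the replacement. Entry [i][j] is the probability
    that the residue in column j is replaced by the residue in row i.
    """
    c = change_per_site / (1.0 - sum(f * f for f in freqs))
    n = len(freqs)
    return [[1.0 - c * (1.0 - freqs[j]) if i == j else c * freqs[i]
             for j in range(n)] for i in range(n)]


pam1 = toy_pam1(FREQS)
pam250 = mat_pow(pam1, 250)
pam2000 = mat_pow(pam1, 2000)

for name, m in (("PAM1", pam1), ("PAM250", pam250), ("PAM2000", pam2000)):
    print(name, [round(m[i][i], 4) for i in range(3)])
```

In this toy model the diagonal starts near 1 and, at very large distances, each row approaches the background frequency of its replacement residue, which mirrors the PAM2000 slide later in the deck.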
59
Dayhoff’s PAM1 mutation probability matrix
Page 66
60
Dayhoff’s PAM0 mutation probability matrix:
the rules for extremely slowly evolving proteins Top: original amino acid Side: replacement amino acid Page 68
61
Dayhoff’s PAM2000 mutation probability matrix:
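the rules for very distantly related proteins
[Matrix figure, partial: at PAM2000 each row holds a single value across all columns, equal to the normalized frequency of the replacement amino acid: A (Ala) 8.7%, R (Arg) 4.1%, N (Asn) 4.0%, D (Asp) 4.7%, C (Cys) 3.3%, Q (Gln) 3.8%, E (Glu) 5.0%, G (Gly) 8.9%]
Top: original amino acid. Side: replacement amino acid. Page 68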
62
PAM250 mutation probability matrix
Top: original amino acid Side: replacement amino acid Page 68
63
PAM250 log odds scoring matrix Page 69
64
Why do we go from a mutation probability matrix to a log odds matrix?
We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them). Page 69
65
How do we go from a mutation probability matrix to a log odds matrix?
The cells in a log odds matrix are based on an “odds ratio”: (the probability that an alignment is authentic) / (the probability that the alignment was random).
The score S for an alignment of residues a,b is given by:
$S(a,b) = 10 \log_{10}(M_{ab}/p_b)$
As an example, for tryptophan aligned with tryptophan:
$S(\mathrm{Trp},\mathrm{Trp}) = 10 \log_{10}(0.55/0.010) = 17.4$
Page 69
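The tryptophan example can be checked with a few lines of Python. This is a minimal sketch of the formula on this slide; the function name `log_odds_score` is hypothetical, and the two input values (0.55 and 0.010) are the ones quoted above.

```python
import math


def log_odds_score(m_ab: float, p_b: float) -> float:
    """Return 10 * log10(M_ab / p_b), the log odds score defined on the slide."""
    return 10.0 * math.log10(m_ab / p_b)


# Trp aligned with Trp: mutation probability 0.55, background frequency 0.010.
trp_score = log_odds_score(0.55, 0.010)
print(round(trp_score, 1))   # 17.4
print(round(trp_score))      # 17, the value stored in the integer scoring matrix
```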
66
Normalized frequencies of amino acids
Arg 4.1% Asn 4.0% Phe 4.0% Gln 3.8% Ile 3.7% His 3.4% Cys 3.3% Tyr 3.0% Met 1.5% Trp 1.0%
67
What do the numbers mean in a log odds matrix?
$S(\mathrm{Trp},\mathrm{Trp}) = 10 \log_{10}(0.55/0.010) = 17.4$
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues.
Let $S(a,b) = 17$ and let the odds ratio be $x = M_{ab}/p_b$. Then
$17 = 10 \log_{10} x$, so $1.7 = \log_{10} x$ and $x = 10^{1.7} \approx 50$.
Page 58
68
What do the numbers mean
in a log odds matrix? A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral. A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids. Page 58
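Going the other way, a score can be converted back to an odds ratio, and summing scores corresponds to multiplying odds ratios. The sketch below assumes the 10 x log10 scaling used in these slides; `odds_ratio` and the example list `column_scores` are hypothetical names and values.

```python
def odds_ratio(score: float) -> float:
    """Invert S = 10 * log10(x): return x = 10 ** (S / 10)."""
    return 10 ** (score / 10)


# Scores discussed on the slides: +17, +2, 0, and -10.
for s in (17, 2, 0, -10):
    print(s, round(odds_ratio(s), 2))

# Adding the scores of aligned columns is the same as multiplying their odds.
column_scores = [17, 2, -10]          # hypothetical three-column alignment
total_score = sum(column_scores)
product_of_odds = 1.0
for s in column_scores:
    product_of_odds *= odds_ratio(s)
print(total_score, round(odds_ratio(total_score), 3), round(product_of_odds, 3))
```

This is why the log odds form is convenient: an alignment score is a simple sum over columns.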
69
PAM250 log odds scoring matrix Page 58
70
PAM10 log odds scoring matrix Page 59
71
More conserved Less conserved Rat versus mouse globin Rat versus bacterial globin
72
two nearly identical proteins
two distantly related proteins page 72
73
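BLOSUM Matrices
BLOSUM matrices are based on local alignments. BLOSUM stands for blocks substitution matrix. BLOSUM62 is calculated from blocks in which sequences sharing at least 62% identity are clustered (collapsed) before substitutions are counted, so it reflects substitutions between sequences that are less than 62% identical. Page 70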
74
BLOSUM Matrices
[Figure: percent amino acid identity axis (100, 62, 30) with a “collapse” label]
75
BLOSUM Matrices
[Figure: three panels labeled BLOSUM80, BLOSUM62, and BLOSUM30, each with a percent amino acid identity axis (100, 62, 30) and a “collapse” label]
76
BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. Page 72
77
Blosum62 scoring matrix Page 73 77
78
Point-accepted mutations
PAM matrices: Point-accepted mutations PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids. All the PAM data come from closely related proteins (>85% amino acid identity). Page 74 78
79
Two randomly diverging protein sequences change in a negatively exponential fashion
[Plot: percent identity versus evolutionary distance in PAMs, with the “twilight zone” marked] Page 74
80
At PAM1, two proteins are 99% identical
At PAM10.7, there are 10 differences per 100 residues
At PAM80, there are 50 differences per 100 residues
At PAM250, there are 80 differences per 100 residues
[Plot labels: percent identity; “twilight zone”; differences per 100 residues]
Page 75
81
PAM: “Accepted point mutation”
Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.) Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related. PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968) Page 75 81
82
Percent identity between two proteins: What percent is significant?
100% 80% 65% 30% 23% 19%
We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values. As a rule of thumb, two proteins sharing > 30% amino acid identity over a substantial region are usually homologous.
83
Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman
84
Two kinds of sequence alignment: global and local
We will first consider the global alignment algorithm of Needleman and Wunsch (1970). We will then explore the local alignment algorithm of Smith and Waterman (1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail next time. Page 76
85
Global alignment with the algorithm of Needleman and Wunsch (1970)
• Two sequences can be compared in a matrix along x- and y-axes.
• If they are identical, a path along a diagonal can be drawn.
• Find the optimal subpaths, and add them up to achieve the best score. This involves
--adding gaps when needed
--allowing for conservative substitutions
--choosing a scoring system (simple or complicated)
• N-W is guaranteed to find optimal alignment(s)
Page 76
86
Three steps to global alignment with the Needleman-Wunsch algorithm
[1] set up a matrix
[2] score the matrix
[3] identify the optimal alignment(s)
Page 76
87
Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
Page 77
88
Page 77 88
89
Start Needleman-Wunsch with an identity matrix
Page 77 89
90
Start Needleman-Wunsch with an identity matrix
Page 77 90
91
Fill in the matrix using “dynamic programming”
Page 78 91
92
Fill in the matrix using “dynamic programming”
Page 78 92
93
Fill in the matrix using “dynamic programming”
Page 78 93
94
Fill in the matrix using “dynamic programming”
Page 78 94
95
Fill in the matrix using “dynamic programming”
Page 78 95
96
Fill in the matrix using “dynamic programming”
Page 78 96
97
Fill in the matrix using “dynamic programming”
Page 78 97
98
Traceback to find the optimal (best) pairwise alignment
Page 79 98
99
Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments. It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score. Page 80 99
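The three steps (set up a matrix, score it, trace back) can be sketched in a few dozen lines. This is an illustrative implementation, not the one used by any particular program: the scoring values (match +1, mismatch -1, gap -2) and the linear gap penalty are assumptions chosen for simplicity, and the example peptides are the oxytocin and vasopressin sequences from the 1950s slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, row_a, row_b)."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1: set up the matrix; the first row and column hold leading gaps.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: each cell is the best of a diagonal move (match or mismatch),
    # a vertical move (gap in b), or a horizontal move (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Step 3: trace back from the bottom-right corner to the top-left corner.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


nw_score, nw_top, nw_bottom = needleman_wunsch("CYIQNCPLG", "CYFQNCPRG")
print(nw_score)
print(nw_top)
print(nw_bottom)
```

When several alignments tie for the best score, this traceback reports only one of them (it prefers the diagonal move); a fuller implementation would enumerate all optimal paths.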
100
Try using needle to implement a Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps): Page 81 100
101
Queries: beta globin (NP_000509) alpha globin (NP_000549) 101
102
102
103
Global alignment versus local alignment
Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other. Local alignment finds optimally matching regions within two sequences (“subsequences”). Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences. Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough. Page 82 103
104
How the Smith-Waterman algorithm works
Set up a matrix between two proteins (size m+1, n+1).
No values in the scoring matrix can be negative! S ≥ 0
The score in each cell is the maximum of four values:
[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)
[2] s(i, j-1) – gap penalty
[3] s(i-1, j) – gap penalty
[4] zero
Page 82
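The four-way maximum can be sketched directly. This is a minimal illustration under assumed scores (match +2, mismatch -1, gap penalty 2, linear gaps); a real search would use a substitution matrix such as BLOSUM62. The two test fragments are taken from the RBP query and the lactoglobulin hit used elsewhere in this deck.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap_penalty=2):
    """Local alignment; returns (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    s = [[0] * cols for _ in range(rows)]      # first row and column stay at zero
    best, best_i, best_j = 0, 0, 0

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + pair,     # [1] match or mismatch
                          s[i][j - 1] - gap_penalty,  # [2] gap
                          s[i - 1][j] - gap_penalty,  # [3] gap
                          0)                          # [4] never go negative
            if s[i][j] > best:
                best, best_i, best_j = s[i][j], i, j

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_i, best_j
    top, bottom = [], []
    while i > 0 and j > 0 and s[i][j] > 0:
        pair = match if a[i - 1] == b[j - 1] else mismatch
        if s[i][j] == s[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif s[i][j] == s[i - 1][j] - gap_penalty:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


sw_score, sw_top, sw_bottom = smith_waterman("FSGTWYA", "VAGTWYS")
print(sw_score, sw_top, sw_bottom)
```

Unlike the global algorithm, the traceback starts at the best cell anywhere in the matrix and stops at the first zero, so only the best-matching subsequences are reported.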
105
Smith-Waterman algorithm allows the alignment of subsets of sequences
Page 83 105
106
Try using SSEARCH to perform a rigorous Smith-Waterman local alignment: 106
107
Queries: beta globin (NP_000509) alpha globin (NP_000549) 107
108
108
109
Rapid, heuristic versions of Smith-Waterman: FASTA and BLAST
Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment. But Smith-Waterman is slow. It requires computer space and time proportional to the product of the lengths of the two sequences being aligned (or the product of the query length and the total length of the database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to S-W. Page 84
110
Statistical significance of pairwise alignment
We will discuss the statistical significance of alignment scores in the next lecture (BLAST). A basic question is how to determine whether a particular alignment score is likely to have occurred by chance. According to the null hypothesis, two aligned sequences are not homologous (evolutionarily related). Can we reject the null hypothesis at a particular significance level alpha? 110
111
Pairwise alignment: key points
• Pairwise alignments allow us to describe the percent identity two sequences share, as well as the percent similarity.
• The score of a pairwise alignment includes positive values for exact matches, and other scores for mismatches and gaps.
• PAM and BLOSUM matrices provide a set of rules for assigning scores. PAM10 and BLOSUM80 are examples of matrices appropriate for the comparison of closely related sequences. PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins.
• Global and local alignments can be made.
112
BLAST
BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible.
113
Why use BLAST?
BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
114
Four components to a BLAST search
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters
Then click “BLAST”
116
Step 1: Choose your sequence
Sequence can be input in FASTA format or as accession number
117
Example of the FASTA format for a BLAST query
118
Step 2: Choose the BLAST program
119
Step 2: Choose the BLAST program
• blastn (nucleotide BLAST)
• blastp (protein BLAST)
• tblastn (translated BLAST)
• blastx (translated BLAST)
• tblastx (translated BLAST)
120
Choose the BLAST program
Program   Input     Database   Reading-frame comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
122
DNA potentially encodes six proteins
Three reading frames on the top strand:
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
Three reading frames on the bottom strand (read 5’ to 3’):
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
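The six reading frames on this slide can be reproduced with a short sketch: three codon offsets on the given strand and three on its reverse complement. The function names and the "+1" to "-3" frame labels are hypothetical conventions chosen for this example; no translation to amino acids is attempted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna: str, offset: int) -> list:
    """Split `dna` into complete codons, starting at `offset` (0, 1, or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of codons."""
    opposite = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(dna, offset)
        frames[f"-{offset + 1}"] = codons(opposite, offset)
    return frames


DNA = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
frames = six_frames(DNA)
for label in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(label, " ".join(frames[label][:2]))
```

Translated searches such as blastx and tblastn compare against all six frames, which is where the factor of 6 (or 36 for tblastx) comes from.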
123
Step 3: choose the database
nr = non-redundant (most general database)
dbest = database of expressed sequence tags
dbsts = database of sequence tag sites
gss = genomic survey sequences
htgs = high throughput genomic sequence
124
Step 4a: Select optional search parameters
organism Entrez! algorithm
125
Step 4a: optional blastp search parameters
Expect Word size Scoring matrix Filter, mask
126
Step 4a: optional blastn search parameters
Expect Word size Match/mismatch scores Filter, mask
127
BLAST: optional parameters
You can...
• choose the organism to search
• turn filtering on/off
• change the substitution matrix
• change the expect (e) value
• change the word size
• change the output format
128
(a) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Default settings: Unfiltered (“composition-based statistics”) Our starting point: search human insulin against worm RefSeq proteins by blastp using default parameters
129
(b) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: No compositional adjustment Note that the bit score, Expect value, and percent identity all change with the “no compositional adjustment” option
130
(c) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: conditional compositional score matrix adjustment Note that the bit score, Expect value, and percent identity all change with the compositional score matrix adjustment
131
(d) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: Filter low complexity regions Note that the bit score, Expect value, and percent identity all change with the filter option
132
(the filtered sequence is the query in lowercase and grayed out)
(e) Query: human insulin NP_000198 Program: blastp Database: C. elegans RefSeq Option: Mask for lookup table only Filtering (the filtered sequence is the query in lowercase and grayed out)
133
(e) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: Mask for lookup table only Note that the bit score, Expect value, and percent identity could change with the “mask for lookup table only” option
134
Step 4b: optional formatting parameters
Alignment view Descriptions Alignments
135
BLAST format options
136
BLAST search output: multiple alignment format
138
BLAST search output: top portion
database query program taxonomy
139
taxonomy
140
BLAST search output: graphical output
141
BLAST search output: tabular output
High scores, low E values. Cut-off: 0.05? 10^-10?
142
BLAST search output: alignment output
143
Outline of today’s lecture
• BLAST
--Practical use
--Algorithm
--Strategies
• Finding distantly related proteins
--PSI-BLAST
--Hidden Markov models
• BLAST-like tools for genomic DNA
--PatternHunter
--Megablast
--BLAT, BLASTZ
144
BLAST: background on sequence alignment
There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).
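The dynamic programming idea behind global alignment can be sketched as follows. This is a simplified illustration that returns only the optimal score; the match, mismatch, and gap values are arbitrary assumptions (a real search would use a substitution matrix such as PAM or BLOSUM), and the example sequences are the oxytocin and vasopressin peptides shown earlier in the lecture.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    # The bottom-right cell covers the full length of both sequences.
    return score[-1][-1]

print(needleman_wunsch_score("CYIQNCPLG", "CYFQNCPRG"))
```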
145
BLAST: background on sequence alignment
[2] The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. BLAST is a heuristic approximation to local alignment. It examines only part of the search space.
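For contrast with global alignment, here is a minimal sketch of the local alignment recurrence. It differs from the global version in two ways: cell values are never allowed to fall below zero, and the answer is the best cell anywhere in the matrix. The scoring values and the toy sequences are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere (this makes it local).
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

# Only the shared segment GTWY contributes to the score.
print(smith_waterman_score("AAAGTWYCCC", "PPGTWYDD"))
```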
146
How a BLAST search works
“The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)
147
How the original BLAST algorithm works: three phases
Phase 1: compile a list of word pairs (w=3) above threshold T. Example: for a human RBP query …FSGTWYA…, a list of words (w=3) is:
FSG SGT GTW TWY WYA (words taken directly from the query)
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
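The first step of Phase 1, cutting the query into overlapping words, can be sketched in a few lines. The function name `query_words` is an assumption for this example.

```python
def query_words(query, w=3):
    """Return every overlapping word of length w in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

print(query_words("FSGTWYA"))
```

For the seven-residue fragment FSGTWYA this yields the five words FSG, SGT, GTW, TWY, and WYA.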
148
Phase 1: compile a list of words (w=3)
Neighborhood word hits above threshold (word, per-residue scores, total):
GTW 6,5,11 22
GSW 6,1,11 18
ATW 0,5,11 16
NTW 0,5,11 16
GTY 6,5,2 13
Neighborhood words below threshold (T=11):
GNW 10
GAW 9
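A small sketch of the thresholding step, using only the numbers shown on the slide. The per-residue scores are copied from the slide rather than looked up in a matrix, and words listed with a total only are entered as totals; the variable names are assumptions.

```python
T = 11  # threshold from the slide

# Per-residue scores against the query word GTW, as listed on the slide.
per_residue = {
    "GTW": (6, 5, 11),
    "GSW": (6, 1, 11),
    "ATW": (0, 5, 11),
    "NTW": (0, 5, 11),
    "GTY": (6, 5, 2),
}

# A word's score is the sum of its per-residue scores.
totals = {word: sum(scores) for word, scores in per_residue.items()}
# The slide gives only totals for these two words.
totals.update({"GNW": 10, "GAW": 9})

# Keep words scoring at least T ("a score of at least T").
neighborhood = [word for word, score in totals.items() if score >= T]
print(neighborhood)
```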
149
Pairwise alignment scores are determined using a scoring matrix such as BLOSUM62
150
How a BLAST search works: 3 phases
Phase 2: scan the database for entries that match the compiled list. This is fast and relatively easy.
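The scanning phase can be sketched as a set lookup over every word-length window of each database sequence. This is an illustration only (the function name and the one-entry "database" are assumptions); the word list and the lactoglobulin fragment are taken from the neighboring slides.

```python
def find_word_hits(word_list, database, w=3):
    """Return (entry name, position, word) for every database window in the word list."""
    words = set(word_list)  # set membership makes each lookup fast
    hits = []
    for name, seq in database.items():
        for i in range(len(seq) - w + 1):
            if seq[i:i + w] in words:
                hits.append((name, i, seq[i:i + w]))
    return hits

word_list = ["FSG", "SGT", "GTW", "TWY", "WYA", "YSG", "TGT", "ATW",
             "SWY", "WFA", "FTG", "SVT", "GSW", "TWF", "WYS"]
database = {"lactoglobulin": "MKGLDIQKVAGTWYSLAMAASD"}

for hit in find_word_hits(word_list, database):
    print(hit)
```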
151
How a BLAST search works: 3 phases
Phase 3: when you find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction. Keep track of the score (using a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
[Figure labels: “Hit!”, with arrows marking extension to the left and to the right]
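A minimal sketch of ungapped hit extension with a drop-off rule. It makes two loud simplifications: scoring is a plain match/mismatch scheme (+5/-4) standing in for a real scoring matrix, and the drop-off value `x_drop` is an arbitrary choice. The function names are assumptions; the sequences are the RBP and lactoglobulin fragments from the slide.

```python
def simple_score(a, b, match=5, mismatch=-4):
    """Stand-in for a scoring matrix lookup (an assumption for this sketch)."""
    return match if a == b else mismatch

def extend_one_way(query, subject, qi, si, step, x_drop):
    """Walk from (qi, si) in direction step; return (best score gain, residues kept)."""
    score = best = 0
    kept = n = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        score += simple_score(query[qi], subject[si])
        n += 1
        if score > best:
            best, kept = score, n
        elif best - score > x_drop:
            break  # score has fallen too far below the best seen so far
        qi += step
        si += step
    return best, kept

def extend_hit(query, subject, q_start, s_start, w=3, x_drop=10):
    """Extend a word hit of length w in both directions without gaps."""
    seed = sum(simple_score(query[q_start + k], subject[s_start + k]) for k in range(w))
    right, r_len = extend_one_way(query, subject, q_start + w, s_start + w, +1, x_drop)
    left, l_len = extend_one_way(query, subject, q_start - 1, s_start - 1, -1, x_drop)
    q_seg = query[q_start - l_len:q_start + w + r_len]
    s_seg = subject[s_start - l_len:s_start + w + r_len]
    return seed + left + right, q_seg, s_seg

query = "KENFDKARFSGTWYAMAKKDPEG"
subject = "MKGLDIQKVAGTWYSLAMAASD"
# The word GTW starts at index 10 in both fragments.
print(extend_hit(query, subject, 10, 10))
```

With a real substitution matrix, conservative replacements would earn partial credit and the extension would typically reach further than it does under this identity-only scoring.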
152
How a BLAST search works: 3 phases
In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly speeding the time required for a search.
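The two-hit requirement can be sketched as a grouping of word hits by diagonal. The details here are assumptions layered on the slide's description: that "close proximity" means two non-overlapping hits on the same diagonal (subject position minus query position), and that the window is 40 residues. The function name is hypothetical.

```python
from collections import defaultdict

def two_hit_seeds(hits, window=40, w=3):
    """hits: iterable of (query_pos, subject_pos) word hits.
    Return (diagonal, first_query_pos, second_query_pos) for neighboring
    hits on the same diagonal that do not overlap and lie within window."""
    by_diagonal = defaultdict(list)
    for q, s in hits:
        by_diagonal[s - q].append(q)
    seeds = []
    for diag, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if w <= second - first <= window:
                seeds.append((diag, first, second))
    return seeds

# Two hits share diagonal 0 and are 15 residues apart; the third is isolated.
print(two_hit_seeds([(10, 10), (25, 25), (5, 50)]))
```

Only seeds returned by this step would go on to the extension phase, which is where the reduction in the number of extensions comes from.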
153
How a BLAST search works: threshold
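You can modify the threshold parameter. The default value for blastp is 11. To change it in the command-line BLAST+ programs, use the -threshold option (for example “-threshold 16” or “-threshold 5”); the “-f” flag belongs to the older blastall program. (To find BLAST+, go to BLAST help, then download.)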
156
For blastn, the word size is typically 7, 11, or 15 (EXACT match)
Changing the word size is like changing the threshold for proteins. w=15 gives fewer matches and is faster than w=11 or w=7. For megablast (see below), the word size is 28 and can be adjusted to 64. What will this do? Megablast is VERY fast for finding closely related DNA sequences!
157
How to interpret a BLAST search: expect value
It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution.
158
Normal probability distribution
[Figure: probability plotted against x for x from -5 to 5; probability axis from 0 to 0.40]
159
The probability density function of the extreme value distribution (characteristic value u=0 and decay constant λ=1)
[Figure: normal distribution and extreme value distribution plotted together; x from -5 to 5, probability axis from 0 to 0.40]
160
How to interpret a BLAST search: expect value
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: $E = K m n \, e^{-\lambda S}$
161
$E = K m n \, e^{-\lambda S}$
This equation is derived from a description of the extreme value distribution.
S = the score
E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
m, n = the lengths of the two sequences
$\lambda$, $K$ = Karlin-Altschul statistics
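The equation is easy to evaluate directly. In this sketch the values K = 0.041 and λ = 0.267 are illustrative assumptions (of the kind reported for BLOSUM62 with gap penalties), and the lengths m and n are hypothetical; only the functional form comes from the slide.

```python
import math

def expect_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

K, LAM = 0.041, 0.267      # illustrative Karlin-Altschul parameters (assumed)
m, n = 110, 10_000_000     # hypothetical query length and database length

for s in (30, 60, 90):
    print(s, expect_value(s, m, n, K, LAM))
```

The printed values fall off exponentially as S grows, and doubling n doubles E for a fixed score.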
162
Some properties of the equation $E = K m n \, e^{-\lambda S}$
The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The expected score for aligning a random pair of residues must be negative! Otherwise, long random alignments would acquire great scores. Parameter K is a scaling factor for the search space (database). For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly.
163
From raw scores to bit scores
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores). Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes.
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
The E value corresponding to a given bit score is: $E = m n \, 2^{-S'}$
Bit scores allow you to compare results between different database searches, even using different scoring matrices.
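A short numerical check that the two E value formulas agree: converting a raw score to bits and applying $E = mn\,2^{-S'}$ gives the same number as $E = Kmn\,e^{-\lambda S}$. The parameter values and lengths are illustrative assumptions, as are the function names.

```python
import math

def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)

K, LAM = 0.041, 0.267      # illustrative values (assumed)
m, n = 110, 10_000_000     # hypothetical lengths
S = 75                     # hypothetical raw score

bits = bit_score(S, K, LAM)
print(bits, expect_from_bits(bits, m, n), K * m * n * math.exp(-LAM * S))
```

The agreement follows from $2^{-S'} = e^{-(\lambda S - \ln K)} = K e^{-\lambda S}$.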
164
How to interpret BLAST: E values and p values
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment. $p = 1 - e^{-E}$
165
How to interpret BLAST: E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than the corresponding p values.
[Table: E values and corresponding p values; surviving annotations mark p values of about 0.1, about 0.05, and about 0.001]
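The relationship $p = 1 - e^{-E}$ can be tabulated with a few lines of code. The list of E values below is chosen for illustration; `math.expm1` is used so that very small E values do not lose precision.

```python
import math

def p_from_e(e_value):
    """p = 1 - exp(-E), computed in a numerically stable way."""
    return -math.expm1(-e_value)

for e in (10, 5, 2, 1, 0.1, 0.05, 0.001):
    print(f"E = {e:<6} p = {p_from_e(e):.8f}")
```

For small E the two numbers nearly coincide, while for E between 1 and 10 the p values crowd toward 1 and become hard to tell apart.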
166
How to interpret BLAST: overview
167
[Annotated BLAST search summary] word size w = 3; 10 is the E value; gap penalties; BLOSUM matrix; threshold score = 11; length of database
168
[Annotated BLAST search statistics] EVD parameters; 147 – 111 = 36; m, n, and mn. Effective search space = mn = length of query × length of database
169
Why set the E value to 20,000? Suppose you perform a search with a short query (e.g. 9 amino acids). There are not enough residues to accumulate a big score (or a small E value). Indeed, a match of 9 out of 9 residues could yield a small score with an E value of 100 or 200. And yet, this result could be “real” and of interest to you. By setting the E value cutoff to 20,000 you do not change the way the search was done, but you do change which results are reported to you.
171
BLAST search strategies
General concepts How to evaluate the significance of your results How to handle too many results How to handle too few results BLAST searching with HIV-1 pol, a multidomain protein
172
Sometimes a real match has an E value > 1
…try a reciprocal BLAST to confirm
173
Sometimes a similar E value occurs for a short exact match and a long, less exact match
[Figure: two alignments with similar E values; one is short and nearly exact, the other is long with only 31% identity]
174
Assessing whether proteins are homologous
RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins.
175
The universe of lipocalins (each dot is a protein)
retinol-binding protein odorant-binding protein apolipoprotein D
176
BLAST search with PAEP as a query finds many other lipocalins
177
Using human beta globin as a query, here are the blastp results searching against human RefSeq proteins (PAM30 matrix). Where is myoglobin? It’s absent! We need to use PSI-BLAST.
179
Two problems standard BLAST cannot solve
[1] Use human beta globin as a query against human RefSeq proteins, and blastp does not “find” human myoglobin. This is because the two proteins are too distantly related. PSI-BLAST at NCBI as well as hidden Markov models easily solve this problem. [2] How can we search using 10,000 base pairs as a query, or even millions of base pairs? Many BLAST-like tools for genomic DNA are available such as PatternHunter, Megablast, BLAT, and BLASTZ.
180
Specialized BLAST servers
Organism-specific BLAST sites Molecule-specific BLAST sites Specialized algorithms (WU-BLAST 2.0)
182
Ensembl BLAST output includes an ideogram
185
Position-specific iterated BLAST: PSI-BLAST
The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query.
186
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database
187
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM)
188
R,I,K C D,E,T K,R,T N,L,Y,G
189
[Figure: excerpt of a PSSM. Columns: the 20 amino acids A R N D C Q E G H I L K M F P S T W Y V. Rows: query positions 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A.]
191
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) [3] The PSSM is used as a query against the database [4] PSI-BLAST estimates statistical significance (E values)
193
PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database [2] PSI-BLAST constructs a multiple sequence alignment then creates a “profile” or specialized position-specific scoring matrix (PSSM) [3] The PSSM is used as a query against the database [4] PSI-BLAST estimates statistical significance (E values) [5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.
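Steps [2] and [3] can be illustrated with a toy position-specific scoring matrix. This sketch is a deliberate simplification and not PSI-BLAST's actual procedure: it uses a fixed pseudocount, a uniform background frequency of 1/20, and no sequence weighting. The four-row alignment and the function names are invented for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log2 odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (simplifying assumption)
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score_against_pssm(pssm, segment):
    """Sum the column-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Hypothetical alignment of hits from a first search iteration.
alignment = ["GTWYA", "GTWYS", "GSWFA", "ATWYA"]
pssm = build_pssm(alignment)

print(score_against_pssm(pssm, "GTWYA"))   # resembles the family
print(score_against_pssm(pssm, "PPPPP"))   # does not
```

Unlike a single PAM or BLOSUM matrix, the score for a given residue here depends on the column: tryptophan scores well at the third position because every aligned sequence has it there.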
194
Results of a PSI-BLAST search
[Table: for each iteration, the number of hits and the number of hits above threshold]
195
PSI-BLAST search: human RBP versus RefSeq, iteration 1
196
PSI-BLAST search: human RBP versus RefSeq, iteration 2
197
PSI-BLAST search: human RBP versus RefSeq, iteration 3
198
RBP4 match to ApoD, PSI-BLAST iteration 1
199
RBP4 match to ApoD, PSI-BLAST iteration 2
200
RBP4 match to ApoD, PSI-BLAST iteration 3
201
The universe of lipocalins (each dot is a protein)
retinol-binding protein odorant-binding protein apolipoprotein D
202
Scoring matrices let you focus on the big (or small) picture
retinol-binding protein Fig. 5.7 Page 151 your RBP query
203
Scoring matrices let you focus on the big (or small) picture
PAM250 PAM30 retinol-binding protein retinol-binding protein Blosum80 Blosum45
204
PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM
retinol-binding protein retinol-binding protein
205
PSI-BLAST: performance assessment
Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity.
206
PSI-BLAST: the problem of corruption
PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins. The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous. Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.
207
PSI-BLAST: the problem of corruption
Corruption is defined as the presence of at least one false positive alignment with an E value < $10^{-4}$ after five iterations. Three approaches to stopping corruption: [1] Apply filtering of biased composition regions. [2] Adjust the E value threshold from its default to a lower value. [3] Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box.
208
Conserved domain database (CDD) uses RPS-BLAST
Main idea: you can search a query protein against a database of position-specific scoring matrices
210
Multiple sequence alignment to profile HMMs
• In the 1990s people began to see that aligning sequences to profiles gave much more information than pairwise alignment alone. • Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position, arranged as a column of a multiple sequence alignment. • HMMs are probabilistic models. • Like a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments.
217
HMMER: build a hidden Markov model
Determining effective sequence number done. [4]
Weighting sequences heuristically done.
Constructing model architecture done.
Converting counts to probabilities done.
Setting model name, etc done. [x]
Constructed a profile HMM (length 230)
Average score: bits
Minimum score: bits
Maximum score: bits
Std. deviation: bits
218
HMMER: calibrate a hidden Markov model
HMM file: lipocalins.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples:
random seed:
histogram(s) saved to: [not saved]
POSIX threads:
HMM : x
mu :
lambda :
max :
219
HMMER: search an HMM against GenBank
Scores for complete sequences (score includes all domains):
Sequence Description Score E-value N
gi| |ref|XP_ | (XM_129259) ret e
gi|132407|sp|P04916|RETB_RAT Plasma retinol e
gi| |ref|XP_ | (XM_005907) sim e
gi| |ref|NP_ | (NM_006744) ret e
gi| |sp|P02753|RETB_HUMAN Plasma retinol e
.
gi| |ref|NP_ | (NC_003197) out e
gi| |ref|NP_ |: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131
*->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
mkwV++LLLLaA + +aAErd Crv+s frVkeNFD+
gi| MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK 33
erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
+r++GtWY++aKkDp E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
gi| ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL 80
eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
+N+++cAD+vGT+t++E dPak+k+Ky+GvaSflq+G+dd+
gi| NNWDVCADMVGTFTDTE DPAKFKMKYWGVASFLQKGNDDH 120
Fig. 5.13 Page 159
220
PFAM is a database of HMMs and an essential resource for protein families
222
BLAST-related tools for genomic DNA
The analysis of genomic DNA presents special challenges: There are exons (protein-coding sequence) and introns (intervening sequences). There may be sequencing errors or polymorphisms. The comparison may be between related species (e.g. human and mouse).
223
BLAST-related tools for genomic DNA
Recently developed tools include: MegaBLAST at NCBI. BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror image of the BLAST strategy. SSAHA at Ensembl uses a strategy similar to BLAT's.
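The "mirror image" strategy can be sketched as an index built from the database rather than from the query. This is a toy illustration, not BLAT itself: indexing non-overlapping 11-mers is an assumption about the design that the slide does not spell out, and the sequences and function names are made up.

```python
from collections import defaultdict

def build_index(database, k=11):
    """Index non-overlapping k-mers of the database: word -> list of positions."""
    index = defaultdict(list)
    for pos in range(0, len(database) - k + 1, k):
        index[database[pos:pos + k]].append(pos)
    return index

def find_hits(query, index, k=11):
    """Look up every overlapping k-mer of the query in the database index."""
    return [(q, d)
            for q in range(len(query) - k + 1)
            for d in index.get(query[q:q + k], [])]

# Hypothetical 33-base "genome" made of three 11-mers.
database = "ACGTACGTTAG" + "GGCATTCAGGA" + "T" * 11
index = build_index(database)

query = "CC" + "GGCATTCAGGA" + "AA"
print(find_hits(query, index))   # (query position, database position) pairs
```

Because the index is built once and reused, each new query costs only a pass over its own words, which is what makes this approach attractive for whole-genome databases.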
224
PatternHunter
225
MegaBLAST at NCBI
226
MegaBLAST
227
To access BLAT, visit http://genome.ucsc.edu
“BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates.” --BLAT website
228
Paste DNA or protein sequence
here in the FASTA format
229
BLAT output includes browser and other formats
230
Blastz
231
Blastz (laj software): human versus rhesus duplication
232
Blastz (laj software): human versus rhesus gap
233
BLAT
234
BLAT
235
LAGAN
236
SSAHA
238
Where we are in the course
--We started with “bioinformatics databases”
--We next covered pairwise alignment, then BLAST, in which one sequence is compared to a database
--Next we’ll describe multiple sequence alignment
--We’ll then visualize multiple sequence alignments as phylogenetic trees. That topic spans molecular evolution.
239
Lab exercises Self-Test Quiz