Bioinformatics sequence alignment Blast

From Pevsner, Jonathan-<Bioinformatics and Functional Genomics>-Wiley-Blackwell (2015)

Outline: pairwise alignment
• Overview and examples
• Definitions: homologs, paralogs, orthologs
• Assigning scores to aligned amino acids: Dayhoff’s PAM matrices
• Alignment algorithms: Needleman-Wunsch, Smith-Waterman

Outline, part 2
• BLAST: practical use, algorithm, strategies
• Finding distantly related proteins: PSI-BLAST, hidden Markov models (Bao’s lecture)
• BLAST-like tools for genomic DNA: PatternHunter, Megablast, BLAT, BLASTZ

Learning objectives
• Define homologs, paralogs, orthologs
• Perform pairwise alignments (NCBI BLAST)
• Understand how scores are assigned to aligned amino acids using Dayhoff’s PAM matrices
• Explain how the Needleman-Wunsch algorithm performs global pairwise alignments

Pairwise alignments in the 1950s
β-corticotropin (adrenocorticotropic hormone; sheep): ala gly glu asp asp glu
Corticotropin A (pig): asp gly ala glu asp glu
Oxytocin: CYIQNCPLG
Vasopressin: CYFQNCPRG
Page 46

Early example of sequence alignment: globins (1961)
H.C. Watson and J.C. Kendrew, “Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin.” Nature 190: , 1961.

Pairwise sequence alignment is the most fundamental operation of bioinformatics
• It is used to decide if two proteins (or genes) are related structurally or functionally
• It is used to identify domains or motifs that are shared between proteins
• It is the basis of BLAST searching
• It is used in the analysis of genomes
Page 47

Pairwise alignment: protein sequences can be more informative than DNA
• Protein is more informative (20 vs 4 characters); many amino acids share related biophysical properties
• Codons are degenerate: changes in the third position often do not alter the amino acid that is specified
• Protein sequences offer a longer “look-back” time
• DNA sequences can be translated into protein, and then used in pairwise alignments

Pairwise alignment: protein sequences can be more informative than DNA
Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA

Query: 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
           |||||||| |||| |||||| ||||| | |||||||||||||||||||||||||||||||
Sbjct: 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247


Definition: pairwise alignment
The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. Page 53

Definition: homology Homology
Similarity attributed to descent from a common ancestor. Page 49

myoglobin (NP_005359) 2MM1 Beta globin (NP_000509) 2HHB Page 49

Definitions: two types of homology
Orthologs Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function. Paralogs Homologous sequences within a single species that arose by gene duplication. Page 49

Orthologs: members of a gene (protein) family in various organisms
This tree shows globin orthologs. You can view these sequences at www.bioinfbook.org (document 3.1). Page 51

Paralogs: members of a gene (protein) family within a
species. This tree shows human globin paralogs. Page 52

Orthologs and paralogs are often viewed in a single tree
Source: NCBI

General approach to pairwise alignment
• Choose two sequences
• Select an algorithm that generates a score
• Allow gaps (insertions, deletions)
• Score reflects degree of similarity
• Alignments can be global or local
• Estimate probability that the alignment occurred by chance

Calculation of an alignment score
Source:

Find BLAST from the home page of NCBI and select protein BLAST…

Choose align two or more sequences…
Page 52

Enter the two sequences (as accession numbers or in the fasta format) and click BLAST.
Optionally select “Algorithm parameters” and note the matrix option.

Pairwise alignment result of human beta globin and myoglobin
Query = HBB (human beta globin); Subject = MB (myoglobin RefSeq)
Information about this alignment: score, expect value, identities, positives, gaps…
The middle row displays identities; a + sign marks similar matches. Page 53

Pairwise alignment result of human beta globin and myoglobin: the score is a sum of match, mismatch, gap creation, and gap extension scores
V matching V earns +4; T matching L earns -1. These scores come from a “scoring matrix”! Page 53

Definitions: homology and identity
Homology: similarity attributed to descent from a common ancestor.
Identity: the extent to which two (nucleotide or amino acid) sequences are invariant. Page 50

Definition: similarity
Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: the extent to which two sequences are invariant.
Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue. Page 51


Mind the gaps
First gap position scores -11; second gap position scores -1. Gap creation tends to have a large negative score; gap extension involves a small penalty. Page 55

Gaps
• Positions at which a letter is paired with a null are called gaps.
• Gap scores are typically negative.
• Since a single mutational event may cause the insertion or deletion of more than one residue, the presence of a gap is ascribed more significance than the length of the gap. Thus there are separate penalties for gap creation and gap extension.
• In BLAST, it is rarely necessary to change gap values from the default.
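The scoring rules on the preceding slides (pair scores from a matrix, -11 for the first gap position, -1 for each further position) can be sketched as a small function. This is a minimal illustration, not how BLAST itself is implemented: only the V/V = +4 and T/L = -1 pair scores come from the slides; the fallback match and mismatch values and all function names are assumptions made for the example.

```python
# Pair scores quoted on the slides; every other pair falls back to toy defaults.
PAIR_SCORES = {("V", "V"): 4, ("T", "L"): -1}
GAP_OPEN = -11    # score of the first position of a gap (slide value)
GAP_EXTEND = -1   # score of each further position in the same gap (slide value)


def pair_score(a, b, default_match=4, default_mismatch=-1):
    """Look up a residue pair in either order; defaults are illustrative assumptions."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    if (b, a) in PAIR_SCORES:
        return PAIR_SCORES[(b, a)]
    return default_match if a == b else default_mismatch


def alignment_score(top, bottom, gap_char="-"):
    """Sum pair scores and gap scores over two equal-length gapped strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    in_gap = None  # which row ("top" or "bottom") currently holds an open gap
    for a, b in zip(top, bottom):
        if a == gap_char and b == gap_char:
            raise ValueError("a column cannot contain two gaps")
        if a == gap_char or b == gap_char:
            row = "top" if a == gap_char else "bottom"
            total += GAP_EXTEND if in_gap == row else GAP_OPEN
            in_gap = row
        else:
            total += pair_score(a, b)
            in_gap = None
    return total


print(alignment_score("VT", "VL"))    # V/V plus T/L
print(alignment_score("VTT", "V--"))  # one match, then a gap of length two
```

A real scorer would load a full matrix such as BLOSUM62 rather than the two-entry table used here.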

Pairwise alignment of retinol-binding protein and β-lactoglobulin:
Example of an alignment with internal, terminal gaps

1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| | | | : .||||.:| :
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin

51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | | | | :: | .| . || |: || |.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin

98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV QYSC 136 RBP
|| || | :.|||| | |
94 IPAVFKIDALNENKVL VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. | | | : || | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI lactoglobulin

Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss):
Example of an alignment with few gaps

1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
:: || || || .||.||. .| :|||:.|:.| |||.|||||
1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47

49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
|||| ||:||:|||||.|.|.||| ||| :||||:.||.| ||| || |
48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97

99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
||||||:||| ||:|| ||||||::||||| ||: |||| ..||||| |
98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
|||:||| | || || |||| :..|:| .|| : | |:|:
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS

Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Timeline, 4 to 1 BYA: origin of life; earliest fossils; origin of eukaryotes; eukaryote/archaea; fungi/animal; plant/animal; insects]
When you do a pairwise alignment of homologous human and plant proteins, you are studying sequences that last shared a common ancestor 1.5 billion years ago! Page 56

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases: example of extremely high conservation

fly       GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human     GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant     GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast     GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon  GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly       KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human     KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant     KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast     KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon  KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly       GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human     GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant     GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast     GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon  GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA

Page 57

Multiple sequence alignment of human lipocalin paralogs: example of extremely low conservation

~~~~~EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM  lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF  odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR  progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV  apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF  retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF  neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL  prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW  alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD     complement component 8

Page 57

Substitution frequencies in globins
Emile Zuckerkandl and Linus Pauling (1965) considered substitution frequencies in 18 globins (myoglobins and hemoglobins from human to lamprey). For example, lys was found at 58% of arg sites.
Black: identity
Gray: very conservative substitutions (>40% occurrence)
White: fairly conservative substitutions (>21% occurrence)
Red: no substitutions observed
Page 93

Where we’re heading: to a PAM250 log odds scoring matrix that assigns scores…
Page 69

…and to a whole series of scoring matrices such as PAM10
Page 69

Dayhoff’s 34 protein superfamilies

Protein                                 PAMs per 100 million years
Ig kappa chain                          37
Kappa casein                            33
luteinizing hormone β                   30
lactalbumin                             -
complement component                    -
epidermal growth factor                 26
proopiomelanocortin                     21
pancreatic ribonuclease                 21
haptoglobin alpha                       20
serum albumin                           19
phospholipase A2, group IB              19
prolactin                               -
carbonic anhydrase C                    16
Hemoglobin α                            12
Hemoglobin β                            12
apolipoprotein A-II                     -
lysozyme                                -
gastrin                                 -
myoglobin                               -
nerve growth factor                     -
myelin basic protein                    -
thyroid stimulating hormone β           7.4
parathyroid hormone                     -
parvalbumin                             -
trypsin                                 -
insulin                                 -
calcitonin                              -
arginine vasopressin                    -
adenylate kinase                        -
triosephosphate isomerase               -
vasoactive intestinal peptide           2.6
glyceraldehyde phosph. dehydrogenase    2.2
cytochrome c                            -
collagen                                -
troponin C, skeletal muscle             1.5
alpha crystallin B chain                1.5
glucagon                                -
glutamate dehydrogenase                 0.9
histone H2B, member Q                   0.9
ubiquitin                               0

(A dash marks a value that did not survive on the slide.) Page 59

Pairwise alignment of human (NP_005203)
versus mouse (NP_031812) ubiquitin

Dayhoff’s approach to assigning scores for any two aligned amino acid residues
Dayhoff et al. defined the score of two aligned residues i,j as 10 times the log of how likely it is to observe these two residues (based on the empirical observation of how often they are aligned in nature) divided by the background probability of finding these amino acids by chance. This provides a score for each pair of residues. Page 58

Dayhoff’s numbers of “accepted point mutations”:
what amino acid substitutions occur in proteins? Dayhoff (1978) p.346. Page 61

Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases: columns of residues may have high or low conservation
(Same fly, human, plant, bacterium, yeast, and archaeon alignment as shown earlier.) Page 57

The relative mutability of amino acids

Asn  -      His  66
Ser  -      Arg  65
Asp  -      Lys  56
Glu  -      Pro  56
Ala  100    Gly  49
Thr  97     Tyr  41
Ile  96     Phe  41
Met  94     Leu  40
Gln  93     Cys  20
Val  74     Trp  18

Note that alanine is normalized to a value of 100. Trp and Cys are least mutable. Asn and Ser are most mutable. (A dash marks a value that did not survive on the slide.) Page 63

Normalized frequencies of amino acids
Gly  8.9%    Arg  4.1%
Ala  8.7%    Asn  4.0%
Leu  8.5%    Phe  4.0%
Lys  8.1%    Gln  3.8%
Ser  7.0%    Ile  3.7%
Val  6.5%    His  3.4%
Thr  5.8%    Cys  3.3%
Pro  5.1%    Tyr  3.0%
Glu  5.0%    Met  1.5%
Asp  4.7%    Trp  1.0%

Color key on the slide: blue = 6 codons; red = 1 codon. These frequencies f_i sum to 1. Page 63

Dayhoff’s mutation probability matrix for the evolutionary distance of 1 PAM
We have considered three kinds of information:
• a table of the number of accepted point mutations (PAMs)
• relative mutabilities of the amino acids
• normalized frequencies of the amino acids in PAM data
This information can be combined into a “mutation probability matrix” in which each element Mij gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval (e.g. 1 PAM). Page 63

Dayhoff’s PAM1 mutation probability matrix
Original amino acid Page 66

Each element of the matrix shows the probability that an original amino acid (top) will be replaced by another amino acid (side) Page 66

Substitution Matrix A substitution matrix contains values proportional
to the probability that amino acid i mutates into amino acid j for all pairs of amino acids. Substitution matrices are constructed by assembling a large and diverse sample of verified pairwise alignments (or multiple sequence alignments) of amino acids. Substitution matrices should reflect the true probabilities of mutations occurring through a period of evolution. The two major types of substitution matrices are PAM and BLOSUM.

PAM matrices: Point-accepted mutations
PAM matrices are based on global alignments of closely related proteins. The PAM1 is the matrix calculated from comparisons of sequences with no more than 1% divergence. At an evolutionary interval of PAM1, one change has occurred over a length of 100 amino acids. Other PAM matrices are extrapolated from PAM1. For PAM250, 250 changes have occurred for two proteins over a length of 100 amino acids. All the PAM data come from closely related proteins (>85% amino acid identity). Page 63
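The extrapolation from PAM1 to higher PAM distances is done by multiplying the PAM1 mutation probability matrix by itself (PAM250 is PAM1 raised to the 250th power). The sketch below shows the idea on a made-up three-letter alphabet; the matrix values are hypothetical and chosen only so that each column sums to 1, and the helper names are assumptions for the example.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical 3-letter "PAM1": entry [i][j] is the probability that the
# residue in column j is replaced by the residue in row i. Columns sum to 1.
TOY_PAM1 = [
    [0.990, 0.004, 0.002],
    [0.006, 0.994, 0.001],
    [0.004, 0.002, 0.997],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
toy_pam2000 = mat_pow(TOY_PAM1, 2000)

for name, matrix in (("PAM1", TOY_PAM1), ("PAM250", toy_pam250), ("PAM2000", toy_pam2000)):
    print(name)
    for row in matrix:
        print("  ", " ".join(f"{value:.4f}" for value in row))
```

At very large distances the columns become nearly identical: the original residue no longer matters, and each row settles at a background frequency for its replacement residue.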


Dayhoff’s PAM0 mutation probability matrix:
the rules for extremely slowly evolving proteins Top: original amino acid Side: replacement amino acid Page 68

Dayhoff’s PAM2000 mutation probability matrix:
the rules for very distantly related proteins
At PAM2000 every column holds the same values: each row approaches the normalized frequency of its replacement amino acid (A Ala 8.7%, R Arg 4.1%, N Asn 4.0%, D Asp 4.7%, C Cys 3.3%, Q Gln 3.8%, E Glu 5.0%, G Gly 8.9%, …).
Top: original amino acid Side: replacement amino acid Page 68

PAM250 mutation probability matrix
Top: original amino acid Side: replacement amino acid Page 68

PAM250 log odds scoring matrix Page 69

Why do we go from a mutation probability matrix to a log odds matrix?
We want a scoring matrix so that when we do a pairwise alignment (or a BLAST search) we know what score to assign to two aligned amino acid residues. Logarithms are easier to use for a scoring system. They allow us to sum the scores of aligned residues (rather than having to multiply them). Page 69

How do we go from a mutation probability matrix to a log odds matrix?
The cells in a log odds matrix consist of an “odds ratio”:
(the probability that an alignment is authentic) / (the probability that the alignment was random)
The score S for an alignment of residues a,b is given by:
S(a,b) = 10 log10 (Mab/pb)
As an example, for tryptophan,
S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4
Page 69
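The log odds formula can be checked numerically. This short sketch reproduces the tryptophan example using the two numbers given on the slide (Mab = 0.55, pb = 0.010); the function name is an assumption for illustration.

```python
import math


def log_odds_score(m_ab, p_b):
    """S(a,b) = 10 * log10(Mab / pb), as defined on the slide."""
    return 10 * math.log10(m_ab / p_b)


trp_score = log_odds_score(0.55, 0.010)
print(round(trp_score, 1))
```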

Normalized frequencies of amino acids
Arg 4.1% Asn 4.0% Phe 4.0% Gln 3.8% Ile 3.7% His 3.4% Cys 3.3% Tyr 3.0% Met 1.5% Trp 1.0%

What do the numbers mean
in a log odds matrix?
S(a,tryptophan) = 10 log10 (0.55/0.010) = 17.4
A score of +17 for tryptophan means that this alignment is 50 times more likely than a chance alignment of two Trp residues.
S(a,b) = 17
Probability of replacement (Mab/pb) = x
Then
17 = 10 log10 x
1.7 = log10 x
10^1.7 = x = 50
Page 58

What do the numbers mean
in a log odds matrix? A score of +2 indicates that the amino acid replacement occurs 1.6 times as frequently as expected by chance. A score of 0 is neutral. A score of –10 indicates that the correspondence of two amino acids in an alignment that accurately represents homology (evolutionary descent) is one tenth as frequent as the chance alignment of these amino acids. Page 58
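Going the other way, a score can be converted back into an odds ratio by inverting the log odds formula. The sketch below assumes the same 10 × log10 scaling used on the slides; the function name is made up for the example.

```python
def odds_ratio(score):
    """Invert S = 10 * log10(x): return x, the observed/chance odds ratio."""
    return 10 ** (score / 10)


for s in (17, 2, 0, -10):
    print(s, round(odds_ratio(s), 2))
```

The printed values correspond to the interpretations given on the slides: roughly 50-fold for +17, about 1.6-fold for +2, neutral for 0, and one tenth for -10.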

More conserved Less conserved Rat versus mouse globin Rat versus bacterial globin

two nearly identical proteins
two distantly related proteins page 72

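BLOSUM Matrices
BLOSUM matrices are based on local alignments. BLOSUM stands for blocks substitution matrix. BLOSUM62 is calculated from aligned blocks in which sequences sharing at least 62% amino acid identity are first collapsed into a single cluster, so the matrix reflects substitutions between sequences that are less than 62% identical. Page 70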

BLOSUM Matrices
[Figure: BLOSUM80, BLOSUM62, and BLOSUM30 shown on a scale of percent amino acid identity (100, 62, 30); for each matrix, sequences above its identity cutoff are collapsed into a single cluster.]

BLOSUM Matrices
All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins. The BLOCKS database contains thousands of groups of multiple sequence alignments. BLOSUM62 is the default matrix in BLAST 2.0. Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships. A search for distant relatives may be more sensitive with a different matrix. Page 72

Blosum62 scoring matrix Page 73 77


Two randomly diverging protein sequences change in a negatively exponential fashion
[Figure: percent identity plotted against evolutionary distance in PAMs, with the “twilight zone” marked] Page 74

• At PAM1, two proteins are 99% identical
• At PAM10.7, there are 10 differences per 100 residues
• At PAM80, there are 50 differences per 100 residues
• At PAM250, there are 80 differences per 100 residues
[Figure: percent identity plotted against differences per 100 residues, with the “twilight zone” marked] Page 75

PAM: “Accepted point mutation”
Two proteins with 50% identity may have 80 changes per 100 residues. (Why? Because any residue can be subject to back mutations.) Proteins with 20% to 25% identity are in the “twilight zone” and may be statistically significantly related. PAM or “accepted point mutation” refers to the “hits” or matches between two sequences (Dayhoff & Eck, 1968) Page 75 81

Percent identity between two proteins: What percent is significant?
[Figure labels: 100%, 80%, 65%, 30%, 23%, 19%]
We will see in the BLAST lecture that it is appropriate to describe significance in terms of probability (or expect) values. As a rule of thumb, two proteins sharing more than 30% identity over a substantial region are usually homologous.

Two kinds of sequence alignment: global and local
We will first consider the global alignment algorithm of Needleman and Wunsch (1970). We will then explore the local alignment algorithm of Smith and Waterman (1981). Finally, we will consider BLAST, a heuristic version of Smith-Waterman. We will cover BLAST in detail next time. Page 76

Global alignment with the algorithm of Needleman and Wunsch (1970)
• Two sequences can be compared in a matrix along x- and y-axes.
• If they are identical, a path along a diagonal can be drawn.
• Find the optimal subpaths, and add them up to achieve the best score. This involves
--adding gaps when needed
--allowing for conservative substitutions
--choosing a scoring system (simple or complicated)
• N-W is guaranteed to find optimal alignment(s)
Page 76

Three steps to global alignment with the Needleman-Wunsch algorithm
[1] set up a matrix [2] score the matrix [3] identify the optimal alignment(s) Page 76 86

Four possible outcomes in aligning two sequences
[1] identity (stay along a diagonal)
[2] mismatch (stay along a diagonal)
[3] gap in one sequence (move vertically!)
[4] gap in the other sequence (move horizontally!)
Page 77

Start Needleman-Wunsch with an identity matrix
Page 77 89


Fill in the matrix using “dynamic programming”
Page 78 91


Traceback to find the optimal (best) pairwise alignment
Page 79 98

Needleman-Wunsch: dynamic programming
N-W is guaranteed to find optimal alignments, although the algorithm does not search all possible alignments. It is an example of a dynamic programming algorithm: an optimal path (alignment) is identified by incrementally extending optimal subpaths. Thus, a series of decisions is made at each step of the alignment to find the pair of residues with the best score. Page 80 99
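The three steps (set up a matrix, score it by dynamic programming, trace back) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the `needle` program: it uses a simple match/mismatch score and a single linear gap penalty instead of a PAM or BLOSUM matrix with separate gap creation and extension scores, and the default values are arbitrary.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned seq1, aligned seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1

    # [1] Set up the matrix; the first row and column hold leading gaps.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # [2] Score the matrix: each cell extends the best of three subpaths.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # diagonal
                              score[i - 1][j] + gap,            # vertical
                              score[i][j - 1] + gap)            # horizontal

    # [3] Trace back from the bottom-right corner to recover one optimal path.
    top, bottom = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


result = needleman_wunsch("ACGT", "AGT")
print(result)
```

When several paths tie, this traceback prefers the diagonal and reports only one of the optimal alignments.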

Try using needle to implement a Needleman-Wunsch global alignment algorithm to find the optimum alignment (including gaps): Page 81 100

Queries: beta globin (NP_000509) alpha globin (NP_000549) 101

Global alignment versus local alignment
Global alignment (Needleman-Wunsch) extends from one end of each sequence to the other. Local alignment finds optimally matching regions within two sequences (“subsequences”). Local alignment is almost always used for database searches such as BLAST. It is useful to find domains (or limited regions of homology) within sequences. Smith and Waterman (1981) solved the problem of performing optimal local sequence alignment. Other methods (BLAST, FASTA) are faster but less thorough. Page 82 103

How the Smith-Waterman algorithm works
Set up a matrix between two proteins (size m+1, n+1).
No values in the scoring matrix can be negative! S ≥ 0
The score in each cell is the maximum of four values:
[1] s(i-1, j-1) + the new score at [i,j] (a match or mismatch)
[2] s(i,j-1) – gap penalty
[3] s(i-1,j) – gap penalty
[4] zero
Page 82
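The four-way maximum described above can be sketched as follows. This is an illustrative version rather than SSEARCH: it assumes a simple match/mismatch score and a linear gap penalty (stored here as a negative number that is added, which is equivalent to subtracting a positive penalty), and the default values are arbitrary.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (best score, aligned region of seq1, of seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]  # first row and column stay at zero
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(0,                                  # [4] start afresh
                              score[i - 1][j - 1] + sub(i, j),    # [1] match/mismatch
                              score[i][j - 1] + gap,              # [2] gap
                              score[i - 1][j] + gap)              # [3] gap
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTTT", "GGGACGGGG"))
```

Because the traceback starts at the best cell and stops at zero, only the matching subsequences are reported, not the full-length sequences.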

Smith-Waterman algorithm allows the alignment of subsets of sequences
Page 83 105

Try using SSEARCH to perform a rigorous Smith-Waterman local alignment: 106

Queries: beta globin (NP_000509) alpha globin (NP_000549) 107

Rapid, heuristic versions of Smith-Waterman: FASTA and BLAST
Smith-Waterman is very rigorous and it is guaranteed to find an optimal alignment. But Smith-Waterman is slow. It requires computer space and time proportional to the product of the lengths of the two sequences being aligned (or the product of the query length and the size of an entire database). Gotoh (1982) and Myers and Miller (1988) improved the algorithms so both global and local alignment require less time and space. FASTA and BLAST provide rapid alternatives to S-W. Page 84

Statistical significance of pairwise alignment
We will discuss the statistical significance of alignment scores in the next lecture (BLAST). A basic question is how to determine whether a particular alignment score is likely to have occurred by chance. According to the null hypothesis, two aligned sequences are not homologous (evolutionarily related). Can we reject the null hypothesis at a particular significance level alpha? 110

Pairwise alignment: key points
• Pairwise alignments allow us to describe the percent identity two sequences share, as well as the percent similarity.
• The score of a pairwise alignment includes positive values for exact matches, and other scores for mismatches and gaps.
• PAM and BLOSUM matrices provide a set of rules for assigning scores. PAM10 and BLOSUM80 are examples of matrices appropriate for the comparison of closely related sequences. PAM250 and BLOSUM30 are examples of matrices used to score distantly related proteins.
• Global and local alignments can be made.

BLAST BLAST (Basic Local Alignment Search Tool)
allows rapid sequence comparison of a query sequence against a database. The BLAST algorithm is fast, accurate, and web-accessible.

Why use BLAST?
BLAST searching is fundamental to understanding the relatedness of any favorite query sequence to other known proteins or DNA sequences. Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function

Four components to a BLAST search
(1) Choose the sequence (query)
(2) Select the BLAST program
(3) Choose the database to search
(4) Choose optional parameters
Then click “BLAST”

Step 1: Choose your sequence
Sequence can be input in FASTA format or as accession number

Example of the FASTA format for a BLAST query

Step 2: Choose the BLAST program
• blastn (nucleotide BLAST)
• blastp (protein BLAST)
• tblastn (translated BLAST)
• blastx (translated BLAST)
• tblastx (translated BLAST)

Choose the BLAST program
Program   Input     Database   Frames compared
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36

DNA potentially encodes six proteins
5’ CAT CAA
5’ ATC AAC
5’ TCA ACT
5’ CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3’
3’ GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5’
5’ GTG GGT
5’ TGG GTA
5’ GGG TAG
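The six reading frames shown on the slide (three on the top strand, three on the reverse complement) can be enumerated with a short script. This sketch only splits the DNA into codons; it does not translate them, and the function names are assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the bottom strand read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def codons(dna, offset):
    """Split dna into complete triplets starting at the given offset (0, 1, or 2)."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of codons."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(forward, offset)
        frames[f"-{offset + 1}"] = codons(reverse, offset)
    return frames


slide_dna = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"
for label, triplets in six_frames(slide_dna).items():
    print(label, " ".join(triplets[:2]), "...")
```

Translating each list of codons with a genetic code table would give the six potential proteins that blastx, tblastn, and tblastx compare.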

Step 3: choose the database
• nr = non-redundant (most general database)
• dbest = database of expressed sequence tags
• dbsts = database of sequence tag sites
• gss = genomic survey sequences
• htgs = high throughput genomic sequence

Step 4a: Select optional search parameters
organism Entrez! algorithm

Step 4a: optional blastp search parameters
Expect Word size Scoring matrix Filter, mask

Step 4a: optional blastn search parameters
Expect Word size Match/mismatch scores Filter, mask

BLAST: optional parameters
You can... • choose the organism to search • turn filtering on/off • change the substitution matrix • change the expect (e) value • change the word size • change the output format

(a) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Default settings: Unfiltered (“composition-based statistics”) Our starting point: search human insulin against worm RefSeq proteins by blastp using default parameters

(b) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: No compositional adjustment Note that the bit score, Expect value, and percent identity all change with the “no compositional adjustment” option

(c) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: conditional compositional score matrix adjustment Note that the bit score, Expect value, and percent identity all change with the compositional score matrix adjustment

(d) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: Filter low complexity regions Note that the bit score, Expect value, and percent identity all change with the filter option

(e) Query: human insulin NP_000198
Program: blastp Database: C. elegans RefSeq Option: Mask for lookup table only
Filtering: the filtered sequence is the query in lowercase and grayed out.
Note that the bit score, Expect value, and percent identity could change with the “mask for lookup table only” option

Step 4b: optional formatting parameters
Alignment view Descriptions Alignments

BLAST format options

BLAST search output: multiple alignment format

BLAST search output: top portion
database query program taxonomy

taxonomy

BLAST search output: graphical output

BLAST search output: tabular output
High scores, low E values. Cut-off: 0.05? 10^-10?

BLAST search output: alignment output

Outline of today’s lecture
BLAST Practical use Algorithm Strategies Finding distantly related proteins: PSI-BLAST Hidden Markov models BLAST-like tools for genomic DNA PatternHunter Megablast BLAT, BLASTZ

BLAST: background on sequence alignment
There are two main approaches to sequence alignment: [1] Global alignment (Needleman & Wunsch 1970) using dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence “global”).

BLAST: background on sequence alignment
[2] The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. BLAST is a heuristic approximation to local alignment. It examines only part of the search space.

How a BLAST search works
“The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)

How the original BLAST algorithm works: three phases
Phase 1: compile a list of word pairs (w=3) above threshold T
Example: for a human RBP query …FSGTWYA… (query word is in yellow)
A list of words (w=3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS

Phase 1: compile a list of words (w=3)
Word   Pair scores   Total
GTW    6,5,11        22
GSW    6,1,11        18
ATW    0,5,11        16
NTW    0,5,11        16
GTY    6,5,2         13
(neighborhood word hits above threshold)
------ threshold (T=11) ------
GNW                  10
GAW                  9
(neighborhood word hits below threshold)
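The word scores listed above are simply sums of three pair scores, compared against the threshold T. The sketch below recomputes the five totals whose pair scores are shown on the slide; the lookup table contains only those quoted pairs (it is not a full BLOSUM62), and the function names are assumptions for illustration.

```python
# (query residue, candidate residue) -> score, copied from the slide's pair scores.
SLIDE_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def word_score(query_word, candidate):
    """Sum the pair scores position by position."""
    return sum(SLIDE_SCORES[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold):
    """Keep candidate words scoring at least the threshold, best first."""
    scored = [(word_score(query_word, word), word) for word in candidates]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


candidates = ["GTW", "GSW", "ATW", "NTW", "GTY"]
print(neighborhood("GTW", candidates, threshold=11))
print(neighborhood("GTW", candidates, threshold=17))
```

Raising the threshold shrinks the word list, which makes the database scan faster but less sensitive.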

Pairwise alignment scores are determined using a scoring matrix such as BLOSUM62

How a BLAST search works: 3 phases
Phase 2: scan the database for entries that match the compiled list. This is fast and relatively easy.

Phase 3: when you manage to find a hit (i.e. a match between a “word” and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
[Figure labels: “Hit!” at the matching word, with “extend” arrows pointing in both directions]
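The extension phase can be sketched as an ungapped extension that stops once the running score falls too far below the best score seen. This is a simplified illustration: the match/mismatch scores and the drop-off value `x_drop` are assumptions chosen for readability, whereas BLAST scores pairs with a substitution matrix.

```python
def extend_hit(query, subject, q_pos, s_pos, word_len,
               x_drop=5, match=2, mismatch=-1):
    """Extend an ungapped word hit; return (score, q_start, q_end).

    q_pos and s_pos are 0-based starts of the word in each sequence;
    q_end is exclusive. Scoring values and x_drop are illustrative.
    """
    def pair_score(i, j):
        return match if query[i] == subject[j] else mismatch

    score = sum(pair_score(q_pos + k, s_pos + k) for k in range(word_len))
    best, start, end = score, q_pos, q_pos + word_len

    # Extend to the right of the word.
    running = best
    i, j = q_pos + word_len, s_pos + word_len
    while i < len(query) and j < len(subject):
        running += pair_score(i, j)
        i, j = i + 1, j + 1
        if running > best:
            best, end = running, i
        elif best - running > x_drop:
            break

    # Extend to the left, starting from the best score found so far.
    running = best
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:
        running += pair_score(i, j)
        if running > best:
            best, start = running, i
        elif best - running > x_drop:
            break
        i, j = i - 1, j - 1

    return best, start, end


if __name__ == "__main__":
    rbp = "KENFDKARFSGTWYAMAKKDPEG"
    lactoglobulin = "MKGLDIQKVAGTWYSLAMAASD"
    score, start, end = extend_hit(rbp, lactoglobulin, 10, 10, 3)
    print(score, rbp[start:end])
```

With these toy scores the GTW hit between the two slide sequences grows by one residue to GTWY; a substitution matrix that rewards conservative replacements would generally extend further.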

In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly speeding the time required for a search.

How a BLAST search works: threshold
You can modify the threshold parameter. The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options (this is the legacy blastall flag; in command-line BLAST+ the equivalent option is “-threshold”). (To find BLAST+ go to BLAST → Help → Download.)


For blastn, the word size is typically 7, 11, or 15 (EXACT match)
Changing the word size is like changing the threshold for proteins. w=15 gives fewer matches and is faster than w=11 or w=7. For megablast (see below), the word size is 28 and can be adjusted to 64. What will this do? Megablast is VERY fast for finding closely related DNA sequences!
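The effect of word size on exact-match seeding can be illustrated with a small sketch. The sequences and function names below are hypothetical, and real blastn adds extension and statistics on top of this seeding step.

```python
from collections import defaultdict


def index_words(seq, w):
    """Map every overlapping word of length w in seq to its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def exact_word_hits(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a w-mer matches exactly."""
    index = index_words(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query = "ACGTACGTTAGC"      # hypothetical sequences
    subject = "TTACGTTAGG"
    for w in (4, 7):
        print(w, len(exact_word_hits(query, subject, w)))
```

A longer word yields fewer seeds to extend, which is why a large word size is fast but only finds closely related sequences.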

How to interpret a BLAST search: expect value
It is important to assess the statistical significance of search results. For global alignments, the statistics are poorly understood. For local alignments (including BLAST search results), the statistics are well understood. The scores follow an extreme value distribution (EVD) rather than a normal distribution.

[Figure: normal probability distribution; x axis from -5 to 5, probability axis from 0 to 0.40]

The probability density function of the extreme value distribution (characteristic value u=0 and decay constant λ=1)
[Figure: normal distribution and extreme value distribution plotted together; x axis from -5 to 5, probability axis from 0 to 0.40]

How to interpret a BLAST search: expect value
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: $E = Kmn\,e^{-\lambda S}$

$E = Kmn\,e^{-\lambda S}$
This equation is derived from a description of the extreme value distribution.
S = the score
E = the expect value = the number of high-scoring segment pairs (HSPs) expected to occur with a score of at least S
m, n = the lengths of the two sequences
λ, K = Karlin-Altschul statistics
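A small sketch shows how the equation behaves numerically. The values λ = 0.267 and K = 0.041 are assumptions (values commonly reported for gapped BLOSUM62 searches), and the query and database lengths are hypothetical.

```python
import math


def expect_value(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S) for raw score S and lengths m, n."""
    return K * m * n * math.exp(-lam * raw_score)


if __name__ == "__main__":
    LAMBDA, K = 0.267, 0.041      # assumed Karlin-Altschul parameters
    m, n = 200, 1_000_000_000     # hypothetical query and database lengths
    for raw_score in (50, 100, 150):
        print(raw_score, expect_value(raw_score, m, n, LAMBDA, K))
```

E falls exponentially as the score rises and grows in proportion to the search space mn, so the same alignment is less significant in a larger database.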

Some properties of the equation $E = Kmn\,e^{-\lambda S}$
The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values.
The expected score for aligning a pair of random residues must be negative! Otherwise, long random alignments would acquire great scores.
Parameter K is a scaling factor for the search space (database).
For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly.

From raw scores to bit scores
There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores).
Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes:
$S' = \text{bit score} = (\lambda S - \ln K) / \ln 2$
The E value corresponding to a given bit score is:
$E = mn\,2^{-S'}$
Bit scores allow you to compare results between different database searches, even using different scoring matrices.
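The bit-score form of the E value follows from the raw-score form by substitution. A short derivation, using only the symbols already defined on these slides:

```latex
\begin{align*}
S' &= \frac{\lambda S - \ln K}{\ln 2}
   \quad\Longrightarrow\quad \lambda S = S' \ln 2 + \ln K \\
e^{-\lambda S} &= e^{-S' \ln 2}\, e^{-\ln K} = \frac{2^{-S'}}{K} \\
E &= K m n\, e^{-\lambda S} = K m n \cdot \frac{2^{-S'}}{K} = m n\, 2^{-S'}
\end{align*}
```

K and λ cancel out of the final expression, which is why a bit score can be interpreted without knowing which scoring matrix produced it.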

How to interpret BLAST: E values and p values
The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment. $p = 1 - e^{-E}$

How to interpret BLAST: E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values. From $p = 1 - e^{-E}$:
E = 0.1 gives p of about 0.1
E = 0.05 gives p of about 0.05
E = 0.001 gives p of about 0.001
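The relationship between E and p can be tabulated directly from the formula. This is a minimal sketch; the list of E values is chosen only for illustration.

```python
import math


def p_from_e(e_value):
    """p = 1 - exp(-E), computed with expm1 for accuracy at small E."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    for e_value in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
        print(f"E = {e_value:<7} p = {p_from_e(e_value):.8f}")
```

For small E the two numbers nearly coincide, while for E between 1 and 10 the p value crowds up against 1 and becomes hard to tell apart, which is the reason E is the more readable quantity in that range.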

How to interpret BLAST: overview

[Annotated BLAST search summary. Labels: word size w = 3; 10 is the E value; gap penalties; BLOSUM matrix; threshold score = 11; length of database]

[Annotated BLAST search summary, continued. Labels: EVD parameters; 147 – 111 = 36; m; n; mn]
Effective search space = mn = length of query × length of database

Why set the E value to 20,000?
Suppose you perform a search with a short query (e.g. 9 amino acids). There are not enough residues to accumulate a big score (or a small E value). Indeed, a match of 9 out of 9 residues could yield a small score with an E value of 100 or 200. And yet, this result could be “real” and of interest to you. By setting the E value cutoff to 20,000 you do not change the way the search was done, but you do change which results are reported to you.

BLAST search strategies
General concepts
How to evaluate the significance of your results
How to handle too many results
How to handle too few results
BLAST searching with HIV-1 pol, a multidomain protein

Sometimes a real match has an E value > 1
…try a reciprocal BLAST to confirm

Sometimes a similar E value occurs for a short exact match and a long, less exact match
[Figure labels: one alignment is short and nearly exact; the other is long with only 31% identity; both have similar E values]

Assessing whether proteins are homologous
RBP4 and PAEP: Low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins.

The universe of lipocalins (each dot is a protein)
[Figure labels: retinol-binding protein; odorant-binding protein; apolipoprotein D]

BLAST search with PAEP as a query finds many other lipocalins

Using human beta globin as a query, here are the blastp results searching against human RefSeq proteins (PAM30 matrix). Where is myoglobin? It’s absent! We need to use PSI-BLAST.

Two problems standard BLAST cannot solve
[1] Use human beta globin as a query against human RefSeq proteins, and blastp does not “find” human myoglobin. This is because the two proteins are too distantly related. PSI-BLAST at NCBI as well as hidden Markov models easily solve this problem.
[2] How can we search using 10,000 base pairs as a query, or even millions of base pairs? Many BLAST-like tools for genomic DNA are available such as PatternHunter, Megablast, BLAT, and BLASTZ.

Specialized BLAST servers
Organism-specific BLAST sites
Molecule-specific BLAST sites
Specialized algorithms (WU-BLAST 2.0)

Ensembl BLAST output includes an ideogram

Position specific iterated BLAST: PSI-BLAST
The purpose of PSI-BLAST is to look deeper into the database for matches to your query protein sequence by employing a scoring matrix that is customized to your query.

PSI-BLAST is performed in five steps
[1] Select a query and search it against a protein database

[2] PSI-BLAST constructs a multiple sequence alignment, then creates a “profile” or specialized position-specific scoring matrix (PSSM)

[Figure: multiple sequence alignment; residues observed in example columns: R,I,K; C; D,E,T; K,R,T; N,L,Y,G]

[Figure: position-specific scoring matrix. Columns: A R N D C Q E G H I L K M F P S T W Y V. Rows (query positions): ... 37 S, 38 G, 39 T, 40 W, 41 Y, 42 A]

[3] The PSSM is used as a query against the database
[4] PSI-BLAST estimates statistical significance (E values)

[5] Repeat steps [3] and [4] iteratively, typically 5 times. At each new search, a new profile is used as the query.
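The profile-building step can be sketched with a toy alignment. This is a simplified illustration, not PSI-BLAST's actual procedure: it assumes a uniform background frequency of 1/20, a flat pseudocount, and no sequence weighting, and the four aligned sequences are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # skip gaps
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Score an ungapped segment of the same length as the profile."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    toy_alignment = ["GTWYA", "GTWYS", "GSWFA", "ATWYA"]  # hypothetical
    profile = build_pssm(toy_alignment)
    print(score_with_pssm(profile, "GTWYA"), score_with_pssm(profile, "PPPPP"))
```

Unlike PAM or BLOSUM, the score for a residue here depends on its position: an invariant column such as the W in this toy alignment rewards a match heavily and penalizes everything else. Each iteration rebuilds the profile from the newly included hits.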

Results of a PSI-BLAST search
[Figure: for each iteration, the number of hits and the number of hits above threshold]

PSI-BLAST search: human RBP versus RefSeq, iteration 1

RBP4 match to ApoD, PSI-BLAST iteration 1


Scoring matrices let you focus on the big (or small) picture
[Fig. 5.7, page 151. Labels: retinol-binding protein; your RBP query]

Scoring matrices let you focus on the big (or small) picture
[Figure labels: PAM250; PAM30; Blosum80; Blosum45; retinol-binding protein (two panels)]

PSI-BLAST generates scoring matrices more powerful than PAM or BLOSUM
[Figure labels: retinol-binding protein (two panels)]

PSI-BLAST: performance assessment
Evaluate PSI-BLAST results using a database in which protein structures have been solved and all proteins in a group share < 40% amino acid identity.

PSI-BLAST: the problem of corruption
PSI-BLAST is useful to detect weak but biologically meaningful relationships between proteins. The main source of false positives is the spurious amplification of sequences not related to the query. For instance, a query with a coiled-coil motif may detect thousands of other proteins with this motif that are not homologous. Once even a single spurious protein is included in a PSI-BLAST search above threshold, it will not go away.

PSI-BLAST: the problem of corruption
Corruption is defined as the presence of at least one false positive alignment with an E value < $10^{-4}$ after five iterations. Three approaches to stopping corruption:
[1] Apply filtering of biased composition regions
[2] Adjust the E value threshold from the default to a lower value
[3] Visually inspect the output from each iteration. Remove suspicious hits by unchecking the box.

Conserved domain database (CDD) uses RPS-BLAST
Main idea: you can search a query protein against a database of position-specific scoring matrices

Multiple sequence alignment to profile HMMs
• In the 1990s people began to see that aligning sequences to profiles gave much more information than pairwise alignment alone.
• Hidden Markov models (HMMs) consist of “states” that describe the probability of having a particular amino acid residue at each position arranged in a column of a multiple sequence alignment.
• HMMs are probabilistic models.
• Like a hammer is more refined than a blast, an HMM gives more sensitive alignments than traditional techniques such as progressive alignments.

HMMER: build a hidden Markov model
Determining effective sequence number done. [4]
Weighting sequences heuristically done.
Constructing model architecture done.
Converting counts to probabilities done.
Setting model name, etc done. [x]
Constructed a profile HMM (length 230)
Average score: bits
Minimum score: bits
Maximum score: bits
Std. deviation: bits

HMMER: calibrate a hidden Markov model
HMM file: lipocalins.hmm
Length distribution mean: 325
Length distribution s.d.: 200
Number of samples:
random seed:
histogram(s) saved to: [not saved]
POSIX threads:
HMM : x
mu :
lambda :
max :

HMMER: search an HMM against GenBank
Scores for complete sequences (score includes all domains):
Sequence Description Score E-value N
gi| |ref|XP_ | (XM_129259) ret e
gi|132407|sp|P04916|RETB_RAT Plasma retinol e
gi| |ref|XP_ | (XM_005907) sim e
gi| |ref|NP_ | (NM_006744) ret e
gi| |sp|P02753|RETB_HUMAN Plasma retinol e
.
gi| |ref|NP_ | (NC_003197) out e
gi| |ref|NP_ |: domain 1 of 1, from 1 to 195: score 454.6, E = 1.7e-131
*->mkwVMkLLLLaALagvfgaAErdAfsvgkCrvpsPPRGfrVkeNFDv
mkwV++LLLLaA + +aAErd Crv+s frVkeNFD+
gi| MKWVWALLLLAA--W--AAAERD------CRVSS----FRVKENFDK 33
erylGtWYeIaKkDprFErGLllqdkItAeySleEhGsMsataeGrirVL
+r++GtWY++aKkDp E GL+lqd+I+Ae+S++E+G+Msata+Gr+r+L
gi| ARFSGTWYAMAKKDP--E-GLFLQDNIVAEFSVDETGQMSATAKGRVRLL 80
eNkelcADkvGTvtqiEGeasevfLtadPaklklKyaGvaSflqpGfddy
+N+++cAD+vGT+t++E dPak+k+Ky+GvaSflq+G+dd+
gi| NNWDVCADMVGTFTDTE DPAKFKMKYWGVASFLQKGNDDH 120
Fig. 5.13 Page 159

PFAM is a database of HMMs and an essential resource for protein families

BLAST-related tools for genomic DNA
The analysis of genomic DNA presents special challenges:
• There are exons (protein-coding sequence) and introns (intervening sequences).
• There may be sequencing errors or polymorphisms.
• The comparison may be between related species (e.g. human and mouse).

BLAST-related tools for genomic DNA
Recently developed tools include:
• MegaBLAST at NCBI.
• BLAT (BLAST-like alignment tool). BLAT parses an entire genomic DNA database into words (11mers), then searches them against a query. Thus it is a mirror image of the BLAST strategy.
• SSAHA at Ensembl, which uses a strategy similar to that of BLAT.
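The “mirror image” strategy can be sketched by indexing the database rather than the query. This is a toy illustration with hypothetical sequences and a short word length; indexing non-overlapping words follows BLAT's published description and is an assumption beyond what the slide states.

```python
from collections import defaultdict


def index_database(db, k):
    """Index the non-overlapping k-mers of a database sequence by position."""
    index = defaultdict(list)
    for pos in range(0, len(db) - k + 1, k):
        index[db[pos:pos + k]].append(pos)
    return index


def lookup_query(index, query, k):
    """Look up every overlapping k-mer of the query; return (q_pos, db_pos)."""
    hits = []
    for q in range(len(query) - k + 1):
        for pos in index.get(query[q:q + k], ()):
            hits.append((q, pos))
    return hits


if __name__ == "__main__":
    database = "AAAACCCCGGGGTTTT"   # hypothetical genomic sequence
    index = index_database(database, 4)
    print(lookup_query(index, "TTCCCCGG", 4))
```

The index is built once and reused for every query, so the cost of scanning a genome is paid up front instead of at search time.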

PatternHunter

MegaBLAST at NCBI

MegaBLAST

To access BLAT, visit http://genome.ucsc.edu
“BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 20 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein blat on land vertebrates.” --BLAT website

Paste DNA or protein sequence here in the FASTA format

BLAT output includes browser and other formats

Blastz

Blastz (laj software): human versus rhesus duplication

Blastz (laj software): human versus rhesus gap

Where we are in the course
--We started with “bioinformatics databases”
--We next covered pairwise alignment, then BLAST, in which one sequence is compared to a database
--Next we’ll describe multiple sequence alignment
--We’ll then visualize multiple sequence alignments as phylogenetic trees. That topic spans molecular evolution.

Lab exercises Self-Test Quiz
