Step 3: Tools Database Searching


1 Step 3: Tools Database Searching
• Sequence alignment
• Scoring matrices
• Significance of an alignment
• BLAST: algorithm
• BLAST: parameters
• BLAST: output
• Alignment significance in BLAST
Database similarity searching is one of the first and most important steps in analysing a new sequence. If your unknown sequence has a similar copy already in the databases, a search will quickly reveal this, and if the copy is well annotated you need go little further in trying to identify your sequence. Database searches usually provide the first clues as to whether the sequence belongs to an already studied and well-known protein family. If there is similarity to a sequence from another species, the two may be homologous (i.e. descended from a common ancestral sequence). Knowing the function of a similar/homologous sequence will often give a good indication of the identity of the unknown sequence. N.B. To identify homologous sequences, search at the protein sequence level, because this is about 5 times more sensitive at finding matches than searching at the DNA level. ©CMBI 2003

2 Why do sequence database searching?
Identify similarities between novel query sequences, whose structures and functions are unknown and uncharacterized, and sequences in (public) databases whose structures and functions have been elucidated. N.B. The similarity might span the entire query sequence or just part of it!
• What have I cloned?
• Is this really “my gene”?
• Has someone else already found it?
• Is it interesting anyway?
• What is it related to?
• Can I get more sequence easily?

3 Database searching (2)
The query sequence is compared/aligned with every sequence in the database. High-scoring database sequences are assumed to be evolutionarily related to the query sequence. If sequences are related by divergence from a common ancestor, they are said to be homologous. We will come back to significance later, in the BLAST output.
The assumptions:
• The sequences being sought have an evolutionary ancestral sequence in common with the query sequence
• The best guess at the actual path of evolution is the path that requires the fewest evolutionary events (most parsimonious)
• Not all substitutions are equally likely, so they should be weighted accordingly
• Insertions and deletions are less likely than substitutions and should be weighted accordingly
Identification of relationships between novel and database sequences is relatively easy when levels of similarity are high (> 50% sequence identity). Below 50% sequence identity it becomes increasingly difficult to establish relationships reliably.

4 Sequence Alignment
The purpose of a sequence alignment is to line up, in any number of sequences, all residues that were derived from the same residue position in the ancestral gene or protein.
gap = insertion or deletion
Which residue is the most important for aligning? Cys, because it is the most conserved.
A multiple sequence alignment is a 2D table in which the rows represent individual sequences and the columns the residue positions. Sequences are laid onto this grid in such a manner that (a) the relative positioning of residues within any one sequence is preserved, and (b) similar residues in all the sequences are brought into vertical register. J.Leunissen

5 Scoring Matrix/Substitution Matrix
• Used to score the quality of an alignment
• Contains scores for pairs of residues (amino acids or nucleic acids) in a sequence alignment
• For protein/protein comparisons: a 20 x 20 matrix of similarity scores in which identical amino acids and those of similar character (e.g. Ile, Leu) score higher than those of different character (e.g. Ile, Asp)
• Symmetric
• Contains scores for matches between residues, according to observed substitution rates across large evolutionary distances
Scoring matrices are designed to detect signal above background, i.e. to detect similarities beyond what would be observed by chance alone. All algorithms that compare protein sequences rely on some scheme to score the equivalencing of each of the 210 possible pairs of amino acids: 190 pairs of different amino acids, (20 x 20 - 20) / 2 = 190, plus 20 pairs of identical amino acids. The choice of matrix determines both the pattern and the extent of substitution in the sequences the database search is most likely to discover.
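The count of 210 pairs can be checked by enumeration. The snippet below is a small illustrative sketch (the variable names are not from the slides); it lists every unordered pair of the 20 standard amino acids, including each residue paired with itself.

```python
from itertools import combinations_with_replacement

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Unordered pairs, so (A, D) and (D, A) count once; self-pairs are included.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
identical = [p for p in pairs if p[0] == p[1]]
different = [p for p in pairs if p[0] != p[1]]

print(len(pairs), len(identical), len(different))  # 210 20 190
```

Because the matrix is symmetric, these 210 values are all that a 20 x 20 substitution matrix needs to store.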

6 Substitution Matrices
Not all amino acids are equal
• Some are more easily substituted than others
• Some mutations occur more often
• Some substitutions are kept more often
Mutations tend to favor some substitutions
• Some amino acids have similar codons
• They are more likely to be changed by DNA mutation
Selection tends to favor some substitutions
• Some amino acids have similar properties/structure
• They are more likely to be kept
The two forces together yield substitution matrices (from computational biology)
Example of codons: TTT and TTC code for Phe; TTA and TTG code for Leu

7 PAM250 Matrix
1) Both axes carry the one-letter codes of the 20 amino acids; note the blocks of similar amino acids.
2) Symmetric, so only one half is shown.
3) Diagonal: for example, a high score for matching tryptophans and a “low” score for matching alanines.
   * Cysteine
   * Leu is abundant
4) Off-diagonal: groups of similar amino acids; K -> F scores -5.
A score above zero assigned to two amino acids indicates that these two .. each other more often than expected by chance alone, i.e. they are functionall.. exchangeable. A negative score indicates that the two amino acids are rarely .. interchangeable, e.g. a basic amino acid for an acidic one, or one with an … side chain for one with an aliphatic side chain.

8 Scoring example
The score of an alignment is the sum of the scores of all pairs of residues in the alignment.
sequence 1:  T  C  C  P  S  I  V  A  R  S  N
sequence 2:  S  C  C  P  S  I  S  A  R  N  T
pair score:  1 12 12  6  2  5 -1  2  6  1  0   => alignment score = 46
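The sum in this example can be reproduced in a few lines. This is a minimal sketch: the table holds only the pair scores that appear on the slide, not a full substitution matrix, and the function names are illustrative.

```python
# Pair scores for the residue pairs in this example, as read off the slide.
PAIR_SCORES = {
    ("T", "S"): 1, ("C", "C"): 12, ("P", "P"): 6, ("S", "S"): 2,
    ("I", "I"): 5, ("V", "S"): -1, ("A", "A"): 2, ("R", "R"): 6,
    ("S", "N"): 1, ("N", "T"): 0,
}


def pair_score(a, b):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def alignment_score(seq1, seq2):
    """Sum the pair scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


score = alignment_score("TCCPSIVARSN", "SCCPSISARNT")
print(score)  # 46
```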

9 Dayhoff Matrix (1)
• Derived from how often different amino acids replace other amino acids in evolution.
• Created from a dataset of closely similar protein sequences (less than 15% amino acid difference), which could be aligned unambiguously.
• A mutation probability matrix was derived in which the entries reflect the probabilities of a mutational event. This matrix is called PAM 1.
• An evolutionary distance of 1 PAM (point accepted mutation) means there has been 1 accepted point mutation per 100 residues.
Possibly the most widely used scheme for scoring amino acid pairs is that developed by Dayhoff and co-workers. The system arose out of a general model for the evolution of proteins. The data date from 1978 (Atlas of Protein Sequence and Structure): 1572 changes in 71 groups of closely related proteins. Newer PAM matrices do not differ greatly from the original ones.
Dayhoff and co-workers examined alignments of closely similar sequences, where the likelihood of a particular mutation (e.g. A-D) being the result of a set of successive mutations (e.g. A-x-y-D) was low. Since relatively few families were considered, the resulting matrix of accepted point mutations included a large number of entries equal to 0 or 1. A complete picture of the mutation process, including those amino acids which did not change, was determined by calculating the average ratio of the number of changes a particular amino acid type underwent to the total number of amino acids of that type present in the database. The resulting probabilities apply to a particular evolutionary distance, for example 2 PAM. (PAM is also expanded as "Percentage of Acceptable point Mutations per 10^8 years" or "percent accepted mutation".) 1 PAM corresponds to an average change in 1% of all amino acid positions.
In outline: take a list of aligned proteins; every time you see a substitution between two amino acids, increment the similarity score between them; then normalize by how often the amino acids occur in general. Rare amino acids will give rare substitutions.
PAM model of molecular evolution: after 100 PAMs of evolution, not every residue will have changed; some will have mutated several times, perhaps returning to their original state, and others not at all. Note that there is no general correspondence between PAM distance and evolutionary time, as different protein families evolve at different rates. The probabilities represent the average mutational change that will take place when 1 residue out of 100 undergoes mutation = 1 PAM. Two sequences 1 PAM apart have 99% identical residues.

10 Dayhoff Matrix (2)
Log odds matrix: logs of elements of the PAM matrix.
Score of mutation a ↔ b:
$$ S_{ab} = \log \frac{\text{observed } a \leftrightarrow b \text{ mutation rate}}{\text{mutation rate expected from amino acid frequencies}} $$
When using a log odds matrix, the total score of the alignment is given by the sum of the scores for each aligned pair of residues.
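As a rough illustration of the log odds idea, the sketch below scores one residue pair from an observed pair frequency and the two background frequencies. The numbers, the base-10 logarithm and the scale factor of 10 are assumptions made for the example, not values from the slides.

```python
import math


def log_odds_score(observed_pair_freq, freq_a, freq_b, scale=10.0):
    """Scaled log of (observed rate / rate expected from residue frequencies).

    The expected rate assumes the two residues pair independently, i.e. it is
    the product of their background frequencies.
    """
    expected = freq_a * freq_b
    return scale * math.log10(observed_pair_freq / expected)


# Hypothetical frequencies: both residues make up 5% of the data.
favoured = log_odds_score(0.005, 0.05, 0.05)      # seen twice as often as chance
disfavoured = log_odds_score(0.00125, 0.05, 0.05)  # seen half as often as chance

print(round(favoured, 2), round(disfavoured, 2))
```

A pair observed more often than chance gets a positive score and a pair observed less often gets a negative one; because the scores are logarithms, adding them along an alignment corresponds to multiplying the underlying odds ratios.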

11 Dayhoff Matrix (3)
PAM 1 may be used to generate matrices for greater evolutionary distances by multiplying it repeatedly by itself (taking powers of the PAM 1 matrix).
PAM250:
• 2.5 mutations per residue
• equivalent to 20% matches remaining between two sequences, i.e. approximately 80% of the amino acid positions are observed to have changed
• is the default in many analysis packages
However, in principle it is more effective to use a matrix that corresponds to the actual evolutionary distance between the sequences being compared.
Notes:
• 2 PAM = 10^8 years?? To be looked up.
• Rule of thumb: PAM 1 = 1 million years.
• PAM 10: on the diagonal S = 7, W = 13. Man vs gorilla: on average 1-2 amino acids different. You cannot say anything about exchanges.
• PAM 85: on the diagonal 4-13. Man and horse?
• PAM 250: this is about as far as you can go; beyond that you are comparing human with slime mold. PAM250 corresponds to ca. 20% overall sequence identity, the lowest sequence similarity for which we can hope to produce a correct alignment by sequence analysis alone.
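Repeated multiplication of a one-step mutation matrix can be sketched on a toy example. The 3-letter alphabet and its probabilities below are invented for illustration and are not Dayhoff's data; each residue has a 1% chance of changing per step, mirroring the definition of 1 PAM.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1" for a hypothetical 3-letter alphabet: 99% chance of no change,
# the remaining 1% split evenly over the two other letters.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(toy_pam1, 250)

# Probability that a residue is found unchanged after 1 and after 250 steps.
print(toy_pam1[0][0], round(toy_pam250[0][0], 3))
```

The diagonal shrinks as the number of steps grows, while each row still sums to 1. With only three letters the diagonal can never drop below 1/3, so the toy numbers are not comparable with the 20% identity quoted for the real PAM250.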

12 BLOSUM Matrix
Limit of the Dayhoff matrix: matrices based on the Dayhoff model of evolutionary rates are of limited value because their substitution rates are derived from alignments of sequences that are at least 85% identical.
An alternative approach has been developed by Henikoff and Henikoff, using local multiple alignments of more distantly related sequences. NO EXTRAPOLATION NECESSARY: they examine multiple alignments of distantly related proteins directly, rather than extrapolating from closely related sequences.
The advantage is that it stays closer to observation; a disadvantage is that it yields no evolutionary model. A number of tests suggest that the BLOSUM matrices produced by this method are generally superior to the PAM matrices for detecting biological relationships.

13 BLOSUM Matrix (2)
The BLOSUM matrices (BLOcks SUbstitution Matrix) are based on the BLOCKS database. The BLOCKS database uses the concept of blocks (ungapped amino acid patterns), which act as signatures of a family of proteins. Substitution frequencies for all pairs of amino acids were calculated from these blocks and used to calculate a log odds BLOSUM matrix. Different matrices are obtained by varying the identity threshold. For example, the BLOSUM80 matrix was derived using blocks of 80% identity.
The matrices are built only from the most conserved domains of the BLOCKS database of conserved proteins. Dataset: 2000 blocks of aligned sequence … segments characterizing more than 500 groups of related proteins (1992).

14 Which Matrix to use?
• Close relationships: low PAM, high BLOSUM
• Distant relationships: high PAM, low BLOSUM
• Reasonable defaults: PAM250, BLOSUM62
At the level of 2,000 PAM, Schwartz and Dayhoff suggest that all the information present in the matrix has degenerated, except that the matrix element for Cys-Cys is 10% higher than would be expected by chance. At the evolutionary distance of 256 PAMs one amino acid in five remains unchanged, but the amino acids vary in their mutability: 48% of the tryptophans, 41% of the cysteines and 20% of the histidines would be unchanged, but only 7% of the serines would remain. J.Kissinger

15 Significance of alignment (1)
When is an alignment statistically significant? In other words: how different is the observed alignment score from the scores obtained by aligning random sequences to the query sequence? Or: what is the probability that an alignment with this score could have arisen by chance?

16 Significance of alignment (2)
Database size = 20 x 10^6 letters (SwissProt has 30 x 10^6 letters, but 20 makes the arithmetic easier)
peptide    #hits
A          1 x 10^6
AP         50,000
IAP        2,500
LIAP       125
WLIAP      6
KWLIAP     0.3
KWLIAPY    0.015
Hexapeptide search: 20^6 = 64 x 10^6 possibilities; SwissProt has 30 x 10^6 letters => a hexapeptide found in SwissProt is pure chance.
Tripeptide search: 20^3 = 8000 possibilities; if the size of the database is 8000 letters, every tripeptide is expected to occur once!
Always remember: mathematical significance ≠ biological significance
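The hit counts for the peptide series follow from one division per added residue. The sketch below assumes, as the slide implicitly does, that all 20 amino acids are equally frequent and that every database letter is a possible start position; the function name is illustrative.

```python
DATABASE_LETTERS = 20_000_000  # 20 x 10^6 letters, as on the slide
ALPHABET_SIZE = 20             # 20 amino acids, assumed equally frequent


def expected_hits(peptide, database_letters=DATABASE_LETTERS):
    """Expected number of chance occurrences of a peptide of this length."""
    return database_letters / ALPHABET_SIZE ** len(peptide)


for peptide in ["A", "AP", "IAP", "LIAP", "WLIAP", "KWLIAP", "KWLIAPY"]:
    print(f"{peptide:8s} {expected_hits(peptide):g}")
```

Each extra residue divides the expectation by 20, which is why a four-residue match is expected 125 times by chance while a seven-residue match is expected far less than once.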

17 BLAST – Basic Local Alignment Search Tool
Question: what database sequences are most similar to (or contain the most similar regions to) my previously uncharacterised sequence?
• Finds the highest scoring locally optimal alignments between a query sequence and a database
• Very fast algorithm
• Can be used to search extremely large databases (uses a pre-indexed database, which contributes to its great speed)
• Sufficiently sensitive and selective for most purposes
• Robust: the default parameters can usually be used
Searching a database needs to be fast and sensitive, but the two objectives counteract each other. BLAST is fast and still has a high sensitivity for detecting distant sequence relationships between a query sequence and a database.
Input = sequence. Output = list of sequences that match the query sequence.

18 BLAST Algorithm, Step 1
For a given word length w (usually 3 for proteins) and a given score matrix: create a list of all words (w-mers) that score at least T when compared to w-mers from the query.
Query sequence: L N K C K T P Q G Q R L V N Q
Word: P Q G
Neighborhood words (T = 13): P Q G 18, P E G 15, P R G 14, P K G 14, P N G 13, P D G 13, P M G 13
Below threshold: P Q A 12, P Q N 12, etc.
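The neighborhood list for the word PQG can be reproduced with a small scoring table. This is a sketch under two assumptions: the pair scores are the BLOSUM62 entries needed for this example (they give the same word scores as the slide), and only the candidate words shown on the slide are scored, whereas a real implementation would consider all 20^3 possible words.

```python
# BLOSUM62 entries needed for this example only (assumed matrix; see lead-in).
SCORES = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "K"): 1,
    ("Q", "N"): 0, ("Q", "D"): 0, ("Q", "M"): 0,
    ("G", "A"): 0, ("G", "N"): 0,
}


def pair_score(a, b):
    """Symmetric lookup in the partial score table."""
    return SCORES[(a, b)] if (a, b) in SCORES else SCORES[(b, a)]


def word_score(query_word, candidate):
    """Score a candidate word against the query word, position by position."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


QUERY_WORD = "PQG"
THRESHOLD = 13  # T on the slide
CANDIDATES = ["PQG", "PEG", "PRG", "PKG", "PNG", "PDG", "PMG", "PQA", "PQN"]

scored = {word: word_score(QUERY_WORD, word) for word in CANDIDATES}
# Words scoring exactly T are kept, matching the slide (PNG, PDG, PMG score 13).
neighborhood = [word for word, s in scored.items() if s >= THRESHOLD]

print(neighborhood)  # ['PQG', 'PEG', 'PRG', 'PKG', 'PNG', 'PDG', 'PMG']
```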

19 BLAST Algorithm, Step 2
Each neighborhood word gives all positions in the database where it is found (hit list).
Neighborhood words: P Q G 18, P E G 15, P R G 14, P K G 14, P N G 13, P D G 13, P M G 13
(Figure: the neighborhood word PMG located in the database.)

20 BLAST Algorithm, Step 3 The program tries to extend matching segments (seeds) out in both directions by adding pairs of residues. Residues will be added until the incremental score drops below a threshold.
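Seed extension can be sketched as follows. The code is a simplified illustration, not BLAST's implementation: it extends only to the right (the left side is symmetric), uses an invented match/mismatch score instead of a substitution matrix, and interprets the stopping rule as "stop once the running score has fallen more than `drop_off` below the best score seen so far". All names and parameter values are hypothetical.

```python
def simple_score(a, b):
    """Toy scoring: +5 for a match, -4 for a mismatch (illustrative values)."""
    return 5 if a == b else -4


def extend_right(query, subject, q_start, s_start, word_len, drop_off,
                 score_fn=simple_score):
    """Extend an ungapped seed to the right; return (best_score, best_length)."""
    # Score the seed itself.
    score = sum(score_fn(query[q_start + i], subject[s_start + i])
                for i in range(word_len))
    best_score, best_len = score, word_len
    length = word_len
    # Add one residue pair at a time while both sequences have residues left.
    while q_start + length < len(query) and s_start + length < len(subject):
        score += score_fn(query[q_start + length], subject[s_start + length])
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        elif best_score - score > drop_off:
            break  # fallen too far below the best score: stop extending
    return best_score, best_len


result = extend_right("PQGQRLVN", "PQGQRAAA", q_start=0, s_start=0,
                      word_len=3, drop_off=6)
print(result)  # (25, 5)
```

In this example the seed PQG is extended over two further matching residues and the extension is abandoned after two consecutive mismatches, so the reported segment is the five-residue match with score 25.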

21 Basic BLAST Algorithms
• BLASTN: compares a nucleotide query to a nucleotide database
• BLASTP: compares a protein query to a protein database
• BLASTX: compares a nucleotide query sequence, translated in all reading frames, against a protein sequence database
• TBLASTN: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
• TBLASTX: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database
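The five programs differ only in the type of the query, the type of the database and the level at which the comparison is made. The lookup below restates the list as data; the function and the string labels are illustrative and are not part of any BLAST interface.

```python
# program name -> (query type, database type, level at which they are compared)
PROGRAMS = {
    "BLASTN": ("nucleotide", "nucleotide", "nucleotide"),
    "BLASTP": ("protein", "protein", "protein"),
    "BLASTX": ("nucleotide", "protein", "protein"),
    "TBLASTN": ("protein", "nucleotide", "protein"),
    "TBLASTX": ("nucleotide", "nucleotide", "protein"),
}


def choose_program(query_type, database_type, compare_as):
    """Return the program name for a given combination, or raise ValueError."""
    for name, spec in PROGRAMS.items():
        if spec == (query_type, database_type, compare_as):
            return name
    raise ValueError("no basic BLAST program for this combination")


print(choose_program("nucleotide", "protein", "protein"))  # BLASTX
```

The three translated programs all compare at the protein level, in line with the earlier remark that protein-level searches are more sensitive for finding homologs.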

22 PSI-BLAST: Position-Specific Iterated BLAST
Distant relationships are often best detected by motif or profile searches rather than pairwise comparisons.
PSI-BLAST first performs a gapped BLAST database search. It then uses the information from any significant alignments returned to construct a position-specific score matrix, which replaces the query sequence for the next round of database searching: the matrix generated from significant alignments found in round i is used in round i + 1. PSI-BLAST may be iterated until no new significant alignments are found. No further details are given here.
PSI-BLAST is needed in cases of low sequence similarity, when no clearly significant hits can be found. BLAST uses a generalized matrix. This may not be as sensitive as a motif search, but it is very general and easy to use.

23 BLAST Input
Steps in running BLAST:
• Enter your query sequence (cut-and-paste)
• Select the database(s) you want to search
• Choose output parameters
• Choose alignment parameters (e.g. scoring matrix, filters, ...)
Example query =
MAFIWLLSCYALLGTTFGCGVNAIHPVLTGLSKIVNGEEAVPGTWPWQVTLQDRSGFHFC
GGSLISEDWVVTAAHCGVRTSEILIAGEFDQGSDEDNIQVLRIAKVFKQPKYSILTVNND
ITLLKLASPARYSQTISAVCLPSVDDDAGSLCATTGWGRTKYNANKSPDKLERAALPLLT
NAECKRSWGRRLTDVMICGAASGVSSCMGDSGGPLVCQKDGAYTLVAIVSWASDTCSASS
GGVYAKVTKIIPWVQKILSSN
Filters will be discussed in the exercises:
• Low complexity regions can cause spurious hits
• Low complexity regions occur in 25-30% of the proteins
• Filter out low complexity in your query
• Filter the most over-represented items in your database
There is one frequent case where the random models, and therefore the statistics discussed here, break down. As many as one fourth of all residues in protein sequences occur within regions with highly biased amino acid composition. Alignments of two regions with similarly biased composition may achieve very high scores that owe virtually nothing to residue order but are due instead to segment composition. Alignments of such "low complexity" regions have little meaning in any case: since these regions most likely arise by gene slippage, the one-to-one residue correspondence imposed by alignment is not valid. While it is worth noting that two proteins contain similar low complexity regions, they are best excluded when constructing alignments [42-44]. The BLAST programs employ the SEG algorithm [43] to filter low complexity regions from proteins before executing a database search.
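The idea behind low complexity filtering can be illustrated with a sliding-window entropy mask. This is not the SEG algorithm itself, only a simplified stand-in: the window length, the entropy cut-off and the use of "X" as the masking character are assumptions chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, min_entropy=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A hypothetical sequence with a serine run between two diverse segments.
example = "ACDEFGHIKLMN" + "S" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(example))
```

A run of a single residue has an entropy of 0 bits and is masked, while a window of twelve different residues has an entropy of about 3.6 bits and is left alone. Windows that straddle the boundary of the run may also fall below the cut-off, so the mask can extend a few residues into the flanking sequence.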

24 BLAST Output (1)

25 BLAST Output (2) A low probability indicates that a match is unlikely to have arisen by chance. A high score or, preferably, clusters of high scores indicate a likely relationship.

26 BLAST Output (3) Low scores with high probabilities suggest that matches have arisen by chance.

27 Alignment Significance in BLAST
P-value (probability): relates the score returned for an alignment to the likelihood of its having arisen by chance; in general, the closer the value is to zero, the greater the confidence that the match is real. P = probability that the HSP was generated as a chance alignment.
E-value (expect value): the number of alignments with a given score that would be expected to occur at random in the database that has been searched (e.g. if E = 10, 10 matches with scores this high are expected to be found by chance). A match will only be reported if its E value falls below the threshold set. Lower E thresholds are more stringent and report fewer matches.
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. In BLAST 2.0, the Expect value is also used instead of the P value (probability) to report the significance of matches. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
E=
The Expect value is used as a convenient way to create a significance threshold for reporting results. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
• Orthologs will have extremely significant scores: DNA , protein 10^-30
• Closely related paralogs will have significant scores: protein 10^-15
• Distantly related homologs may be hard to identify: protein 10^-4
Orthologs: the sequences have diverged by speciation, e.g. human, mouse and chicken α hemoglobin.
Paralogs: the sequences have diverged by gene duplication, e.g. the α and β hemoglobin genes.
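The exponential dependence of E on the score S can be written in the standard Karlin-Altschul form used in the BLAST literature. It is given here as background and is not taken from the slides: m is the query length, n the total database length, and K and λ are statistical parameters that depend on the scoring system.

```latex
% Expected number of chance alignments with raw score at least S
E = K \, m \, n \, e^{-\lambda S}

% Probability of finding at least one such alignment by chance
P = 1 - e^{-E}

% The same expectation expressed with the normalized bit score S'
S' = \frac{\lambda S - \ln K}{\ln 2}, \qquad E = m \, n \, 2^{-S'}
```

For small values E and P are nearly equal (P ≈ E when E is well below 1), which is why reporting E instead of P loses little information for significant hits. Doubling the database size n doubles E for the same score, so an E value is only meaningful for the database that was actually searched.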

28 BLAST Output (4)

29 BLAST Output (5)

30 BLAST Output (6)

