BLAST.

Presentation on theme: "BLAST."— Presentation transcript:

1 BLAST

2 What is BLAST? “BLAST® (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. The BLAST programs have been designed for speed, with a minimal sacrifice of sensitivity to distant sequence relationships. The scores assigned in a BLAST search have a well-defined statistical interpretation, making real matches easier to distinguish from random background hits. BLAST uses a heuristic algorithm which seeks local as opposed to global alignments and is therefore able to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990).”

3 Selecting BLAST programme
blastp: Compares an amino acid query sequence against a protein sequence database.
blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn: Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
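The program choice above depends only on the type of the query and the type of the database, so it can be sketched as a lookup table. This is a minimal illustration; the names `PROGRAMS` and `choose_programs` are hypothetical and not part of any BLAST distribution.

```python
# Map (query type, database type) to the BLAST programs described on the slide.
# The dictionary and function names are illustrative only.
PROGRAMS = {
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def choose_programs(query_type, database_type):
    """Return the programs that accept this query/database pairing."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAMS:
        raise ValueError("unknown combination: %r" % (key,))
    return PROGRAMS[key]


if __name__ == "__main__":
    print(choose_programs("nucleotide", "protein"))
    print(choose_programs("nucleotide", "nucleotide"))
```

For a nucleotide query against a nucleotide database, two programs apply: blastn compares the sequences directly, while tblastx compares their six-frame translations.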

4 Selecting the Database (protein)
nr: All non-redundant GenBank CDS translations + PDB + SwissProt + PIR.
month: All new or revised GenBank CDS translations + PDB + SwissProt + PIR released in the last 30 days.
swissprot: The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.
patents: Protein sequences derived from the Patent division of GenBank.
yeast: Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all yeast protein sequences. It is a database of the protein translations of the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic CDS translations.
pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank.
alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences.

5 Nucleotide Databases
nr: All non-redundant GenBank + EMBL + DDBJ + PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month: All new or revised GenBank + EMBL + DDBJ + PDB sequences released in the last 30 days.
dbest: Non-redundant database of GenBank + EMBL + DDBJ EST divisions.
dbsts: Non-redundant database of GenBank + EMBL + DDBJ STS divisions.
mouse ests: The non-redundant database of GenBank + EMBL + DDBJ EST divisions limited to the organism mouse.
human ests: The non-redundant database of GenBank + EMBL + DDBJ EST divisions limited to the organism human.
other ests: The non-redundant database of GenBank + EMBL + DDBJ EST divisions for all organisms except mouse and human.
yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all yeast nucleotide sequences, but the sequence fragments from the complete yeast genome.
E. coli: E. coli (Escherichia coli) genomic nucleotide sequences.

6 Entering your Sequence
The BLAST web pages accept input sequences in three formats: FASTA sequence format, NCBI accession numbers, or GIs.
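A rough way to tell the three input formats apart is sketched below. The rules are assumptions made for illustration: a FASTA record starts with a `>` definition line, a GI is treated as a plain integer, and the accession pattern is a deliberately loose heuristic that will not cover every real accession style. The function name `classify_input` is hypothetical.

```python
import re

# Loose accession heuristic: 1-6 letters, optional underscore, digits,
# optional ".version" suffix (e.g. NP_000509.1). Illustrative only.
ACCESSION = re.compile(r"[A-Za-z]{1,6}_?\d+(\.\d+)?")


def classify_input(text):
    """Guess which input format a pasted string uses."""
    stripped = text.strip()
    if stripped.startswith(">"):
        return "fasta"       # FASTA records begin with a '>' definition line
    if stripped.isdigit():
        return "gi"          # GI numbers are plain integers
    if ACCESSION.fullmatch(stripped):
        return "accession"
    return "unknown"         # e.g. a bare sequence with no definition line


if __name__ == "__main__":
    for example in (">seq1 demo\nMKVLAAG", "129295", "NP_000509.1", "MKVLAAG"):
        print(repr(example), "->", classify_input(example))
```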

7 Setting Up a Query
Is the query sequence represented in the database? Choose a current nucleic acid database. Select from among organism-specific (e.g., yeast), inclusive (e.g., nonredundant), or specialized (e.g., dbEST, dbSTS, GSS, HTG) databases, and search with blastn.

8 Are there homologs or evolutionary relatives of the query sequence in the database? Are there proteins whose function is related to the query sequence? Choose a protein database if the query is a protein, or DNA expected to encode a protein, because amino acid searches are more sensitive. Use blastp for amino acid queries and blastx for translated nucleic acid queries. Use tblastn or tblastx to compare an amino acid or translated nucleic acid query against a translated nucleic acid database.

9 Search parameters
Special cases covered: Short Query, Large Sequence Family, Ungapped BLAST.

Parameter        Default     Special cases
Filter           on          off
Scoring Matrix   BLOSUM62    PAM30 for queries of 35 residues and under
Word Size        3           3, or reduce to 2
E value          10          1000 or more
Gap costs        11,1        4
Alignments       50          2000
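The parameter summary above can be turned into a small helper that starts from the defaults and adjusts them for a short query or a large sequence family. This is a sketch under stated assumptions: only entries whose meaning is clear from the slides are encoded (filter, matrix, E value, number of alignments), gap costs are left at the default, and the names `DEFAULTS` and `suggest_parameters` are hypothetical.

```python
# Default protein-search settings as listed on the slide.
DEFAULTS = {
    "filter": True,
    "matrix": "BLOSUM62",
    "word_size": 3,
    "expect": 10,
    "gap_costs": (11, 1),
    "alignments": 50,
}

SHORT_QUERY_LENGTH = 35  # "PAM30 for 35 and under"


def suggest_parameters(query_length, large_family=False):
    """Return a copy of the defaults adjusted for the special cases."""
    params = dict(DEFAULTS)
    if query_length <= SHORT_QUERY_LENGTH:
        # Short queries: turn filtering off, use PAM30, raise the E value.
        params.update(filter=False, matrix="PAM30", expect=1000)
    if large_family:
        # Large sequence families: request many more alignments.
        params["alignments"] = 2000
    return params


if __name__ == "__main__":
    print(suggest_parameters(20))
    print(suggest_parameters(300, large_family=True))
```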

10 Filter The default setting will filter repetitive or low-complexity sequences from the query using the SEG (protein) or DUST (nucleic acid) programs. If a low-complexity region in the query is of interest, filtering will need to be turned off. If the number of hits returned is small when searching with a short query, it may help to re-search with filtering turned off. The Human repeat filter option masks human repeats such as LINEs and SINEs and is especially useful for human sequences that may contain these repeats.

11 Scoring Matrices
BLOSUM62 is the default matrix. The BLOSUM matrix assigns a probability score for each position in an alignment that is based on the frequency with which that substitution is known to occur within conserved blocks of related proteins. BLOSUM62 has been empirically shown to be among the best for detecting weak protein similarities. Other supported options include PAM30, PAM70, BLOSUM80, and BLOSUM45. Short query sequences can only produce short alignments, and therefore database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices may be used instead.
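Substitution scores of this kind are usually log-odds values: the observed frequency of a residue pair in related proteins is compared with the frequency expected by chance, and the logarithm of the ratio is scaled and rounded. The sketch below assumes a scale of two (half-bit units, as commonly stated for BLOSUM62) and uses invented frequencies purely for illustration; `log_odds_score` is a hypothetical helper.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b, scale=2.0):
    """Rounded log-odds score: scale * log2(observed / expected by chance).

    pair_freq is the observed frequency of the ordered residue pair;
    freq_a and freq_b are the background frequencies of the two residues.
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


if __name__ == "__main__":
    # Invented frequencies: a pair seen four times more often than chance
    # scores positively, a pair seen half as often scores negatively.
    print(log_odds_score(0.02, 0.1, 0.05))
    print(log_odds_score(0.0025, 0.1, 0.05))
```

A positive score means the substitution occurs more often in conserved blocks than chance predicts; a negative score means it occurs less often.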

12 Gap opening and gap extension penalties
No alternate scoring matrices are available for BLASTN.

BLAST program   Default Gap Penalty (G)   Default Gap Extension Penalty (E)   Other supported (G) and (E) values
blastp          -11                       -1                                  -10, -1; -10, -2; -11, -1; -8, -2; -9, -2
blastn          -5                        -2                                  none
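To see how the two penalties combine, the sketch below computes the total cost of a single gap under an affine scheme. It assumes the convention in which the extension penalty is charged for every gapped position, including the first, so a gap of length k costs G + k·E. The function name `gap_cost` is hypothetical.

```python
def gap_cost(length, gap_open=11, gap_extend=1):
    """Total penalty (as a positive number) for one gap of `length` positions.

    Assumes the opening cost is paid once and the extension cost is paid
    for every gapped position, including the first.
    """
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # blastp defaults (11, 1) and blastn defaults (5, 2) from the table.
    for k in (1, 3, 5):
        print(k, gap_cost(k), gap_cost(k, gap_open=5, gap_extend=2))
```

Under this convention a long gap is much cheaper than several short gaps of the same total length, because the opening cost is paid only once.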

13 E value threshold The E value for an alignment score "S" represents the number of hits with a score equal to or better than "S" that would be "expected" by chance (the background noise) when searching a database of a particular size. The default E value for blastn, blastp, blastx and tblastn is 10. At this setting, 10 hits with scores equal to or better than the defined alignment score, S, are expected to occur by chance (in a search of the database using a random query with similar length). Increase the E value to 1000 or more when searching with a short query, since it is likely to be found many times by chance in a given database.
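The dependence of the E value on score and database size is commonly written as E = K·m·n·e^(−λS), where m is the query length, n the database length, and K and λ are statistical parameters of the scoring system. The formula is not given on the slide, and the sketch below makes simplifying assumptions: it uses raw rather than effective lengths, and the K and λ values are placeholders of the magnitude often quoted for gapped BLOSUM62 searches.

```python
import math

# Placeholder statistical parameters (assumed, for illustration only).
LAMBDA = 0.267
K = 0.041


def expect_value(raw_score, query_len, db_len, lam=LAMBDA, k=K):
    """Expected number of chance hits scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


if __name__ == "__main__":
    query_len = 250            # hypothetical query length (residues)
    db_len = 50_000_000        # hypothetical database length (residues)
    for score in (40, 60, 80, 100):
        print(score, expect_value(score, query_len, db_len))
```

Because E scales with m·n, the same score is less significant in a larger database, and a short query needs a higher threshold before any hit is reported.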

14 Alignments If the number of alignments requested (x) is fewer than the number exceeding the significance threshold, only the top (x) hits will be reported. To detect low-similarity matches, the number of alignments to be shown should be increased when searching with a member of a large sequence family.
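The interaction between the E value threshold and the number of alignments requested can be sketched as a filter followed by a truncation. The hit names and E values below are invented, and `report_hits` is a hypothetical helper, not a BLAST function.

```python
def report_hits(hits, expect_threshold=10.0, max_alignments=50):
    """Keep hits at or below the E threshold, best first, at most x of them.

    hits is an iterable of (name, e_value) pairs.
    """
    significant = [hit for hit in hits if hit[1] <= expect_threshold]
    significant.sort(key=lambda hit: hit[1])   # lowest E value first
    return significant[:max_alignments]


if __name__ == "__main__":
    toy_hits = [("hitC", 3.0), ("hitA", 1e-40), ("hitD", 25.0), ("hitB", 2e-5)]
    print(report_hits(toy_hits))
    print(report_hits(toy_hits, max_alignments=2))
```

With a large sequence family, many hits pass the threshold, so a small `max_alignments` would hide the weaker, low-similarity members at the bottom of the list.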

15 Analyzing the output Step 1. Examine the alignment scores and statistics. Scores for each position of an alignment are derived from a substitution matrix. The raw score "S" of the alignment is usually calculated by summing the scores for each letter-to-letter and letter-to-null position in the alignment. The bit score is calculated from the raw score by normalizing with the statistical variables that define a given scoring system. Therefore, bit scores from different alignments, even those employing different scoring matrices, can be compared.
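The two steps described above, summing per-position scores into a raw score and normalizing it into a bit score, can be sketched as follows. The substitution scores are a small excerpt of BLOSUM62 covering only the residues in the example, gaps are charged with affine costs of 11 and 1, and the bit score uses the usual normalization S' = (λS − ln K) / ln 2 with placeholder values for λ and K. All function names are hypothetical.

```python
import math

# Excerpt of BLOSUM62, limited to the residue pairs used in the example.
SCORES = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "L"): 2, ("K", "K"): 5,
    ("K", "R"): 2, ("W", "W"): 11, ("G", "G"): 6,
}


def pair_score(a, b):
    """Look up a symmetric substitution score (KeyError if not in excerpt)."""
    key = (a, b) if (a, b) in SCORES else (b, a)
    return SCORES[key]


def raw_score(aligned_query, aligned_subject, gap_open=11, gap_extend=1):
    """Sum letter-to-letter scores and letter-to-null (gap) penalties.

    Simplification: consecutive gapped columns count as one gap even if
    the gap switches from one sequence to the other.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    total, in_gap = 0, False
    for a, b in zip(aligned_query, aligned_subject):
        if a == "-" or b == "-":
            if not in_gap:
                total -= gap_open      # pay the opening cost once per gap
            total -= gap_extend        # pay extension for every gapped column
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total


def bit_score(raw, lam=0.267, k=0.041):
    """Normalize a raw score; lam and k are placeholder parameters."""
    return (lam * raw - math.log(k)) / math.log(2)


if __name__ == "__main__":
    s = raw_score("WKAL-G", "WRAIAG")
    print(s, bit_score(s))
```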

16 The higher the score, the better the alignment
There is no widely accepted theory for selecting gap costs. It is rarely necessary to change the gap opening or extension values from the default.

17 Statistics Local alignments with no gaps are referred to as high-scoring segment pairs (HSPs). For gapped alignments, the significance of a given alignment with score S is represented by the E (Expect) value (shown in the right-most column in the output), the expected number of chance alignments with a score of S or better. The E value decreases exponentially as the score (S) that is assigned to a match between two sequences increases. A convenient way to create a significance threshold for reporting hits is to alter the E value. When the Expect value threshold is increased from the default value of 10, more hits can be reported.

18 Step 2. Examine the alignments
Descriptions The highest-scoring alignments are described by one-line summaries called "descriptions". The description lines are sorted by increasing E value, so the most significant alignments (lowest E values) are at the top.

19 Graphic Representation

20 At the top is a linear map of the query. Each bar drawn below the map represents a protein (or protein fragment) that matches the query sequence. The position of each bar relative to the linear map of the query allows the user to see instantly the extent to which the database matches align with a single or multiple regions of the query. The most similar hits are shown at the top in red. Pink, green, blue and black bars follow, representing proteins in decreasing order of similarity.
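The color coding can be expressed as a mapping from alignment score to color. The slide gives only the order of the colors; the score boundaries below (200, 80, 50, 40) are an assumption based on the color key commonly shown in the NCBI graphical overview, and `bar_color` is a hypothetical helper.

```python
def bar_color(score):
    """Map an alignment score to the bar color used in the overview.

    The numeric boundaries are assumed, not taken from the slide.
    """
    if score >= 200:
        return "red"
    if score >= 80:
        return "pink"
    if score >= 50:
        return "green"
    if score >= 40:
        return "blue"
    return "black"


if __name__ == "__main__":
    for s in (350, 120, 65, 45, 30):
        print(s, bar_color(s))
```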

21 PSI-BLAST Position-Specific Iterated (PSI) BLAST is a program based on the BLAST 2.0 algorithm that is designed to detect weak relationships between the query and members of the database not necessarily detectable by standard BLAST searches. The added sensitivity of this program over regular BLAST comes from the use of a profile that is constructed (automatically) from a multiple alignment of the highest-scoring hits in the initial BLAST search. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is then used to perform additional BLAST searches (called iterations), and the results of each iteration are used to refine the profile.

22 PSI-BLAST analysis is useful both for identifying the distant members of a protein family, whose relationship is not recognizable by straight sequence comparison, and also for deducing the function of hypothetical proteins that are unannotated in the database. A PSI-BLAST query is identical to a BLAST query with added specification by the use of the expectation (E) value cut-off for inclusion of a match in the first and subsequent iterations. The initial PSI-BLAST search uses the same matrix options available for Gapped BLAST, since it is a Gapped BLAST search. The user can continue to search iteratively until satisfied that no new matches will be identified. The point at which no new hits are identified by additional searches is known as "convergence".
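The iterate-until-convergence control flow can be sketched independently of the search itself. In the code below, `search` stands in for one PSI-BLAST round: it is a caller-supplied function that takes the query and the set of hits already included in the profile, and returns the hits passing the inclusion threshold. Both `iterate_to_convergence` and the toy search are hypothetical and model only the loop, not profile construction.

```python
def iterate_to_convergence(search, query, max_iterations=10):
    """Repeat the search until an iteration finds no new hits.

    Returns (included hits, iterations run, converged flag).
    """
    included = set()
    for iteration in range(1, max_iterations + 1):
        hits = set(search(query, included))
        new_hits = hits - included
        if not new_hits:
            return included, iteration, True    # convergence reached
        included |= new_hits                    # refine the "profile"
    return included, max_iterations, False


# Toy stand-in: each included sequence makes its neighbours detectable.
NEIGHBOURS = {"query": {"A", "B"}, "A": {"C"}, "B": set(), "C": set()}


def toy_search(query, included):
    found = set(NEIGHBOURS[query])
    for hit in included:
        found |= NEIGHBOURS[hit]
    return found | included


if __name__ == "__main__":
    print(iterate_to_convergence(toy_search, "query"))
```

In the toy run, sequence C is not detectable from the query alone and only appears in the second iteration, once A has been included. This mirrors how distant family members are picked up through intermediate hits.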

23 Motif searching with PHI-BLAST
A new service called Pattern Hit Initiated BLAST (PHI-BLAST), which searches for particular patterns in protein queries, is now available in Version 2.0 of the BLAST program suite. PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence. PHI-BLAST searches the specified database for other protein sequences that also contain the input pattern and have significant similarity to the query sequence in the vicinity of the pattern occurrences. PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used to initiate one or more rounds of PSI-BLAST searching.
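The pattern-matching half of this search can be illustrated by locating a pattern within a protein sequence. The sketch assumes the pattern is written in a small subset of PROSITE-style syntax (elements separated by `-`, `x` for any residue, `[ABC]` for alternatives, `{ABC}` for exclusions, `(n)` or `(n,m)` for repeats) and converts it to a Python regular expression. It does not perform the similarity search around each occurrence, and the function names are hypothetical.

```python
import re

# One pattern element: a residue spec followed by an optional repeat count.
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def pattern_to_regex(pattern):
    """Convert a PROSITE-style pattern (subset) to a regex string."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residues, low, high = match.groups()
        if residues == "x":
            core = "."                              # any residue
        elif residues.startswith("{"):
            core = "[^" + residues[1:-1] + "]"      # excluded residues
        else:
            core = residues                         # one letter or [ABC]
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)


def find_pattern(pattern, sequence):
    """Return start positions (0-based) of non-overlapping occurrences."""
    regex = re.compile(pattern_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(pattern_to_regex("[LIV]-{P}-x(1,3)-G"))
    print(find_pattern("C-x(2)-C", "AACGHCAA"))
```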

