Presentation on theme: "Rationale for searching sequence databases"— Presentation transcript:
1 Rationale for searching sequence databases. May 11, 2004.
Writing projects due May 25. Quiz #3 on Thurs., May 20.
Learning objectives: Why do we search sequence databases? Understand the Smith-Waterman algorithm of local alignment and the concept of backtracing. FASTA and BLAST programs. Psi-BLAST.
Workshop: Use of Psi-BLAST to determine sequence similarities.
Homework: Due May 20.
2 Why search sequence databases?
1. I have just sequenced a gene. What is known about the gene I sequenced?
2. I have a unique sequence. Is there similarity to another gene that has a known function?
3. I found a new gene in a lower organism. Is it similar to a gene from another species?
4. I have decided to work on a new gene. The people in the field will not give me the plasmid. I need the complete cDNA sequence to perform PCR.
3 Perfect Searches
First “hit” should be an exact match.
Next “hits” should contain all of the genes that are related to your gene (homologs).
Next “hits” should be similar but are not homologs.
4 How does one achieve the “perfect search”?
Comparison matrices (PAM vs. BLOSUM)
Database search algorithms
Databases
Search parameters:
  Expect value: change threshold for score reporting
  Translation: of DNA sequence into protein
  Filtering: remove repeat sequences
5 Smith-Waterman Algorithm. Advances in Applied Mathematics, 2:482-489 (1981).
The Smith-Waterman algorithm is a local alignment tool used to obtain sensitive pairwise similarity alignments. The Smith-Waterman algorithm uses dynamic programming. Operating via a matrix, the algorithm uses backtracing, tests alternative paths to the highest scoring alignments, and selects the optimal path as the highest ranked alignment. The sensitivity of the Smith-Waterman algorithm makes it useful for finding local areas of similarity between sequences that are too dissimilar for global alignment. The S-W algorithm uses a lot of computer memory. BLAST and FASTA are other search algorithms that use some aspects of S-W.
6 Smith-Waterman (cont. 1)
a. It searches for both full and partial sequence matches.
b. Assigns a score to each pair of amino acids:
   - uses similarity scores
   - uses positive scores for related residues
   - uses negative scores for substitutions and gaps
c. Initializes edges of the matrix with zeros.
d. As the scores are summed in the matrix, any sum below 0 is recorded as a zero.
e. Begins backtracing at the maximum value found anywhere in the matrix.
f. Continues the backtrace until the score falls to 0.
7 Smith-Waterman (cont. 2)
[Matrix figure: sequence H E A G A W G H E E against sequence P A W H E]
Put zeros on borders. Assign initial scores based on a scoring matrix. Calculate new scores based on adjacent cell scores. If the sum is less than or equal to zero, begin new scoring with the next cell.
8 Smith-Waterman (cont. 3)
[Matrix figure: sequence H E A G A W G H E E against sequence P A W H E]
Begin backtrace at the maximum value found anywhere on the matrix. Continue the backtrace until the score falls to zero.
AWGHE
|| ||
AW-HE
Score = 28
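The matrix fill and backtrace described on the Smith-Waterman slides can be sketched in Python. This is a minimal sketch, not the implementation used in the lecture: it assumes a simple match/mismatch score and a linear gap penalty (the values +2, -1 and -2 are illustrative assumptions) instead of a substitution matrix such as BLOSUM, so its score will not equal the 28 shown on the slide. The function name `smith_waterman` is hypothetical.

```python
def smith_waterman(seq1, seq2, match=2, mismatch=-1, gap=-2):
    """Local alignment with a toy scoring scheme (assumed values, not BLOSUM)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # Borders (row 0 and column 0) are initialized with zeros.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            # Any sum below zero is recorded as zero.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # align the two residues
                          H[i - 1][j] + gap,     # gap in seq2
                          H[i][j - 1] + gap)     # gap in seq1
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtrace from the maximum value until the score falls to zero.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


score, top, bottom = smith_waterman("HEAGAWGHEE", "PAWHE")
print(score, top, bottom)
```

With these assumed parameters the sketch recovers the same aligned region as the slide, AWGHE over AW-HE, with a toy score of 6.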
9 Calculation of percent similarity
A W G H E
A W - H E
Blosum45 scores; -3 gap ext. penalty
% similarity = (number of positive scores / number of AAs in region) x 100
% overall similarity = (number of positive scores / number of total AAs in region) x 100
% similarity = 4/5 x 100 = 80%
% overall similarity = 4/5 x 100 = 80%
Similarity score = 28
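The percent-similarity arithmetic can be checked with a few lines of Python. This is a sketch under stated assumptions: `identity_score` is a hypothetical stand-in for a substitution matrix (identical residues score positive, everything else negative), and gap positions are never counted as positive.

```python
def identity_score(a, b):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 1 if a == b else -1


def percent_similarity(aligned1, aligned2, score=identity_score):
    """Percent of aligned positions whose pair score is positive."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    if not aligned1:
        return 0.0
    positive = sum(
        1
        for a, b in zip(aligned1, aligned2)
        if a != "-" and b != "-" and score(a, b) > 0
    )
    return 100.0 * positive / len(aligned1)


print(percent_similarity("AWGHE", "AW-HE"))  # 4 positive positions out of 5
```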
10 FASTA (Pearson and Lipman 1988)
This is a combination of word search and the Smith-Waterman algorithm.
The query sequence is divided into small words of a certain size.
The initial comparison of the query sequence to the database is performed using these “words”.
If these “words” are located on the same diagonal in an array, the regions surrounding the diagonals are analyzed further.
Search time is proportional only to the size of the database, not to (database x query sequence).
11 The FASTA program uses hash tables. These tables speed the process of word search.
Query sequence = TCTCTC (position numbers 1-6)
Database sequence = TTCTCTC (position numbers 1-7)
You choose to use word size = 4 for your table (total number of words in your table is 4^4 = 256).
Sequence (total of 256) | Position w/in query | Position w/in DB | Offset (Q minus DB)
TCTC | 1, 3 | 2, 4 | -1 or -3 or 1
CTCT | 2 | 3 | -1
TTCT | | 1 |
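The word lookup and offset calculation for this example can be sketched with Python dictionaries playing the role of the hash table. The function names are hypothetical, and this is only an illustration of the idea, not FASTA's actual data structure. Positions are 1-based, and the offset is the query position minus the database position.

```python
from collections import Counter, defaultdict


def word_positions(seq, k):
    """Map each word of length k to its 1-based start positions in seq."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i + 1)
    return table


def word_offsets(query, db, k):
    """For every word shared by query and db, list offsets (Q minus DB)."""
    q_table = word_positions(query, k)
    d_table = word_positions(db, k)
    offsets = {}
    for word, q_pos in q_table.items():
        if word in d_table:
            offsets[word] = [q - d for q in q_pos for d in d_table[word]]
    return offsets


offsets = word_offsets("TCTCTC", "TTCTCTC", 4)
print(offsets)

# Words sharing an offset lie on the same diagonal; count hits per diagonal.
diagonals = Counter(o for values in offsets.values() for o in values)
print(diagonals.most_common(1))
```

Here the offset -1 collects three word hits, which is the diagonal a FASTA-style search would go on to examine.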
12 FASTA Steps
[Figure with four panels]
1. Local regions of identity are found (identical offset values in a contiguous sequence versus different offset values).
2. Rescore the local regions using PAM or Blos. matrix. Diagonals are extended.
3. Eliminate short diagonals below a cutoff score.
4. Create a gapped alignment in a narrow segment and then perform S-W alignment.
13 Summary of FASTA steps
1. Analyzes database for identical matches that are contiguous (between 5 and 10 amino acids in length, with the same offset values).
2. Longest diagonals are scored again using the PAM matrix (or other matrix). The best scores are saved as “init1” scores.
3. Short diagonals are removed.
4. Long diagonals that are neighbors are joined. The score for this joined region is “initn”. This score may be lower due to a penalty for a gap.
5. An S-W dynamic programming alignment is performed around the joined sequences to give an “opt” score.
Thus, the time-consuming S-W step is performed only on top scoring sequences.
14 The ktup value
The ktup (for k-tuples) value stands for the length of the word used to search for identity.
For proteins a ktup value of 3 would give a hash table of 20^3 elements (8000 entries).
The higher the ktup value the less likely you will get a match unless it is identical (remember the dot plots).
The lower the ktup value the more background you will have.
The higher the ktup value the faster the analysis (fewer diagonals).
The following rules typically apply when using FASTA:
ktup    analysis
        proteins - distantly related
        proteins - somewhat related (default)
        DNA - default
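The table sizes quoted for the hash tables follow from alphabet size raised to the word length. A short sketch (the function name is hypothetical):

```python
def hash_table_size(alphabet_size, ktup):
    """Number of distinct words of length ktup over the given alphabet."""
    return alphabet_size ** ktup


# 20 amino acids, ktup = 3, and 4 nucleotides, word size = 4
print(hash_table_size(20, 3), hash_table_size(4, 4))
```

The growth is steep: each extra residue in a protein word multiplies the number of possible words by 20, which is why larger ktup values produce far fewer chance matches.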
15 FASTA Versions
FASTA: nucleotide or protein sequence searching
FASTx/FASTy: compares a translated DNA query sequence to a protein sequence database (forward or backward translation of the query)
tFASTx/tFASTy: compares a protein query sequence to a DNA sequence database that has been translated into three forward and three reverse reading frames
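The "three forward and three reverse reading frames" can be illustrated with a small Python sketch. It only splits a DNA string into the six codon-aligned frames and stops short of translating codons to amino acids; the function names are hypothetical and the input is assumed to contain only the uppercase letters A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Return the 3 forward and 3 reverse frames, trimmed to whole codons."""
    frames = []
    for strand in (dna, reverse_complement(dna)):
        for start in range(3):
            frame = strand[start:]
            # Drop trailing bases that do not fill a complete codon.
            frames.append(frame[: len(frame) - len(frame) % 3])
    return frames


print(six_frames("ATGCGT"))
```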
16 FASTA Statistical Significance
A way of measuring the significance of a score considers the mean of the random score distribution. The difference between the similarity score for your single alignment and the mean of the random score distribution is normalized by the standard deviation of that random score distribution. This is the Z-score. Higher Z-scores are better because the further the real score is from this mean (in standard deviation units) the more significant it is.
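17 FASTA Statistical Significance
Z score for a single alignment:
$$Z = \frac{\text{similarity score} - \text{mean score from database}}{\text{standard deviation from database}}$$
$$\text{Stand. Dev.} = \sqrt{\frac{\sum \text{scores}^2 - \dfrac{\left(\sum \text{scores}\right)^2}{N}}{N}}, \qquad N = \text{total \# of sequences}$$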
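The Z-score calculation can be sketched in Python as follows. This assumes the standard deviation is the population form computed from the sum of scores and the sum of squared scores over the total number of sequences; the function name and the example scores are made up for illustration.

```python
import math


def z_score(score, db_scores):
    """Z-score of one similarity score against a list of database scores."""
    n = len(db_scores)
    total = sum(db_scores)
    total_sq = sum(s * s for s in db_scores)
    mean = total / n
    # Population standard deviation from the sum and sum of squares.
    sd = math.sqrt((total_sq - total ** 2 / n) / n)
    return (score - mean) / sd


# Illustrative scores only: mean 5, standard deviation 2
example_scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_score(9, example_scores))
```

A score of 9 against this toy distribution sits two standard deviations above the mean.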
18 [Figure: score distributions, labeled "Mean similarity scores of complete database" and "Mean similarity scores of related records"]
19 FASTA statistics (cont.)
Using the distribution of the z-scores in the database, the FASTA program can estimate the number of sequences that would be expected to produce, purely by chance, a z-score greater than or equal to the z-score obtained in the search. This is reported as the E() value. This value is the number of sequences you would expect to find with this score by searching a database of random sequences. Thus, as the z-score increases, the E() value decreases.
20 Evaluating the Results of FASTA
Best
SCORES Init1:  Initn:  Opt: 2847
z-score:  E(): 1.4e-138
Smith-Waterman score: 2847; 100.0% identity in 413 overlap
Good
SCORES Init1: 719 Initn: 748 Opt: 793
z-score:  E(): 3.8e-34
Smith-Waterman score: 796; 41.3% identity in 378 overlap
Mediocre
SCORES Init1: 249 Initn: 304 Opt: 260
z-score:  E(): 8.3e-07
Smith-Waterman score: 270; 35.0% identity in 183 overlap
21 BLAST: Basic Local Alignment Search Tool
Speed is achieved by:
Pre-indexing the database before the search
Parallel processing
Using a hash table that contains neighborhood words rather than just random words
22 Neighborhood words
The program declares a hit if the word taken from the query sequence has a score >= T when a scoring matrix is used.
This allows the word size W (similar to the ktup value) to be kept high (for speed) without sacrificing sensitivity.
If T is increased by the user, the number of background hits is reduced and the program will run faster.
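The effect of the threshold T on the neighborhood can be illustrated with a toy enumeration. This is a sketch only: `toy_score` is a hypothetical scoring function (+4 for identical residues, -1 otherwise) standing in for a real scoring matrix, and the four-letter alphabet is an assumption chosen to keep the output small.

```python
from itertools import product


def toy_score(a, b):
    """Hypothetical stand-in for a scoring-matrix lookup."""
    return 4 if a == b else -1


def neighborhood_words(query_word, alphabet, threshold, score=toy_score):
    """All words of the same length scoring >= threshold against query_word."""
    hits = []
    for candidate in product(alphabet, repeat=len(query_word)):
        total = sum(score(q, c) for q, c in zip(query_word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


print(len(neighborhood_words("ACD", "ACDE", 7)))  # exact word plus single substitutions
print(neighborhood_words("ACD", "ACDE", 8))       # raising T leaves only the exact word
```

Raising T from 7 to 8 shrinks the neighborhood from ten words to one, which is the trade-off described above: fewer background hits and a faster search, at some cost in sensitivity.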
24 Comparison Matrices
In general, the BLOSUM series is thought to be superior to the PAM series because it is derived from areas of conserved sequences. It is important to vary the parameters when performing a sequence comparison. Similarity scores for truly related sequences are usually not sensitive to changes in scoring matrix and gap penalty. Thus, if your “hits list” holds up after changing these parameters you can be more sure that you are detecting similar sequences.
25 Which program should one use?
Most researchers use methods for determining local similarities:
Smith-Waterman (gold standard)
FASTA and BLAST: do not find every possible alignment of query with database sequence. These are used because they run faster than S-W.
26 What are the different BLAST programs?
blastp: compares an amino acid query sequence against a protein sequence database
blastn: compares a nucleotide query sequence against a nucleotide sequence database
blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page.
27 When to use the correct program
Problem | Program | Explanation
Identify unknown protein | BLASTP; FASTA3 | General protein comparison. Use ktup=2 for speed; ktup=1 for sensitive search.
 | Smith-Waterman | Slower than FASTA3 and BLAST but provides maximum sensitivity.
 | TFASTX3; TFASTY3; TBLASTN | Use if homolog cannot be found in protein databases; approx. 33% slower.
 | Psi-BLAST | Finds distantly related sequences. It replaces the query sequence with a position-specific score matrix after an initial BLASTP search. Then it uses the matrix to find distantly related sequences.
28 When to use the correct program (cont. 1)
Problem | Program | Explanation
Identify new orthologs | TFASTX3; TFASTY3; TBLASTN; TBLASTX | Use PAM matrix <=20 or BLOSUM90 to avoid detecting distant relationships. Search EST sequences w/in the same species.
Identify EST sequence | FASTX3; FASTY3; BLASTX; TBLASTX | Always attempt to translate your sequence into protein prior to searching.
Identify DNA sequence | FASTA; BLASTN | Nucleotide sequence comparison
29 Choosing the database
Remember that the E value increases approximately linearly with database size.
When searching for distant relationships always use the smallest database likely to contain the homolog of interest.
Thought problem: If the E-value one obtains for a search is 12 in Swiss-PROT and the E-value one obtains for the same search is 74 in PIR, how large is PIR compared to Swiss-PROT? 74/12 = ~6, so PIR is about 6 times the size of Swiss-PROT.
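The thought problem relies only on the approximately linear relationship between E value and database size, which can be written out as a short sketch. Both function names are hypothetical, and the linear scaling is the approximation stated on the slide rather than an exact rule.

```python
def relative_db_size(e_small_db, e_large_db):
    """Estimated size ratio of two databases from E values of the same hit."""
    return e_large_db / e_small_db


def rescale_e_value(e_value, old_db_size, new_db_size):
    """Approximate E value of the same hit in a database of a different size."""
    return e_value * new_db_size / old_db_size


print(round(relative_db_size(12, 74), 1))  # about 6
print(rescale_e_value(12, 1, 6))
```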
30 Filtering Repetitive Sequences
Over 50% of genomic DNA is repetitive. This is due to:
retrotransposons
ALU region
microsatellites
centromeric sequences, telomeric sequences
5’ untranslated region of ESTs
Example of ESTs with simple low-complexity regions:
T27311
GGGTGCAGGAATTCGGCACGAGTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
31 Filtering Repetitive Sequences (cont. 1)
Programs like BLAST have the option of filtering out low-complexity regions.
Repetitive sequences increase the chance of a match during a database search.
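The idea of masking a simple repeat before searching can be sketched with a regular expression. This is a toy filter, not the low-complexity filtering that BLAST itself applies: it assumes an uppercase ACGT sequence, treats any unit of 1 to 3 bases repeated at least five times in a row as low complexity, and replaces it with N. The function name and the repeat threshold are assumptions.

```python
import re

# A unit of 1-3 bases followed by at least four more copies of itself.
SIMPLE_REPEAT = re.compile(r"([ACGT]{1,3})\1{4,}")


def mask_simple_repeats(dna, mask_char="N"):
    """Replace tandem repeats of short units with a masking character."""
    return SIMPLE_REPEAT.sub(lambda m: mask_char * len(m.group(0)), dna)


print(mask_simple_repeats("GGCACGAG" + "TC" * 6))
```

A masked query keeps its length and coordinates, so hits outside the repeat are reported at the same positions, while the TC run can no longer seed chance matches.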
32 PSI-BLAST
PSI: position-specific iterative.
A position-specific scoring matrix (PSSM) is constructed automatically from multiple HSPs of the initial BLAST search. Normal E value is used.
This PSSM is used as the new scoring matrix for a second BLAST search. Low E value is used (E = .001).
Result:
1) obtain distantly related sequences
2) find out the important residues that provide function or structure
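The notion of a position-specific matrix can be illustrated with a simplified sketch that tabulates residue frequencies per column of already-aligned hits. This is not how PSI-BLAST builds its PSSM (which uses log-odds scores with pseudocounts and sequence weighting); the function names and the three example hits are hypothetical.

```python
from collections import Counter


def column_frequencies(aligned_hits):
    """Per-column residue frequencies for equal-length aligned segments."""
    length = len(aligned_hits[0])
    if any(len(hit) != length for hit in aligned_hits):
        raise ValueError("all aligned hits must have the same length")
    matrix = []
    for col in range(length):
        counts = Counter(hit[col] for hit in aligned_hits)
        matrix.append({res: n / len(aligned_hits) for res, n in counts.items()})
    return matrix


def profile_score(matrix, segment):
    """Score a segment by summing the frequency of its residue in each column."""
    return sum(col.get(res, 0.0) for col, res in zip(matrix, segment))


profile = column_frequencies(["ACD", "ACE", "ACD"])
print(profile)
print(profile_score(profile, "ACD"), profile_score(profile, "ACE"))
```

Columns where every hit agrees (A and C here) contribute the most to the score, which mirrors the second result listed above: conserved positions point to residues that matter for function or structure.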