C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 8 Database searching (2)

PSI-BLAST Position-Specific Iterated BLAST A profile (called PSSM by BLAST – Position Specific Scoring Matrix) is derived from the result of the first search (using a single query sequence) Database is searched against the profile (instead of a sequence) in subsequent rounds Up to 3-10 iterations are recommended

1.Query sequences are first scanned for the presence of so- called low-complexity regions (Wooton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; are excluded from alignment. 2.The program then initially operates on a single query sequence by performing a gapped BLAST search 3.Then, the program takes significant local alignments (hits) found, constructs a multiple alignment (master-slave alignment) and abstracts a position-specific scoring matrix (PSSM) from this alignment. 4.The database is rescanned in a subsequent round, now using the PSSM, to find more homologous sequences. Iteration continues until user decides to stop or the search has converged PSI-BLAST steps in words

Profile a Profile is a generalized form of sequence probabilities instead of a letter ACDWYACDWY 0.3 0.1 0..   0.3 0.5 0.  0 0.5 0 0.5 0.2..   0.1 0.2 0.0 0.1..   0.4 0.3...

Constructing a profile Take significant BLAST hits Make an alignment Assign weights to sequences Construct profile ACDWYACDWY 0.3 0.1 0..   0.3 0.5 0.  0 0.5 0 0.5 0.2..   0.1 0.2 0.0 0.1..   0.4 0.3...

PSI BLAST: Constructing the Profile Matrix Figure from: Altschul et al. Nucleic Acids Research 25, 1997

12345Overall A.17.33.17 6/30 =.20 C.17.50 9/30 =.30 G.50.17 7/30 =.23 T.17.33.50.17 8/30 =.27 S1 GCTCC S2 AATCG S3 TACGC S4 GTGTT S5 GTAAA S6 CGTCC 12345Overall A.851.65.85 6/30 =.20 C.57 1.67 9/30 =.30 G2.17.74 7/30 =.23 T.631.221.85.63 8/30 =.27 12345 A-0.230.72-0.23 C-0.81 0.74 G1.11-0.43 T-0.660.290.89-0.66 Normalise by dividing by overall frequencies Convert to log to base of 2 12345 A-0.230.72-0.23 C-0.81 0.74 G1.11-0.43 T-0.660.290.89-0.66 Match GATCA to PSSM Score = 1.11 + 0.72 + 0.89 + 0.74 - 0.23 = 3.23 Find nucleotides at corresponding positions Sum corresponding log odds matrix scores (A) (B) Profile calculation example using frequency normalisation and log conversion profile 12345

PSI BLAST: Determining profile elements more reliably using pseudo-counts the value for a given element of the profile matrix is given by: where the probability of seeing amino acid a i in column j is estimated as: Observed frequency Pseudocount (e.g. database frequency) e.g.  = number of sequences in profile,  =1 Overall alignment frequency (preceding slide)

PSI BLAST: Determining profile elements more reliably using pseudo-counts Pseudo-counts: mix observed a.a. frequencies with prior (e.g. database) frequencies drawback is pulling all frequencies to prior frequencies, which reduces the differences between sequences are useful when multiple alignment contains only few sequences so that there is no statistical sample per column yet with greater numbers of sequences in the MSA, the profile becomes less dependent on priors

PSI-BLAST iteration graphic… Q ACD..YACD..Y Query sequence PSSM Q Query sequence Gapped BLAST search Database hits Gapped BLAST search ACD..YACD..Y PSSM Database hits xxxxxxxxxxxxxxxxx iterate Low-complexity region

DBT hits PSSM Q Discarded sequences Run query sequence against database Run PSSM against database Another PSI-BLAST iteration graphic…

(A)(B) (C)(D) Figure 6

PSI-BLAST entry page Paste your query sequence Switch this off for default run

1 - This portion of each description links to the sequence record for a particular hit. 2 - Score or bit score is a value calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment. Each score links to the corresponding pairwise alignment between query sequence and hit sequence (also referred to as subject sequence). 3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will occur in the database by chance. The smaller the E Value, the more significant the alignment. For example, the first alignment has a very low E value of e -117 meaning that a sequence with a similar score is very unlikely to occur simply by chance. 4 - These links provide the user with direct access from BLAST results to related entries in other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's Molecular Modeling DataBase.

‘ X’ residues denote low-complexity sequence fragments that are ignored

PHI-BLAST (Pattern Hit Initiated) Method to find database sequences based on a given sequence pattern Sequence patterns are found using regular expressions (comes later in this course) in PROSITE format (PROSITE is a database of functional sequence motifs expressed as regular expressions or profiles) Occurrences need not exceed threshold T Extension and alignment are the same as in Gapped (or PSI-) Blast Scoring scheme is different (stricter) due to the importance of pattern occurrence

FASTA First homology searching method (FASTP – Lipman & Pearson, 1985) Until gapped-BLAST and PSI-BLAST appeared (1997), FASTA has been the better method Compares a given query sequence with a library of sequences and calculates for each pair the highest scoring local alignment Speed is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already identified by faster and less sensitive techniques FASTA routine operates in four steps:

FASTA Operates in four steps: 1.Rapid searches for identical words of a user specified length occurring in query and database sequence(s) (Wilbur and Lipman, 1983, 1984). For each target sequence the 10 regions with highest density of ungapped common words are determined. 2.These 10 regions are rescored using Dayhoff PAM-250 residue exchange matrix (Dayhoff et al., 1983) and the best scoring region of the 10 is reported under init1 in the FASTA output. 3.Regions scoring higher than a threshold value T and being sufficiently near each other in the sequence are joined, now allowing gaps. The highest score of these new fragments can be found under initn in the FASTA output. T is set such that only a small fraction of database sequences are retained. These are the only ones that are reported to the user. Until here things are quick!

FASTA Operates in four steps (continued): 4.full dynamic programming alignment (Chao et al., 1992) over the final region which is widened by 32 residues at either side, of which the score is written under opt in the FASTA output. This is slow – O(n 2 )

FASTA output example DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM FACTOR 1).... SCORES Init1: 161 Initn: 161 Opt: 162 z-score: 229.5 E(): 3.4e-06 Smith-Waterman score: 162; 35.1% identity in 57 aa overlap 10 20 30 test.seq MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLE :| :|::| |:::||:|||::|: | YCFI_YEAST CASILLLEALPKKPLMPHQHIHQTLTRRKPNPYDSANIFSRITFSWMSGLMKTGYEKYLV 180 190 200 210 220 230 40 50 60 test.seq LSDIYQIPSVDSADNLSEKLEREWDRE :|:|::| |:::||:|||::|: | YCFI_YEAST EADLYKLPRNFSSEELSQKLEKNWENELKQKSNPSLSWAICRTFGSKMLLAAFFKAIHDV 240 250 260 270 280 290

FASTA (1) Rapid identical word searches: Searching for k-tuples of a certain size within a specified bandwidth along search matrix diagonals. For not-too-distant sequences (> 35% residue identity), little sensitivity is lost while speed is greatly increased. Technique employed is known as hash coding or hashing: a lookup table is constructed for all words in the query sequence, which is then used to compare all encountered words in each database sequence.

HASHING (general) rapid identical word searches a lookup table is constructed for all words in the query sequence, which is then used to compare all encountered words in each database sequence Example of hashing: the telephone book to find persons’ phone numbers (names are ordered) -you do not need to search through all names until you find the person you want -In computer speak: find a function f such that f(name) can be directly assigned to address in computer, where the telephone number is stored

HASHING (cont.).. 0044 20 84453759 ‘Jones, D.A.’ 0044 20 84457643 ‘Mill, J.’ 0044 51 27655423 ‘Anson, F.P.L’ This takes too long……

HASHING (cont.) Name = ‘Jones’ F(‘Jones’) 0044 20 84453759 Hash array For sequences: -name is subword in database sequence -telephone number is sequence position(s) of subword

HASHING (cont.) Name = ‘Jones’ F(‘Jones’).. 0044 20 84453759 ‘Jones, D.A.’ Hash function should avoid clashes: -clashes take more time -but need less memory for hash array Hash array clashes

HASHING (cont.) What do you think about this function? Will there be clashes? Example of hash function: Take position of letter in alphabet (p(a)=1, p(b)=2, p(c)=3,..) F(‘Jones’) = p(J)+p(o)+p(n)+p(e)+p(s) = 10+15+14+5+19=63 So, ‘Jones’ goes to slot 63 in Hash array

HASHING in FASTA Sequence positions in query are hashed Query:ERLFERLAC ……… DB:MERIFERLAC ……… Query hash table: WordPosition ER1, 5 RL2, 6 LF3 FE4 LA7 AC8 ….… You only need to go through the DB sequence once: for each word encountered (ME, ER, RI, IF,..), check the query hash list for the word. If found, you immediately have the query sequence positions of the word. You also know the position you are at in the DB sequence, and so you can fill in the m*n matrix with diagonals (see earlier slide step 1). Algorithmic speed therefore is linear with (DB) sequence length or O(n). Compare this to finding all word match positions without a hash list (complexity is O(m*n)).

Query: ERLFERLAC 123456789 Query hash table: WordHashPosition ER3*20+141, 5 RL14*20+92, 6 LF9*20+43 FE4*20+34 LA9*20+07 AC0*20+18 8 … 1 … 4 … 7 … 3 … 2 … 0 1 … 74 …. 83 …. 180 … 184 … 289 … 5 6 0 0 0 0 0 0 0 1 2 3 4 5 6 7 8 9 DB: MERIFERLAC E=3, R=14 h(‘ER’)=3  20+14 One-let codes: ACDEFGHIKLMNPQRSTVWY 0 5 10 15 Chaining array Hash array

FASTA The k-tuple length (step 1) is user-defined and is usually 1 or 2 for protein sequences (i.e. either the positions of each of the individual 20 amino acids or the positions of each of the 400 possible dipeptides are located). For nucleic acid sequences, the k-tuple is 5-20 (often 11), and should be longer because short k-tuples are much more common due to the 4 letter alphabet of nucleic acids. The larger the k-tuple chosen, the more rapid but less thorough, a database search.

Scoring BLAST alignments Score should optimise the chance to select proper hits (True Positives) Scoring alignments is dependent on –The scoring system used (residue exchange matrix and gap penalty regime) –Characteristics of the sequence database (size, residue composition) The BLAST way of scoring has been adopted by other methods as well; e.g., some implementations of FASTA, etc.

Alignment Bit Score S is the raw alignment score The bit score (‘bits’) B has a standard set of units The bit score B is calculated from the number of gaps and substitutions associated with each aligned sequence. The higher the score, the more significant the alignment and K and are statistical parameters associated with a given scoring system (e.g. BLOSUM62 in Blast) See Altschul and Gish, 1996, for a collection of values for and K over a set of widely used scoring matrices. Because bit scores are normalized with respect to the scoring system, they can be used to compare alignment scores from different searches based on different scoring schemes (a.a. exchange matrices) B = ( S – ln K) / ln 2

What is the statistical significance of an alignment To get a null model: extract local alignments from random sequences P-value –The probability of obtaining the result by pure chance –An alignment giving a lower P-value than a threshold value set by the user is considered a hit.

Normalised sequence similarity The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994): P(x, n) = 1 – e -n  P(S  x), where n is the number of sequences in the database Depending on x and n (fixed)

E-value The concept of P-value applies to single comparisons What with searching in a large database? Task. Having a protein, we want to find similar ones in a large database (4 mln sequences in NR DB). We are interested in P-value < 0.01 Count the number of hits we’ll get by chance alone.

Normalised sequence similarity Statistical significance The E-value is defined as the expected number of non- homologous sequences with score greater than or equal to a score x in a database of n sequences: E(x, n) = n  P(S  x) For example, if E-value = 0.01, then the expected number of random hits with score S  x is 0.01, which means that this E- value is expected by chance only once in 100 independent searches over the database. if the E-value of a hit is 5, then five fortuitous hits with S  x are expected within a single database search, which renders the hit not significant.

A model for database searching score probabilities Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955). Using the EVD, the raw alignment scores are converted to a statistical score (E value) that keeps track of the database amino acid composition and the scoring scheme (a.a. exchange matrix)

Extreme Value Distribution Probability density function for the extreme value distribution resulting from parameter values  = 0 and = 1, [y = 1 – exp(-e -x )], where  is the characteristic value (where the EVD peaks) and is the decay constant. y = 1 – exp(-e - (x-  ) )

Extreme Value Distribution (EVD) You know that an optimal alignment of two sequences is selected out of many suboptimal alignments, and that a database search is also about selecting the best alignment(s). This bodes well with the EDV which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EDV an alignment has to score further away from the expected mean value to become a significant hit. real data EDV approximation

Extreme Value Distribution The probability of a score S to be larger than a given value x can be calculated following the EDV as: E-value: P(S  x) = 1 – exp(-e - (x-  ) ), where  =(ln Kmn)/, and K a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for and K over a set of widely used scoring matrices).

Extreme Value Distribution Using the equation for  (preceding slide), the probability for the raw alignment score S becomes P(S  x) = 1 – exp(-Kmne - x ). In practice, the probability P(S  x) is estimated using the approximation 1 – exp(-e -x )  e -x, which is valid for large values of x. This leads to a simplification of the equation for P(S  x): P(S  x)  e - (x-  ) = Kmne - x. The lower the probability (E value) for a given threshold value x, the more significant the score S.

Normalised sequence similarity Statistical significance Database searching is commonly performed using an E-value in between 0.1 and 0.001. Low E-values decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

Words of Encouragement “There are three kinds of lies: lies, damned lies, and statistics” – Benjamin Disraeli “Statistics in the hands of an engineer are like a lamppost to a drunk – they’re used more for support than illumination” “Then there is the man who drowned crossing a stream with an average depth of six inches.” – W.I.E. Gates

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 8 Database searching (2)

Similar presentations

Presentation on theme: "C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 8 Database searching (2)"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 8 Database searching (2)

Similar presentations

Presentation on theme: "C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Master Course Sequence Alignment Lecture 8 Database searching (2)"— Presentation transcript:

Similar presentations

About project

Feedback