 # Random Walks and BLAST Marek Kimmel (Statistics, Rice) 713 348 5255


Random Walks and BLAST. Marek Kimmel (Statistics, Rice), kimmel@rice.edu, 713 348 5255

Outline

- Explaining the connection
- Simple RW with absorption
- Moment-generating function method
- Size and duration of excursions
- Renewal equation and general RW
- Significance of alignments in BLAST

Intuitive introduction: alignment as a random walk

    g g a g a c t g t a g a c
    g a a c g c c c t a g c c

Scores: match = 1, mismatch = -1. Solid symbols = ladder points, squares = excursions.
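The walk on this slide can be recomputed with a short sketch. It assumes that the 26 letters in the transcript are two aligned sequences of 13 letters each, and that a ladder point is a position where the cumulative score reaches a new strict minimum (with the start at score 0 counted as the first one). The function names are illustrative and do not come from the slides.

```python
def alignment_walk(seq1, seq2, match=1, mismatch=-1):
    """Cumulative score after each aligned position (the walk starts at 0)."""
    walk, score = [], 0
    for x, y in zip(seq1, seq2):
        score += match if x == y else mismatch
        walk.append(score)
    return walk


def ladder_points_and_excursions(walk):
    """Return (ladder_points, excursion_heights).

    Assumed definitions: a ladder point is a step at which the walk reaches
    a new strict minimum; the height of an excursion is the largest rise
    above its ladder point before the next ladder point is reached.
    """
    ladder = [(0, 0)]            # (position, score); start counts as a ladder point
    heights = []
    low, peak = 0, 0             # current ladder level, highest score since then
    for pos, score in enumerate(walk, start=1):
        if score < low:          # a new ladder point closes the current excursion
            heights.append(peak - low)
            ladder.append((pos, score))
            low = peak = score
        else:
            peak = max(peak, score)
    heights.append(peak - low)   # last, possibly unfinished, excursion
    return ladder, heights


seq1 = "ggagactgtagac"           # assumed split: first 13 letters
seq2 = "gaacgccctagcc"           # assumed split: last 13 letters
walk = alignment_walk(seq1, seq2)
ladder, heights = ladder_points_and_excursions(walk)
print("walk:             ", walk)
print("ladder points:    ", ladder)
print("excursion heights:", heights)
```

Under these assumptions the walk has ladder points at scores 0, -1 and -2, and the last excursion (height 3) is the one a BLAST-style comparison would single out.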

Relation to BLAST The quality of an alignment is reflected by the course of the RW. The distribution of the maximum excursion heights achievable by chance provides the null hypothesis.

Simple RW with absorbing boundaries We consider the case p ≠ q only

Absorption probabilities Consider backward equation 2nd-order, homogeneous, linear difference equation where

Absorption probabilities This provides Constants derived from boundary conditions
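The equations on these two slides did not survive in the transcript. The following is a hedged reconstruction of the standard argument, in notation that is an assumption rather than the slides' own: steps are $+1$ with probability $p$ and $-1$ with probability $q = 1 - p$, the boundaries are $a < b$, $w_h$ is the probability of absorption at $b$ when starting from $h$, and $u_h = 1 - w_h$ is the probability of absorption at $a$.

$$
w_h = p\,w_{h+1} + q\,w_{h-1}, \qquad a < h < b, \qquad w_a = 0,\; w_b = 1 .
$$

Trying $w_h = e^{\theta h}$ gives $p\,e^{\theta} + q\,e^{-\theta} = 1$, with roots $\theta = 0$ and $\theta^* = \log(q/p)$, which are distinct because $p \neq q$. The general solution is therefore $w_h = C_1 + C_2 e^{\theta^* h}$, and the boundary conditions fix the constants:

$$
w_h = \frac{e^{\theta^* h} - e^{\theta^* a}}{e^{\theta^* b} - e^{\theta^* a}},
\qquad
u_h = \frac{e^{\theta^* b} - e^{\theta^* h}}{e^{\theta^* b} - e^{\theta^* a}} .
$$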

Mean number of steps to absorption 2nd-order, inhomogeneous difference equation. Solution = any particular solution of (*) + general solution of the corresponding homogeneous equation. Verify is a particular solution, and therefore
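As a numerical check of this recipe, the sketch below assumes the standard closed forms for a simple walk with steps +1 (probability p) and -1 (probability q = 1 - p): the absorption probability `w_h` at the upper boundary `b`, and the mean number of steps `m_h = (w_h (b - h) + u_h (a - h)) / (p - q)`, which corresponds to the particular solution `h / (q - p)`. These formulas and all names are assumptions, since the slide's own equations are missing. The code verifies that they satisfy the backward equations `w_h = p w_{h+1} + q w_{h-1}` and `m_h = 1 + p m_{h+1} + q m_{h-1}`.

```python
import math


def absorption_prob_upper(h, a, b, p):
    """Probability of absorption at b (rather than a) starting from h; needs p != 0.5."""
    q = 1.0 - p
    theta = math.log(q / p)                     # theta* = log(q/p)
    num = math.exp(theta * h) - math.exp(theta * a)
    den = math.exp(theta * b) - math.exp(theta * a)
    return num / den


def mean_steps(h, a, b, p):
    """Mean number of steps until absorption at a or b, starting from h."""
    q = 1.0 - p
    w = absorption_prob_upper(h, a, b, p)
    u = 1.0 - w
    return (w * (b - h) + u * (a - h)) / (p - q)


# Hypothetical parameters chosen only for the check.
a, b, p = -3, 5, 0.4
q = 1.0 - p
max_err_w = max(
    abs(absorption_prob_upper(h, a, b, p)
        - (p * absorption_prob_upper(h + 1, a, b, p)
           + q * absorption_prob_upper(h - 1, a, b, p)))
    for h in range(a + 1, b)
)
max_err_m = max(
    abs(mean_steps(h, a, b, p)
        - (1.0 + p * mean_steps(h + 1, a, b, p) + q * mean_steps(h - 1, a, b, p)))
    for h in range(a + 1, b)
)
print("largest residual, absorption equation:", max_err_w)
print("largest residual, mean-steps equation:", max_err_m)
```

Both residuals should be at the level of floating-point rounding, and the boundary values are 0 and 1 for the absorption probability and 0 for the mean number of steps.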

Moment-generating function approach

Simple RW: Until absorption

Moment-generating function approach Sticky argument now: At the time of absorption, But the latter is equal to 1

Stopping time (at absorption)
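The formulas for the moment-generating function slides are missing from the transcript. A hedged sketch of the usual argument follows; the notation ($m(\theta)$, $T_N$, $w_h$, $u_h$, $m_h$) is an assumption. For a single step $X$ taking the value $+1$ with probability $p$ and $-1$ with probability $q$,

$$
m(\theta) = E\!\left[e^{\theta X}\right] = p\,e^{\theta} + q\,e^{-\theta},
\qquad
m(\theta^*) = 1 \ \text{ for } \ \theta^* = \log(q/p) \neq 0 .
$$

Let $N$ be the absorption time and $T_N$ the total displacement at absorption, so that $T_N = b - h$ with probability $w_h$ and $T_N = a - h$ with probability $u_h$. Wald's identity states

$$
E\!\left[m(\theta)^{-N} e^{\theta T_N}\right] = 1 .
$$

Setting $\theta = \theta^*$ removes the factor $m(\theta)^{-N}$ and gives

$$
w_h\,e^{\theta^*(b-h)} + u_h\,e^{\theta^*(a-h)} = 1,
$$

which, together with $w_h + u_h = 1$, returns the absorption probabilities. Differentiating Wald's identity with respect to $\theta$ at $\theta = 0$ gives $E[T_N] = m'(0)\,E[N] = (p - q)\,E[N]$, hence

$$
m_h = E[N] = \frac{w_h\,(b-h) + u_h\,(a-h)}{p - q} .
$$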

Asymptotics (p < q) Hypotheses So, define Y = excursion height

Asymptotics of the mean time to absorption A = Mean{# steps before absorption at -1} Since we have
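A small sketch of both asymptotic statements, under assumptions not stated in the transcript: the excursion starts at 0, ends at the first visit to -1, and the walk has steps +1 (probability p) and -1 (probability q = 1 - p) with p < q. With lower boundary a = -1, upper boundary b = y and start h = 0, the absorption formula gives Pr[Y ≥ y] exactly; letting y grow gives the geometric-like tail C exp(-θ* y) with C = 1 - exp(-θ*), and the mean time to absorption at -1 tends to A = 1/(q - p).

```python
import math


def excursion_tail_exact(y, p):
    """Pr[walk started at 0 reaches +y before -1]; requires p < 0.5."""
    q = 1.0 - p
    theta = math.log(q / p)                       # theta* > 0 because p < q
    return (1.0 - math.exp(-theta)) / (math.exp(theta * y) - math.exp(-theta))


def excursion_tail_asymptotic(y, p):
    """Geometric-like approximation C * exp(-theta* y), C = 1 - exp(-theta*)."""
    q = 1.0 - p
    theta = math.log(q / p)
    return (1.0 - math.exp(-theta)) * math.exp(-theta * y)


def mean_ladder_distance(p):
    """A = 1 / (q - p): mean number of steps until the walk first hits -1."""
    return 1.0 / (1.0 - 2.0 * p)


p = 0.25                                          # hypothetical value with p < q
for y in (1, 2, 5, 10):
    exact = excursion_tail_exact(y, p)
    approx = excursion_tail_asymptotic(y, p)
    print(y, exact, approx, exact / approx)
print("A =", mean_ladder_distance(p))
```

The ratio in the last column approaches 1 as y grows, which is what the geometric-like approximation claims.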

Random walks versus alignments

Anatomy of an excursion Pr[Y_i ≥ y] ~ C exp(-θ* y); A = E[distance between ladder points]. A and C are difficult to compute.

P-values for a BLAST comparison Assume comparison of two sequences of length N, with expected ladder points distance A. This gives n=N/A excursions on the average. Also, let us denote From expression (2.134) we have (since Y is geometric-like) Making substitutions we obtain

P-values for a BLAST comparison From previous slide Let us assume a normalized score Substituting into previous inequality, we obtain So, P-value, corresponding to an empirically obtained maximum score, equals

P-values for a BLAST comparison Expected value of the normalized score is equal approximately to Euler’s constant This yields Both and are invariant with respect to multiplication of the score by a constant (why?)

P-values for a BLAST comparison Expected number of excursions of height at least equal to v For an empirically found value of the score, By comparison with a previous formula we see
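The formulas on the P-value slides are missing from the transcript. The sketch below uses the standard Karlin-Altschul forms as an assumption: expected number of excursions of height at least v equal to N K exp(-λ v), normalized score s = λ y - log(N K), and P-value 1 - exp(-exp(-s)). The symbols λ (playing the role of θ*) and K, and all numerical values, are hypothetical.

```python
import math


def expected_excursions(v, lam, K, N):
    """Expected number of chance excursions of height >= v (assumed form N*K*exp(-lam*v))."""
    return N * K * math.exp(-lam * v)


def normalized_score(y_max, lam, K, N):
    """Normalized score s = lam * y_max - log(N * K)."""
    return lam * y_max - math.log(N * K)


def p_value(y_max, lam, K, N):
    """P-value of an observed maximal score: 1 - exp(-exp(-s))."""
    s = normalized_score(y_max, lam, K, N)
    return -math.expm1(-math.exp(-s))             # accurate for tiny P-values


lam, K, N = 0.3, 0.1, 1000                        # hypothetical parameters
y_max = 30
E = expected_excursions(y_max, lam, K, N)
print("expected excursions E' =", E)
print("P-value                =", p_value(y_max, lam, K, N))
print("1 - exp(-E')           =", -math.expm1(-E))

# Multiplying every score by c rescales lam to lam / c, so lam * y is unchanged.
c = 2.0
print("P-value after rescaling:", p_value(c * y_max, lam / c, K, N))
```

The last two comparisons illustrate two remarks from the slides: the P-value equals 1 - exp(-E'), because exp(-s) is exactly E', and rescaling the scores changes nothing because only the product λ y enters.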

From the BLAST course http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html To assess whether a given alignment constitutes evidence for homology, it helps to know how strong an alignment can be expected from chance alone. In this context, "chance" can mean the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve compositional properties or (iii) sequences that are generated randomly based upon a DNA or protein sequence model. Analytic statistical results invariably use the last of these definitions of chance, while empirical results based on simulation and curve-fitting may use any of the definitions.

As demonstrated above, scores of local alignments are covered by a well-developed theory. For global alignments, Monte Carlo experiments can provide rough distributional results for some specific scoring systems and sequence compositions, but these cannot be generalized easily.

- It is possible to express the score of interest in terms of standard deviations from the mean, but it is a mistake to assume that the relevant distribution is normal and convert this Z-value into a P-value; the tail behavior of global alignment scores is unknown.
- The most one can say reliably is that if 100 random alignments have score inferior to the alignment of interest, the P-value in question is likely less than 0.01.

One further pitfall to avoid is exaggerating the significance of a result found among multiple tests.

- When many alignments have been generated, e.g. in a database search, the significance of the best must be discounted accordingly.
- An alignment with P-value 0.0001 in the context of a single trial may be assigned a P-value of only 0.1 if it was selected as the best among 1000 independent trials.
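The arithmetic behind the last remark can be checked directly. For n independent trials, the probability that the best of them reaches a single-trial P-value p is 1 - (1 - p)^n, which is close to n·p when n·p is small. The function name is illustrative.

```python
def best_of_n_p_value(p_single, n_trials):
    """P-value of the best result among n independent trials."""
    return 1.0 - (1.0 - p_single) ** n_trials


p_single, n_trials = 0.0001, 1000
corrected = best_of_n_p_value(p_single, n_trials)
print("corrected P-value:", corrected)            # a little below 0.1
print("upper bound n * p:", n_trials * p_single)
```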

The E-value of equation applies to the comparison of two proteins of lengths m and n. How does one assess the significance of an alignment that arises from the comparison of a protein of length m to a database containing many different proteins, of varying lengths? One view is that all proteins in the database are a priori equally likely to be related to the query. This implies that a low E-value for an alignment involving a short database sequence should carry the same weight as a low E-value for an alignment involving a long database sequence. To calculate a "database search" E-value, one simply multiplies the pairwise-comparison E-value by the number of sequences in the database.

An alternative view is that a query is a priori more likely to be related to a long than to a short sequence, because long sequences are often composed of multiple distinct domains. If we assume the a priori chance of relatedness is proportional to sequence length, then the pairwise E-value involving a database sequence of length n should be multiplied by N/n, where N is the total length of the database in residues. Examining equation this can be accomplished simply by treating the database as a single long sequence of length N. The BLAST programs take this approach to calculating database E-value. Notice that for DNA sequence comparisons, the length of database records is largely arbitrary, and therefore this is the only really tenable method for estimating statistical significance.
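The equivalence described here can be illustrated with a short sketch. It assumes the pairwise E-value has the Karlin-Altschul form E = K m n exp(-λ S), which is the form the referenced tutorial uses; the parameter values below are made up for illustration.

```python
import math


def pairwise_e_value(score, m, n, lam, K):
    """Assumed pairwise form: E = K * m * n * exp(-lam * score)."""
    return K * m * n * math.exp(-lam * score)


def database_e_value(score, m, n, total_length, lam, K):
    """Length-proportional view: scale the pairwise E-value by N / n."""
    return pairwise_e_value(score, m, n, lam, K) * total_length / n


# Hypothetical query length, subject length, database size and parameters.
m, n, N = 250, 400, 50_000_000
lam, K, score = 0.27, 0.04, 90
scaled = database_e_value(score, m, n, N, lam, K)
single = pairwise_e_value(score, m, N, lam, K)    # database as one long sequence
print("pairwise E scaled by N/n:       ", scaled)
print("database as one sequence of N:  ", single)
```

Because the pairwise E-value is linear in n, the two numbers agree up to rounding, so the length n of the individual database record drops out.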

Comparison of two unaligned sequences Until now, we considered a fixed ungapped alignment in the comparison of two sequences of length N each. Now, we are given two sequences of lengths N_1 and N_2 without any specific alignment (a total of N_1 + N_2 - 1 ungapped alignments). The theory is advanced, so we give only highlights of the results. Many conclusions of the previous sections carry over with N replaced by N_1 N_2.
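A small counting sketch may help to see where both numbers come from. Each relative offset of the two sequences defines one ungapped alignment (one diagonal of the comparison matrix); there are N_1 + N_2 - 1 offsets, and the overlaps together contain every pair of positions exactly once, which gives N_1 N_2 compared positions in total. The function name and the example lengths are illustrative.

```python
def diagonal_lengths(n1, n2):
    """Overlap length for every relative offset d = i - j of two sequences."""
    lengths = []
    for d in range(-(n2 - 1), n1):
        i_start = max(0, d)               # first position of sequence 1 in the overlap
        i_end = min(n1, n2 + d)           # one past the last overlapping position
        lengths.append(i_end - i_start)
    return lengths


n1, n2 = 5, 3                              # hypothetical sequence lengths
lengths = diagonal_lengths(n1, n2)
print("overlap lengths:        ", lengths)
print("number of alignments:   ", len(lengths))
print("total compared positions:", sum(lengths))
```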

Scores The basic score is re-defined now Mean number of (independent) ladder points in all alignments Since the heights of excursions are geometric-like rv’s (n of them),

Scores From the previous slide Define standardized score Expected count of (independent) excursions of height at least y Similar expressions as before for expected score and P-value

Karlin-Altschul sum statistic Idea: Add information from the r-1 “next to the highest” excursions It was proved that The particular statistics used
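The slide's formulas are missing. As a hedged illustration, the sketch below uses the large-x tail approximation usually quoted for the Karlin-Altschul sum statistic, Pr[T_r ≥ x] ≈ exp(-x) x^(r-1) / (r! (r-1)!), where T_r is taken to be the sum of the r highest normalized scores. Both the definition of T_r and the formula are assumptions here and should be checked against the original slides or text.

```python
import math


def sum_statistic_tail(x, r):
    """Assumed large-x approximation: exp(-x) * x**(r-1) / (r! * (r-1)!)."""
    return math.exp(-x) * x ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))


x = 12.0                                   # hypothetical value of the sum statistic
for r in (1, 2, 3):
    print(r, sum_statistic_tail(x, r))

# For r = 1 the approximation reduces to exp(-x), the large-x limit of the
# single-excursion P-value 1 - exp(-exp(-x)).
print(sum_statistic_tail(x, 1), -math.expm1(-math.exp(-x)))
```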

Choice of r and multiple testing Usually, all sum tests are performed for all “available” r. The best P-value is accepted, following heuristic corrections (see Section 9.3.4).

Comparison of a query sequence against a database Use Poisson distribution to obtain the following probability Since database is of length D, then expected # HSPs with scores ≥ v For all other Analyze Example 9.5.2.
