Pattern and string matching tools Biology 162 Computational Genetics Todd Vision 9 Sep 2004.


1 Pattern and string matching tools Biology 162 Computational Genetics Todd Vision 9 Sep 2004

2 Some more pattern and string matching tools
Simple signatures
  –Logos
  –Position-specific Scoring Matrices
  –PSI-BLAST
Regular expressions
Suffix trees


4 Sequence logos
Entropy of column j is denoted H_j
Information content of column j is denoted I_j
How to draw a logo
  –Height of column j is given by I_j
  –Height of each symbol i in column j = f_ij × I_j, where f_ij is the frequency of symbol i in column j

5 Information content
Information (uncertainty) is expressed in bits
  –There is a natural relationship to log base 2
Imagine 64 shells, under one of which is a ball.
  –6 yes/no questions, each halving the remaining candidates, are required to find the ball
  –In this case, maximal uncertainty is log2 64 = 6 bits
In the case of 20 amino acids, maximal uncertainty is log2 20 = 4.32 bits.
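
The formulas for H_j and I_j did not survive in this transcript. The sketch below assumes the usual Shannon definitions, H_j = -Σ_i f_ij log2 f_ij and I_j = log2(alphabet size) - H_j, with no small-sample correction. The function name `column_information` is illustrative, not from the slides.

```python
import math
from collections import Counter

def column_information(column, alphabet_size):
    """Return (H_j, I_j, symbol heights) for one alignment column.

    Assumes H_j is Shannon entropy in bits and I_j = log2(alphabet_size) - H_j.
    """
    counts = Counter(column)
    total = len(column)
    # f_ij: observed frequency of each symbol in this column
    freqs = {sym: n / total for sym, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    info = math.log2(alphabet_size) - entropy
    # Logo height of each symbol is f_ij * I_j
    heights = {sym: f * info for sym, f in freqs.items()}
    return entropy, info, heights

# A DNA column split evenly between A and C carries 1 bit of information.
entropy, info, heights = column_information("AACC", alphabet_size=4)
print(entropy, info, heights)

# The shell example: 64 equally likely positions correspond to 6 bits.
print(math.log2(64), round(math.log2(20), 2))
```

A fully conserved column has zero entropy, so its logo column reaches the maximum height of log2(alphabet size) bits.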

6 Position-Specific Scoring Matrix
Constructed from conserved columns of an MSA
Log odds scores for each residue in each column, based on
  –Frequency of residue within column
  –Background frequency of residues
Takes advantage of the fact that columns differ in
  –Composition
  –Levels of conservation

7 Position Specific Scoring Matrix

pos con    A   R   N   D   C  …     A   R   N   D   C  …    Inf   Pseu
  1  M    -1  -3  -3  -4  -1  …     0   0   0   0   0  …   0.50   0.16
  2  W    -3  -3  -4  -5  -3  …     0   0   0   0   0  …   2.32   0.26
  3  I    -1  -3  -2  -3   7  …     0   0   0  36   0  …   0.71   0.26
  4  L    -2  -3  -2  -3  -3  …     0   0   0   0   0  …   0.47   0.35
  5  A     4  -2  -2  -2  -2  …    56   0   0   0   0  …   0.52   0.35

PSI-BLAST PSSM for DSCAM

8 Pseudocounts
If a residue is never seen in a particular column of an MSA
  –What is the probability of ever seeing it there?
  –Not really zero…
Pseudocounts are added to actual counts to account for uncertainty in column frequencies
Many methods
  –Laplace’s Rule
     Add one to every count
     Pseudocounts grow less important as sample size gets large
  –Methods related to Bayesian priors (we will see these later)

9 Calculating scores in a PSSM
S_ij is the score for residue i at position j
x_ij is the position-specific count of residue i
f_i is the background frequency of residue i
b_ij are pseudocounts
N is the number of sequences in the alignment
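
The scoring formula itself is missing from this transcript. One common log-odds form that uses exactly the symbols defined above is S_ij = log2( ((x_ij + b_ij) / (N + Σ_i b_ij)) / f_i ). The sketch below assumes that form, with Laplace pseudocounts (b_ij = 1) as the default. It may not be the exact expression on the original slide, and the name `pssm_column` is illustrative.

```python
import math

def pssm_column(column, background, pseudocount=1.0):
    """Log-odds scores (bits) for one alignment column.

    Assumed form: S_ij = log2(((x_ij + b_ij) / (N + B)) / f_i),
    where B is the sum of pseudocounts over all residues.
    """
    n_seqs = len(column)                          # N
    total_pseudo = pseudocount * len(background)  # B = sum of b_ij
    scores = {}
    for residue, f_i in background.items():
        x_ij = column.count(residue)              # position-specific count
        p_ij = (x_ij + pseudocount) / (n_seqs + total_pseudo)
        scores[residue] = math.log2(p_ij / f_i)
    return scores

# Toy DNA column with a uniform background (both are made-up inputs).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(pssm_column("AAAC", background))
```

Because of the pseudocounts, residues never observed in the column (G and T here) receive a finite negative score rather than minus infinity.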

10 PSI-BLAST
Can identify more distant homologs than possible via pairwise BLAST
Iterative BLAST
  –After 1st iteration, multiple alignment is computed for query and top matches
  –PSSM generated from alignment
  –PSSM used for subsequent iterations
  –PSSM refined each iteration

11 PSI-BLAST
Once high-scoring words are generated from PSSM, algorithm proceeds as before
  –Still very fast
  –λ and K (the statistical parameters of the scoring system) must be recalculated for each iteration

12 Regular Expressions (regex)
Can be thought of as a non-probabilistic rule for generating (or matching) a pattern
Used for
  –DNA/Protein signatures (e.g. Prosite)
  –Text parsing (e.g. in Perl)

13 Prosite regexes
ID   CBD_FUNGAL; PATTERN.
AC   PS00562;
DT   DEC-1991 (CREATED); NOV-1997 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Cellulose-binding domain, fungal type.
PA   C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C
In Perl regex syntax:
  CGG\w{4,7}G\w{3}C\w{5}C\w{3,5}[NHG]\w[FYWM]\w{2}QC
In words: C followed by G followed by G, followed by any 4 to 7 letters, followed by G, followed by any 3 letters, followed by C, followed by any 5 letters, followed by C, followed by any 3 to 5 letters, followed by one of N, H or G, followed by any letter, followed by one of F, Y, W or M, followed by any two letters, followed by Q followed by C
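
The translation from the PA line to a regex is mechanical, and can be sketched as below. The helper `prosite_to_regex` is hypothetical and handles only the Prosite elements used in this pattern (single residues, `[...]` classes, `x`, and `(n)` or `(n,m)` repeats). It uses Python's `re` module in place of Perl, and it does not cover other Prosite syntax such as exclusions or anchors.

```python
import re

# One Prosite element: a residue, a [class], or x, with an optional repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple Prosite pattern into regex syntax."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported Prosite element: " + element)
        residue, low, high = m.groups()
        core = r"\w" if residue == "x" else residue  # x = any letter
        if low is None:
            quantifier = ""
        elif high is None:
            quantifier = "{" + low + "}"
        else:
            quantifier = "{" + low + "," + high + "}"
        parts.append(core + quantifier)
    return "".join(parts)

pa = "C-G-G-x(4,7)-G-x(3)-C-x(5)-C-x(3,5)-[NHG]-x-[FYWM]-x(2)-Q-C"
regex = prosite_to_regex(pa)
print(regex)

# A made-up sequence built to satisfy the pattern, embedded in flanking residues.
toy = "MK" + "CGG" + "AAAA" + "G" + "AAA" + "C" + "AAAAA" + "C" + "AAA" + "N" + "A" + "F" + "AA" + "QC" + "LL"
print(re.search(regex, toy) is not None)
```

As on the slide, `\w` is used for `x`. It also matches digits and underscores, so a stricter translation might use a class of the 20 amino acid letters instead.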

14 Perl regex metacharacters
[ ] - character class (e.g. [abc] = a, b or c)
{min,max}, {exactly} - quantifiers
* - repetition, zero or more
+ - repetition, one or more
? - optional, zero or one
. - wildcard (any character)
( ) - capture or delimit substrings
| - alternation (e.g. (a|b) = either a or b)

15 Regular expressions

Pattern      Matches
a[bc]d       abd, acd
ab{2,5}c     abbc, abbbc, … abbbbbc
ab*c         ac, abc, abbc, …
ab+c         abc, abbc, …
ab?c         ac, abc
a(bc|de)     abc, ade
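
These patterns can be checked directly. The sketch below uses Python's `re` module, whose syntax agrees with Perl for these constructs. `re.fullmatch` is used so that the whole string must match, and the helper name `matches` is illustrative.

```python
import re

def matches(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

examples = {
    "a[bc]d": ["abd", "acd"],
    "ab{2,5}c": ["abbc", "abbbbbc"],
    "ab*c": ["ac", "abc", "abbc"],
    "ab+c": ["abc", "abbc"],
    "ab?c": ["ac", "abc"],
    "a(bc|de)": ["abc", "ade"],
}

for pattern, strings in examples.items():
    for text in strings:
        print(pattern, text, matches(pattern, text))

# Boundary cases: quantifier limits are strict.
print(matches("ab{2,5}c", "abc"))   # only one b, fewer than the minimum of 2
print(matches("ab+c", "ac"))        # + needs at least one b
print(matches("ab?c", "abbc"))      # ? allows at most one b
```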

16 Regular expressions: limitations
Non-probabilistic: all matches match equally well
  –Hidden Markov models improve upon this
Cannot model dependencies among different positions
  –Neither can HMMs
  –For RNA matches, where dependencies matter, we need to allow more complex rules

17 Chomsky hierarchy of transformational grammars: a preview
General theory for modelling strings of symbols, used in linguistics
  –Regular grammars
  –Context-free grammars
  –Context-sensitive grammars
  –Unrestricted grammars
Regular grammars (like regexes) are easy to parse, but are structurally limited
We will see context sensitive grammars for modelling RNA sequences

18 Suffix Trees
Data structure used for fast matching of sequence patterns
Helps to explain how BLAST can find word matches so fast
Commonly used for
  –Exact matching
  –Identifying repeated sequences

19 Suffix Trees
Rooted, directed tree for string S
|S| = m; the tree has m leaves, labeled 1..m
Edges are labelled with substrings of S
An internal node has at most one edge for each symbol in the alphabet
Concatenation of edge labels on the path from the root to leaf i equals suffix S[i..m]

20 Suffix Trees: An Example
S = ‘gatgac’

root
  tgac -> leaf 3
  c -> leaf 6
  a
    c -> leaf 5
    tgac -> leaf 2
  ga
    c -> leaf 4
    tgac -> leaf 1

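21 Least common ancestor
The LCA of two leaves corresponds to the longest shared prefix of their suffixes (e.g. the path labeled ‘ga’ for leaves 1 and 4)
After preprocessing the tree, the LCA of any two leaves can be retrieved in constant time
(Figure: suffix tree for S = ‘gatgac’, repeated from slide 20)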

22 If suffix trees are the answer, what is the question?
Rapid word matching
Find all occurrences of ‘ga’ in S = ‘gatgac’
(Figure: suffix tree for S = ‘gatgac’, repeated from slide 20)
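
The word-matching query can be sketched with an uncompressed suffix trie, which is a simplification of a true suffix tree: it has one character per edge, takes O(m^2) time and space to build, and stores start positions at every node instead of only at the leaves. The names `build_suffix_trie` and `find_occurrences` are illustrative.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a trie, one character per edge.

    Each node records the 1-based start positions of the suffixes
    that pass through it.
    """
    root = {"children": {}, "starts": []}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root

def find_occurrences(trie, word):
    """Walk down the trie along word; the node reached lists every occurrence."""
    node = trie
    for ch in word:
        node = node["children"].get(ch)
        if node is None:
            return []          # word does not occur in s
    return sorted(node["starts"])

trie = build_suffix_trie("gatgac")
print(find_occurrences(trie, "ga"))
```

The lookup takes time proportional to the length of the word, independent of the length of S. A compressed suffix tree gives the same lookup behavior with linear space.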

23 If suffix trees are the answer, what is the question?
Longest common substring problem
Find the starting positions, length and identity of the longest substring that occurs in both S_1 and S_2
S_1 = ‘gatgac’
S_2 = ‘gatcac’
(Figure: suffix tree built from S_1 and S_2)
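
A simplified version of this query is sketched below. Instead of a single tree over both strings, it builds an uncompressed suffix trie of S_1 (quadratic time and space) and walks each suffix of S_2 through it, which is a swapped-in technique chosen for brevity. The function name `longest_common_substring` is illustrative.

```python
def longest_common_substring(s1, s2):
    """Return (substring, 1-based start in s1, 1-based start in s2).

    Returns ("", None, None) if the strings share no characters.
    """
    # Uncompressed suffix trie of s1 as nested dicts, one character per edge.
    trie = {}
    for start in range(len(s1)):
        node = trie
        for ch in s1[start:]:
            node = node.setdefault(ch, {})

    best_length, best_start2 = 0, None
    for start in range(len(s2)):
        # Follow this suffix of s2 down the trie as far as it matches.
        node, length = trie, 0
        for ch in s2[start:]:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        if length > best_length:
            best_length, best_start2 = length, start

    if best_length == 0:
        return "", None, None
    substring = s2[best_start2:best_start2 + best_length]
    return substring, s1.find(substring) + 1, best_start2 + 1

print(longest_common_substring("gatgac", "gatcac"))
```

For the two strings on the slide, the longest shared substring is ‘gat’, starting at position 1 in both.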

24 If suffix trees are the answer, what is the question?
Find all direct palindromes (a substring concatenated with its reverse) in S = ‘agattagct’
Observation
  –Let S^r = ‘tcgattaga’ be the reverse of S, and let m = |S|
  –If a palindrome is centered between q and q+1 of S, then it is also centered between m-q and m-q+1 of S^r.
Solution
  –Construct a joint suffix tree for S and S^r, and find the least common ancestor for all pairs (q+1, m-q+1), i.e. suffix q+1 of S and suffix m-q+1 of S^r
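
The observation can be tried out without building a tree. The sketch below compares suffix q+1 of S with suffix m-q+1 of S^r directly; the length of their common prefix is the half-length of the palindrome centered between q and q+1. This naive comparison stands in for the constant-time LCA query, so it runs in quadratic rather than linear time. The function names are illustrative.

```python
def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def even_palindromes(s):
    """Map each centre q (1-based, between q and q+1) to its maximal palindrome."""
    m = len(s)
    s_rev = s[::-1]
    found = {}
    for q in range(1, m):
        # s[q:] is suffix q+1 of S; s_rev[m - q:] is suffix m-q+1 of S^r.
        radius = common_prefix_length(s[q:], s_rev[m - q:])
        if radius > 0:
            found[q] = s[q - radius:q + radius]
    return found

print(even_palindromes("agattagct"))
```

For S = ‘agattagct’ the only such palindrome is ‘gattag’, centered between positions 4 and 5.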

25 Myriad uses for suffix trees
Direct and inverted repeats
  –Microsatellites
  –Transposons
Inverted palindromes
  –Restriction enzyme recognition sites
Imperfect matches
Algorithmic efficiency
  –Many efficient algorithms for traversing suffix trees
  –The trees themselves can be constructed in O(m) time


27 Reading assignment (for Tuesday and Thursday)
Durbin et al. (1998), Biological Sequence Analysis, pp. 46-79.
  –Markov chains
  –Hidden Markov models

