Chapter 7 – Sequence patterns (first part)


1 Chapter 7 – Sequence patterns (first part)
We want a signature for a protein sequence family. The signature should ideally satisfy:
– All sequences in the family should satisfy the signature
– No other sequences should satisfy the signature
We can divide the signatures in use into:
– Probabilistic: a score is calculated between a sequence and the signature (how well the sequence matches the model of the family), e.g. profile, HMM profile, ...
– Deterministic: a sequence either satisfies (matches) the signature or does not, e.g. regular expression, sequence pattern (motif)

2 Regular expressions – from Gusfield 3.6
A regular expression is a method for describing a pattern.
A pattern can be used to describe what is common to a set of sequences/strings.
Example: PROSITE pattern
– [AS]-x(2,4)-A-x(1)-[CA]
– Symbols in [ ] are alternatives
– x(i,j) means between i and j arbitrary symbols (wildcards)
– Several other PROSITE rules
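
The example pattern above can be tried out with an ordinary regular expression engine. The following is a minimal sketch in Python, assuming a hand translation in which the `-` separators are dropped and `x(i,j)` becomes `.{i,j}`; the function name `find_match` and the test sequence are made up for illustration. Note that `.` accepts any character, not only the 20 amino acid codes.

```python
import re

# Hand translation of the PROSITE-style pattern [AS]-x(2,4)-A-x(1)-[CA].
# '-' separators disappear, x becomes '.', and x(i,j) becomes .{i,j}.
PATTERN = re.compile(r"[AS].{2,4}A.{1}[CA]")

def find_match(seq):
    """Return the first substring of seq that matches PATTERN, or None."""
    m = PATTERN.search(seq)
    return m.group(0) if m else None

print(find_match("GGSLLAKCGG"))  # SLLAKC
print(find_match("SAC"))         # None (the wildcard region needs 2 to 4 symbols)
```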

3 Regular expressions cont’
Formal definition of a regular expression (RE):
– Σ is an alphabet (e.g. the 20 amino acids)
– The symbols {*, +, (, ), ε} are not in Σ
A string T matches a regular expression R if R specifies T.

4 Regular expressions cont’
We can represent a regular expression R as a graph G(R) (a non-deterministic finite state machine).
– Make a start node s
– Make an end node t
– Each edge is labeled by a symbol from Σ ∪ {ε}
– A path from s to t represents a string specified by R
– Every string specified by R corresponds to a path from s to t
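
One way to build such a graph is a Thompson-style construction, sketched below in Python. This is an illustration under assumptions, not necessarily the construction used in the course: the nested-tuple input format, the use of the empty string as the ε label, and the names `build_graph` and `strings_up_to` are all choices made for this sketch. The second function enumerates short strings spelled by s-to-t paths, which makes the correspondence between paths and specified strings visible.

```python
from itertools import count

EPS = ""  # label used for epsilon edges in this sketch

def build_graph(regex):
    """Return (s, t, edges) for a regex given as a nested tuple.

    Accepted forms: a single character, ("cat", r1, r2),
    ("or", r1, r2, ...) and ("star", r).
    Each edge is a triple (from_node, label, to_node).
    """
    ids = count()
    edges = []

    def build(r):
        s, t = next(ids), next(ids)
        if isinstance(r, str):
            edges.append((s, r, t))
        elif r[0] == "cat":
            s1, t1 = build(r[1])
            s2, t2 = build(r[2])
            edges.extend([(s, EPS, s1), (t1, EPS, s2), (t2, EPS, t)])
        elif r[0] == "or":
            for sub in r[1:]:
                si, ti = build(sub)
                edges.extend([(s, EPS, si), (ti, EPS, t)])
        elif r[0] == "star":
            s1, t1 = build(r[1])
            # enter the loop, repeat it, leave it, or skip it entirely
            edges.extend([(s, EPS, s1), (t1, EPS, s1), (t1, EPS, t), (s, EPS, t)])
        else:
            raise ValueError(f"unknown regex form: {r!r}")
        return s, t

    s, t = build(regex)
    return s, t, edges

def strings_up_to(s, t, edges, max_len):
    """All strings of length <= max_len spelled by some path from s to t."""
    out = {}
    for a, label, b in edges:
        out.setdefault(a, []).append((label, b))
    seen = {(s, "")}
    stack = [(s, "")]
    found = set()
    while stack:
        node, text = stack.pop()
        if node == t:
            found.add(text)
        for label, nxt in out.get(node, []):
            state = (nxt, text + label)
            if len(state[1]) <= max_len and state not in seen:
                seen.add(state)
                stack.append(state)
    return found

# (A+S)C* in the notation of the previous slide
example = ("cat", ("or", "A", "S"), ("star", "C"))
s, t, edges = build_graph(example)
print(sorted(strings_up_to(s, t, edges, 3)))
```

The `seen` set keeps the enumeration finite even when ε edges form a cycle, because the number of (node, string) states is bounded by the length limit.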

5 Search with regular expression
Search for match(T, R): are there substrings of T matching R?
See first if match(prefix(T), R): are there prefixes of T matching R?
– Make sets N(0), N(1), …, where N(i) is the set of nodes of G(R) reachable from s along paths that spell the first i symbols of T
If T is of length m, and the regular expression R contains n symbols, then it is possible in time O(nm) to decide if T contains a substring that matches R.
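
The node sets N(0), N(1), … can be computed by a direct simulation of the graph. Below is a minimal Python sketch under stated assumptions: the graph is given as an edge list with the empty string as ε label, the small graph `EDGES` for A(C+G)*T is hand-built for illustration, and substring search is obtained by re-adding the start node at every position, which is one simple way to let a match begin anywhere. The name `match_ends` is hypothetical.

```python
EPS = ""  # label used for epsilon edges in this sketch

def closure(nodes, out):
    """Extend a node set with everything reachable through epsilon edges."""
    stack, seen = list(nodes), set(nodes)
    while stack:
        node = stack.pop()
        for label, nxt in out.get(node, []):
            if label == EPS and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def match_ends(text, s, t, edges, anywhere=True):
    """Return all i such that a match ends right after text[:i].

    anywhere=True allows a match to start at any position (substring
    search); anywhere=False requires it to start at position 0 (prefix).
    """
    out = {}
    for a, label, b in edges:
        out.setdefault(a, []).append((label, b))
    current = closure({s}, out)              # N(0)
    ends = [0] if t in current else []
    for i, ch in enumerate(text, start=1):
        step = {b for a in current for label, b in out.get(a, []) if label == ch}
        if anywhere:
            step.add(s)                      # a new match may start here
        current = closure(step, out)         # N(i)
        if t in current:
            ends.append(i)
    return ends

# Hand-built graph for A(C+G)*T: node 0 is s, node 2 is t.
S, T = 0, 2
EDGES = [(0, "A", 1), (1, EPS, 3), (3, "C", 1), (3, "G", 1), (3, "T", 2)]

print(match_ends("GGACGTT", S, T, EDGES))               # [6]
print(match_ends("GACT", S, T, EDGES, anywhere=False))  # []
```

Each of the m steps inspects every edge at most a constant number of times, which is where the O(nm) bound comes from when the graph has O(n) edges.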

6 Prosite language
PROSITE is a database of protein families and domains.
– The standard one-letter codes for the amino acids are used
– The symbol `x' is used for an arbitrary amino acid
– Ambiguities are listed between square brackets `[ ]'. For example: [AGL] stands for A or G or L
– Amino acids that are not accepted at a given position are listed between curly brackets `{ }'. For example: {CH} stands for any amino acid except C and H

7 Prosite language cont’
– `-' is used for separating the elements
– Repetition of an element is specified with a numerical value or a numerical range between parentheses, such that x(3) corresponds to x-x-x and x(1,3) corresponds to x or x-x or x-x-x
– When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or ends with a `>' symbol, respectively
– A period ends the pattern
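
The rules listed on these two slides are enough to translate a pattern into a conventional regular expression. The following Python sketch shows one possible translation; the function name `prosite_to_regex` is hypothetical, only the rules named above are handled, and other PROSITE rules raise an error rather than being guessed at. The negated class `[^CH]` also accepts characters that are not amino acid codes, so the input is assumed to be a clean protein sequence.

```python
import re

# One element: x, a single residue, [..] or {..}, with an optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")       # a period ends the pattern
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        prefix = suffix = ""
        if element.startswith("<"):             # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):               # C-terminal anchor
            suffix, element = "$", element[:-1]
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                          # arbitrary amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # excluded amino acids
        repeat = ""
        if lo is not None:
            repeat = "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

print(prosite_to_regex("[AS]-x(2,4)-A-x(1)-[CA]."))  # [AS].{2,4}A.{1}[CA]
print(prosite_to_regex("<A-{CH}-x(3)-[DE]>"))        # ^A[^CH].{3}[DE]$
```

The resulting string can be passed to `re.search` to scan a sequence for the pattern.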

8 Prosite language
[RK]-x(2,3)-[DE]-x(2,3)-Y
is matched by KLRACEDEEYRE (at the substring RACEDEEY)
D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]
is matched by MADANADDDCTAADWST (at the substring DANADDDCTAADW)
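
Both examples can be checked mechanically. The sketch below uses Python regular expressions that were translated by hand from the two patterns (`x` to `.`, `{..}` to a negated class, `x(i,j)` to `.{i,j}`); the helper name `first_hit` is made up for illustration.

```python
import re

# (hand-translated regex, sequence) pairs for the two patterns on this slide
EXAMPLES = [
    (r"[RK].{2,3}[DE].{2,3}Y", "KLRACEDEEYRE"),
    (r"D.[DNS][^ILVFYW][DENSTG][DNQGHRK][^GP][LIVMC][DENQSTAGC].{2}[DE][LIVMFYW]",
     "MADANADDDCTAADWST"),
]

def first_hit(regex, seq):
    """Return (0-based start, matched substring) of the first match, or None."""
    m = re.search(regex, seq)
    return (m.start(), m.group(0)) if m else None

for regex, seq in EXAMPLES:
    print(seq, first_hit(regex, seq))
# KLRACEDEEYRE (2, 'RACEDEEY')
# MADANADDDCTAADWST (2, 'DANADDDCTAADW')
```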

9 Exact/approximate matching
Shall we make a unique pattern, and allow variations in the search?
Shall we allow variations in the pattern?
Consider deterministic patterns
– They consist of components and wildcard regions
– Restrictions on the number and the types of these define classes of patterns
– A component is of fixed length, but can be
  unique
  ambiguous
– A wildcard region can be
  fixed, of fixed length
  flexible, of varying length
Given a set of sequences, try to discover a pattern of a given class.
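
The vocabulary above can be made concrete by labeling the elements of a PROSITE-style pattern. The Python sketch below is an illustration only: the function name `classify` is hypothetical, each position is labeled separately (so a component of length 3 shows up as three elements), and anything that is neither a wildcard nor a single letter is labeled ambiguous without further validation.

```python
import re

WILDCARD = re.compile(r"x(?:\((\d+)(?:,(\d+))?\))?")

def classify(pattern):
    """Label each '-'-separated element of a PROSITE-style pattern."""
    labels = []
    for element in pattern.rstrip(".").split("-"):
        wildcard = WILDCARD.fullmatch(element)
        if wildcard:
            lo, hi = wildcard.groups()
            flexible = hi is not None and hi != lo   # x(i,j) with i != j
            kind = "flexible wildcard" if flexible else "fixed wildcard"
        elif re.fullmatch(r"[A-Z]", element):
            kind = "unique component"                # exactly one residue allowed
        else:
            kind = "ambiguous component"             # [..] or {..}
        labels.append((element, kind))
    return labels

for element, kind in classify("[AS]-x(2,4)-A-x(1)-[CA]"):
    print(element, kind)
```

A pattern class in the sense of the slide would then be a restriction on these labels, for example "at most two flexible wildcards".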

10 Scoring of patterns
– Score the components by scoring each position, and then add
– Score the wildcard regions
– Sum over all
Use information content
– The information content of a position with value K_i is the reduction in uncertainty from knowing K_i relative to knowing nothing.
– The score of a wildcard region should decrease with increasing flexibility
– The score of x(i_k, j_k) could be $-c\,(j_k - i_k)$ for a positive constant $c$
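
A possible reading of this scoring scheme is sketched below in Python. Several things are assumptions of the sketch and not stated on the slide: uncertainty is measured as Shannon entropy in bits, the background distribution over the 20 amino acids is uniform unless one is supplied, and the penalty constant `c=0.5` is an arbitrary illustrative value. The names `information_content` and `pattern_score` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def information_content(component, background=None):
    """Entropy reduction (bits) from learning that a position lies in component.

    component is a string of allowed amino acids, e.g. "AS" for [AS].
    background maps amino acid -> probability (uniform if omitted).
    """
    if background is None:
        background = {a: 1 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    # uncertainty when nothing is known about the position
    before = -sum(p * math.log2(p) for p in background.values() if p > 0)
    # uncertainty left once the position is restricted to the component
    total = sum(background[a] for a in component)
    after = -sum((background[a] / total) * math.log2(background[a] / total)
                 for a in component if background[a] > 0)
    return before - after

def pattern_score(components, wildcards, c=0.5):
    """Sum of component information minus c * (j_k - i_k) per wildcard region.

    components: list of strings; wildcards: list of (i_k, j_k) ranges.
    """
    return (sum(information_content(k) for k in components)
            - c * sum(j - i for i, j in wildcards))

print(information_content("A"))    # about 4.32 bits (log2 of 20)
print(information_content("AS"))   # about 3.32 bits
# [AS]-x(2,4)-A-x(1)-[CA]
print(pattern_score(["AS", "A", "CA"], [(2, 4), (1, 1)]))
```

With a uniform background the information content of a component reduces to log2(20) minus log2 of the number of allowed amino acids, so a unique component scores highest and a fixed wildcard region carries no penalty.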

11 Generalization/specialization
– Generalization of a pattern means weakening it
– Specialization means strengthening it
– If p’ is a generalization of p, then all sequences that match p also match p’
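
For patterns with the same number of components, one sufficient test for generalization is that every component of p’ allows at least the amino acids of the corresponding component of p, and every wildcard range of p’ contains the corresponding range of p. The Python sketch below implements only this test; the pattern representation (a list of component strings plus a list of wildcard ranges) and the name `generalizes` are assumptions made for illustration, and patterns related in other ways (for example by dropping a component) are not recognized.

```python
def generalizes(p_general, p_specific):
    """True if p_general widens p_specific position by position.

    A pattern is (components, wildcards): components is a list of strings
    of allowed amino acids, wildcards a list of (lo, hi) ranges lying
    between consecutive components.
    """
    comps_g, wild_g = p_general
    comps_s, wild_s = p_specific
    if len(comps_g) != len(comps_s) or len(wild_g) != len(wild_s):
        return False
    # each specific component must be a subset of the general one
    comps_ok = all(set(s) <= set(g) for g, s in zip(comps_g, comps_s))
    # each specific wildcard range must lie inside the general one
    wild_ok = all(gl <= sl and sh <= gh
                  for (gl, gh), (sl, sh) in zip(wild_g, wild_s))
    return comps_ok and wild_ok

p = (["A", "A", "C"], [(2, 3), (1, 1)])          # A-x(2,3)-A-x-C
p_prime = (["AS", "A", "CA"], [(2, 4), (1, 1)])  # [AS]-x(2,4)-A-x(1)-[CA]

print(generalizes(p_prime, p))  # True
print(generalizes(p, p_prime))  # False
```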

12 Pattern discovery

