Bioinformatics. Ayesha M. Khan, 9th April, 2012. What’s in a secondary database?

1 Bioinformatics. Ayesha M. Khan, 9th April, 2012

2 What’s in a secondary database?
Multiple alignments contain conserved motifs that reflect shared structural or functional characteristics of the constituent sequences. Such conserved motifs may be used to build characteristic signatures that aid family and/or functional diagnoses of newly determined structures. Lec-10

3 Conservation patterns: functional cues
Conserved positions often mark functional sites, e.g. the amino acids that are consistently found at enzyme active sites, or the nucleotides that are associated with transcription factor binding sites.
[Figure: ATP/GTP binding proteins]

4 Conservation patterns: functional cues (contd.)
[Figure: GAL4 binding sequence]

5 So what exactly is a pattern?
A pattern describes a motif using a qualitative consensus sequence.
◦ Uses a regular expression (reducing the sequence data to a consensus)
◦ Mismatches are not tolerated
◦ E.g., [GA]-[IMFAT]-H-[LIVF]-H-{S}-x-[GP]-[SDG]-x-[STAGDE]
◦ Each position in the pattern is separated by a hyphen
◦ x can match any residue
◦ [ ] indicate ambiguous positions in the pattern
◦ { } indicate residues that are not allowed at this position
◦ ( ) surround repeat counts, e.g. A(3) means AAA
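The syntax rules above map almost one-to-one onto ordinary regular expressions. The following is a minimal sketch, assuming only the elements listed on the slide (hyphen-separated positions, x, [ ], { }, and a repeat count in parentheses); the function name `prosite_to_regex` and the test peptide are hypothetical, and other pattern features such as anchors are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a slide-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            token = "."                      # x matches any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"  # { } lists residues that are not allowed
        else:
            token = core                     # [ ] sets and single residues are valid regex as-is
        if repeat:
            token += "{" + repeat + "}"      # A(3) becomes A{3}
        parts.append(token)
    return "".join(parts)

regex = prosite_to_regex("[GA]-[IMFAT]-H-[LIVF]-H-{S}-x-[GP]-[SDG]-x-[STAGDE]")
# Hypothetical peptide, made up for illustration only.
match = re.search(regex, "MKGIHLHAAGSASKK")
print(regex)
print(match.start() if match else None)
```

Because mismatches are not tolerated, changing a single residue at a constrained position (for example placing S at the {S} position) makes the match fail outright rather than lowering a score.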

6 “Rules”
“Rules” are patterns that are much shorter, generic, and not associated with specific protein families. They may denote sugar attachment sites, phosphorylation or hydroxylation sites, etc.
◦ N-glycosylation site: N-{P}-[ST]-{P}
◦ Protein kinase C phosphorylation site: [ST]-x-[RK]
Realistically, short motifs can only serve as a guide to whether a certain type of functional site might exist in a sequence; any such site must be verified by experiment.
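To see why such short rules are only a guide, it helps to scan a sequence and notice how readily they match. The sketch below writes the two rules from the slide directly as regular expressions; the helper `find_sites` and the peptide are hypothetical, and a lookahead is used so that overlapping sites are also reported.

```python
import re

# The two rules from the slide as regular expressions.
# Wrapping each in a lookahead (?=(...)) lets overlapping matches be found.
RULES = {
    "N-glycosylation": r"(?=(N[^P][ST][^P]))",
    "PKC phosphorylation": r"(?=([ST].[RK]))",
}

def find_sites(sequence):
    """Return (rule name, 1-based start position, matched residues) tuples."""
    hits = []
    for name, regex in RULES.items():
        for m in re.finditer(regex, sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])

# Hypothetical peptide, made up for illustration only.
for hit in find_sites("MNGSAKSTRNPSL"):
    print(hit)
```

Even this 13-residue made-up peptide yields three candidate sites, which illustrates why a match on its own is weak evidence.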

7 In cases of extreme sequence divergence
The following approaches can be used to identify distantly related members of a family of protein (or DNA) sequences:
◦ Position-specific scoring matrix (PSSM)
◦ Profile Hidden Markov Model
These methods provide a statistical framework in which the probability of residues or nucleotides at specific sequence positions is tested. Thus, information on all the members of a multiple alignment is retained.

8 Sequence Profiles
A sequence profile is a position-specific scoring matrix (PSSM) that gives a quantitative description of a sequence motif. Unlike deterministic patterns, profiles assign a score to a query sequence and are widely used for database searching. A simple PSSM has as many columns as there are positions in the alignment, and either 4 rows (one for each DNA nucleotide) or 20 rows (one for each amino acid).

9 PSSM
◦ $M_{kj}$: score for the jth nucleotide at position k
◦ $p_{kj}$: probability of nucleotide j at position k
◦ $p_j$: “background” probability of nucleotide j

10 Computing a PSSM
◦ $C_{kj}$: number of nucleotides of type j at position k
◦ $Z$: total number of aligned sequences
◦ $p_j$: background probability of nucleotide j
◦ $p_{kj}$: probability of nucleotide j at position k
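The formulas themselves did not survive in this transcript, so the following is only a sketch built from the symbols defined above. It assumes one commonly used form: a pseudocount estimate $p_{kj} = (C_{kj} + p_j)/(Z + 1)$ and a log-odds score $M_{kj} = \ln(p_{kj}/p_j)$. Both formulas, the uniform background, the natural logarithm, and the toy alignment are assumptions for illustration and may differ from what the slides showed.

```python
import math

def compute_pssm(aligned, background=None, alphabet="ACGT"):
    """Build a PSSM as a list (one entry per position k) of {nucleotide j: M_kj}."""
    z = len(aligned)                      # Z: total number of aligned sequences
    if background is None:
        # Assumed uniform background p_j.
        background = {j: 1 / len(alphabet) for j in alphabet}
    pssm = []
    for k in range(len(aligned[0])):
        column = [seq[k] for seq in aligned]
        scores = {}
        for j in alphabet:
            c_kj = column.count(j)                    # C_kj: count of j at position k
            p_kj = (c_kj + background[j]) / (z + 1)   # assumed pseudocount form
            scores[j] = math.log(p_kj / background[j])  # assumed log-odds score
        pssm.append(scores)
    return pssm

def score(pssm, query):
    """Sum the position-specific scores of a query of the same length."""
    return sum(pssm[k][j] for k, j in enumerate(query))

# Toy alignment, made up for illustration only.
alignment = ["ACGT", "ACGA", "ACGT", "TCGT"]
pssm = compute_pssm(alignment)
print(round(score(pssm, "ACGT"), 3), round(score(pssm, "TTTT"), 3))
```

The pseudocount keeps every $p_{kj}$ above zero, so a nucleotide never seen in a column receives a strongly negative but finite score instead of an undefined logarithm.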

11 Computing a PSSM (contd.)

12 Computing a PSSM (contd.)

13 Computing a PSSM (contd.)

14 PSI-BLAST (Position-Specific Iterated BLAST)
Many proteins in a database are too distantly related to a query to be detected using standard BLAST. In many other cases matches are detected, but they are so distant that the inference of homology is unclear. Enter the more sensitive PSI-BLAST.

15 PSI-BLAST scheme
1. BLAST the input sequence to find significant alignments
2. Construct an MSA
3. Construct a PSSM
4. BLAST the PSSM profile to search for new hits

16 PSI-BLAST (contd.)
The search process is continued iteratively, typically about 5 times, and at each step a new PSSM is built. The search can be stopped at any point, typically when few new results are returned or when no new sensible results are found.

17 PSI-BLAST errors
Unrelated hits: how to avoid them?
◦ Split a multi-domain query sequence into its domains
◦ Inspect each PSI-BLAST iteration, removing suspicious hits
◦ Lower the Expect (E-value) threshold

18 Markov Model
Markov chain
◦ A Markov chain describes a series of events or states
◦ There is a certain probability of moving from one state to the next; this is known as the transition probability
◦ Sequences can also be seen as Markov chains, where the occurrence of a given nucleotide may depend on the preceding nucleotide
◦ In a Markov model all states are observable

19 Hidden Markov model
A Markov model may consist of observable states and unobservable or “hidden” states. The hidden states also affect the outcome of the observed states. In a sequence alignment, a gap is an unobserved state that influences the probability of the next nucleotide. In DNA, there are four symbols or states: G, A, T and C (20 in proteins). The probability value associated with each symbol is the emission probability.

20 Markov model: example
Emission probabilities:
◦ State 1: A 0.80, C 0.02, G 0.10, T 0.08
◦ State 2: A 0.11, C 0.08, G 0.32, T 0.49
Transition probability (state 1 to state 2): 0.40
This particular Markov model generates the sequence AG with probability 0.80 × 0.40 × 0.32 = 0.102. The model also shows that AT is the most probable sequence (0.80 × 0.40 × 0.49 = 0.157), since A has the highest emission probability in state 1 and T in state 2.
Where do these numbers come from? A Markov model has to be “trained” with examples.
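The arithmetic on this slide is easy to check by enumerating all 16 two-letter sequences. The sketch below uses the emission and transition values given on the slide; the variable and function names are hypothetical.

```python
# Emission probabilities of the two states and the transition between them,
# as given on the slide.
STATE1 = {"A": 0.80, "C": 0.02, "G": 0.10, "T": 0.08}
STATE2 = {"A": 0.11, "C": 0.08, "G": 0.32, "T": 0.49}
TRANSITION = 0.40  # state 1 -> state 2

def dinucleotide_probability(seq):
    """Probability that the model emits the two-letter sequence seq."""
    first, second = seq
    return STATE1[first] * TRANSITION * STATE2[second]

all_pairs = {a + b: dinucleotide_probability(a + b) for a in "ACGT" for b in "ACGT"}
best = max(all_pairs, key=all_pairs.get)

print(round(dinucleotide_probability("AG"), 3))  # 0.102
print(best, round(all_pairs[best], 3))           # AT 0.157
```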

21 Hidden Markov model (contd.)
The frequencies of occurrence of nucleotides in a multiple sequence alignment are used to calculate the emission and transition probabilities of each symbol at each state. The trained HMM is then used to test how well a new sequence fits the model.
A state can be one of:
◦ Match/mismatch (observable; a mismatch is a low-probability match)
◦ Insertion (hidden)
◦ Deletion (hidden)

22 Markov models (contd.)
Example: a general Markov chain modeling DNA. Note that any sequence can be traced through the model by passing from one state to the next via transitions.
A Markov chain is defined by:
◦ A finite set of states, $S_1, S_2, S_3, \ldots, S_N$
◦ A set of transition probabilities, $a_{ij}$
◦ An initial state probability distribution (or emission probability), $\pi_i$

23 Markov chain example
x = {a, b}
We observe the following sequence: abaaababbaa
[Transition probabilities and initial state probabilities: values not preserved in this transcript]
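One way to obtain such probabilities from the observed sequence is simple counting. The sketch below assumes maximum-likelihood estimates: each transition probability is the fraction of times one symbol follows another, and the initial state probabilities are taken as the overall symbol frequencies. The second choice in particular is an assumption, and the values on the original slide may have been derived differently. The function name is hypothetical.

```python
from collections import Counter

def estimate_markov_chain(observed):
    """Estimate transition and initial probabilities by counting (assumed MLE)."""
    states = sorted(set(observed))
    pair_counts = Counter(zip(observed, observed[1:]))  # counts of s followed by t
    from_counts = Counter(observed[:-1])                # times s has a successor
    # Note: a state seen only as the final symbol would have no successor
    # counts and would need special handling; that does not occur here.
    transitions = {
        s: {t: pair_counts[(s, t)] / from_counts[s] for t in states}
        for s in states
    }
    symbol_counts = Counter(observed)
    initial = {s: symbol_counts[s] / len(observed) for s in states}
    return transitions, initial

transitions, initial = estimate_markov_chain("abaaababbaa")
print(transitions)
print(initial)
```

For this sequence, a is followed by a three times and by b three times, while b is followed by a three times and by b once, giving P(a→a) = P(a→b) = 0.5, P(b→a) = 0.75 and P(b→b) = 0.25.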

24 Markov models (contd.)
Typical questions we can ask with Markov chains are:
◦ What is the probability of being in a particular state at a particular time? (Here “time” can be read as position in our query sequence.)
◦ What is the probability of seeing a particular sequence of states? (I.e., the score for a particular query sequence given the model.)
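Both questions reduce to a few lines of arithmetic once the chain's parameters are known. The sketch below is illustrative only: the two-state parameters are made-up values, not taken from the slides, and the function names are hypothetical.

```python
def sequence_probability(states, initial, transitions):
    """P(sequence of states) = initial probability times each transition in turn."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= transitions[prev][nxt]
    return prob

def state_distribution(t, initial, transitions):
    """Distribution over states at position t (t = 0 is the first position)."""
    dist = dict(initial)
    for _ in range(t):
        # Propagate one step: sum over every state r that can move into s.
        dist = {s: sum(dist[r] * transitions[r][s] for r in dist) for s in dist}
    return dist

# Illustrative parameters (assumed, not from the slides).
initial = {"a": 0.5, "b": 0.5}
transitions = {"a": {"a": 0.5, "b": 0.5}, "b": {"a": 0.75, "b": 0.25}}

print(sequence_probability("aab", initial, transitions))  # 0.125
print(state_distribution(1, initial, transitions))        # {'a': 0.625, 'b': 0.375}
```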

25 Markov chains: positional dependencies
The connectivity or topology of a Markov chain can easily be designed to capture dependencies and variable-length motifs.

26 Markov chains: insertions and deletions

27 Markov chains: boundary detection
Given a sequence, we wish to label each symbol in the sequence according to its class (e.g. transmembrane or extracellular/cytosolic). How can this be done?

28 Markov chains: boundary detection (contd.)
Given a training set of labeled sequences, we can begin by modeling each amino acid as hydrophobic (H) or hydrophilic (L), i.e. reducing the 20 amino acids to two classes. A peptide sequence can then be represented as a sequence of Hs and Ls, e.g. HHHLLHLHHLHL...
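The reduction to two classes can be sketched as a simple lookup. The slide does not say which residues count as hydrophobic, so the set used below is an assumption chosen for illustration; different hydrophobicity scales would draw the line differently. The function name and the peptide are also hypothetical.

```python
# ASSUMPTION: this hydrophobic set is illustrative, not taken from the slides.
HYDROPHOBIC = set("AILMFVWC")

def to_hl(peptide):
    """Map each amino acid to H (hydrophobic) or L (hydrophilic)."""
    return "".join("H" if residue in HYDROPHOBIC else "L" for residue in peptide)

# Hypothetical peptide, made up for illustration only.
print(to_hl("MKTLLV"))  # HLLHHH
```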

29 Markov chains: boundary detection (contd.)
A simpler question: is a given sequence a transmembrane sequence?
A Markov chain for recognizing transmembrane sequences:
Question: Is sequence HHLHH a transmembrane protein?
P(HHLHH) = 0.6 × 0.7 × 0.7 × 0.3 × 0.7 × 0.7 = 0.043
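The chain diagram is not preserved in this transcript, so the sketch below does not attempt to rebuild the model; it only checks the product using the factors exactly as listed on the slide. The log-space version is included because multiplying many small probabilities directly can underflow for long sequences.

```python
import math

# Factors exactly as listed on the slide; their mapping to specific
# transitions is not recoverable without the diagram.
factors = [0.6, 0.7, 0.7, 0.3, 0.7, 0.7]

p = math.prod(factors)
log_p = sum(math.log(f) for f in factors)  # same quantity in log space

print(round(p, 3))                # 0.043
print(round(math.exp(log_p), 3))  # 0.043
```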

