BLAST to PSI-BLAST

1 BLAST to PSI-BLAST: BLAST makes use of a scoring matrix derived from a large number of proteins. What if you want to find homologs based upon a specific gene product? Develop a position-specific scoring matrix (PSSM).

2 PSSM: [Table: position-specific scoring matrix for the sequence MGASF, with one score per position (M, G, A, S, F) for each of the amino acids M F W Y G A P V I L C R K E N D Q S T H, plus a separate INDEL score for each position.] The matrix can include a score for permitting insertions and deletions. Perhaps a given position is at a turn, where indels are common.
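A minimal sketch of how such a matrix might be assembled, assuming a small set of already-aligned sequences. The five sequences below are invented for illustration and are not the data behind the slide; the function name `count_matrix` is likewise hypothetical. The sketch only counts residues and gaps per column, whereas a real PSSM would go on to convert counts into log-odds scores.

```python
from collections import Counter

# Amino acid column order as listed on the slide.
AMINO_ACIDS = "MFWYGAPVILCRKENDQSTH"

def count_matrix(aligned_seqs):
    """Return one dict per alignment column: residue counts plus an INDEL count."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    rows = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned_seqs)
        row = {aa: counts.get(aa, 0) for aa in AMINO_ACIDS}
        row["INDEL"] = counts.get("-", 0)  # '-' marks a gap in this sketch
        rows.append(row)
    return rows

# Hypothetical alignment, for illustration only.
example = ["MGASF", "MGASF", "MAASF", "MG-TF", "MGSSY"]
pssm = count_matrix(example)
```

Here `pssm[0]["M"]` is 5 because every sequence has M in the first column, and `pssm[2]["INDEL"]` is 1 because one sequence has a gap in the third column.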

3 Utility of PSI-BLAST: Identify distantly related proteins based upon the profile. These potential matches may suggest functions. Note that the profile adds information only over the identified region of similarity.

4 Problem of approach: PSI-BLAST is iterative. It takes the best hits and uses them to improve the scoring matrix. The investigator must be certain that new hits are correct, and that the region of interest is included in the PSSM.

5 Multiple Sequence Alignment

6 Multiple Sequence Alignment (MSA): Can define the most similar regions in a set of proteins, such as functional domains and structural domains. If the structure of one (or more) members is known, it may be possible to predict some structure of the other members.

7 Multiple Sequence Alignment: amino terminus of Groucho

8 Poor alignment of N (and C) Terminus

9 Well conserved region, bordered by lower similarity. What are the regions of lower similarity?

10 MSA and Sequence Pair Alignment: Dynamic programming (the matrix approach) provides an optimal alignment between two sequences. It is difficult to extend to multiple alignment, because the number of comparisons grows exponentially with added sequences.
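The matrix approach for two sequences can be sketched as a Needleman-Wunsch style global alignment. The scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration; a real protein alignment would use a substitution matrix instead of a flat match/mismatch score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

result_same_length = needleman_wunsch("MSGL", "MTNL")
result_with_gap = needleman_wunsch("MSGL", "MGL")
```

The table has (len(a) + 1) x (len(b) + 1) cells, which is what makes the two-sequence case tractable; each extra sequence adds another dimension to that table.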

11 Optimal alignment [Figure: alignment matrix with Seq 1 on one axis and Seq 2 on the other.]

12 How to add a third sequence? Complete all pair-wise comparisons.

13 Each added alignment imposes boundaries on final MSA.

14

15 Optimal Multiple Sequence Alignment: For more than three sequences, the problem extends into N-dimensional space.
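A quick calculation shows why the N-dimensional version becomes impractical. Assuming the dynamic programming table has one axis per sequence, with length + 1 positions along each axis, the number of cells is the product of those axis sizes. The helper name `dp_cells` is hypothetical.

```python
from math import prod

def dp_cells(lengths):
    """Number of cells in a dynamic programming table with one axis per sequence."""
    return prod(length + 1 for length in lengths)

two_seqs = dp_cells([100, 100])        # pairwise alignment
three_seqs = dp_cells([100, 100, 100])  # adding one more sequence
ten_seqs = dp_cells([100] * 10)         # ten sequences of length 100
```

For sequences of length 100 the table grows from 10,201 cells for two sequences to 1,030,301 for three, and to 101 to the 10th power for ten.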

16 Scoring MSA: Add the scores derived from pair-wise alignments to obtain the sum of pairs (SP score). Gaps: a constant penalty for any size of gap.
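A sketch of the sum-of-pairs idea, under stated assumptions: residue pairs are scored with a flat match/mismatch value rather than a substitution matrix, `-` marks a gap, and each contiguous run of gaps in one sequence of a pair is charged a single constant penalty regardless of its length. The scoring values and the toy alignment are illustrative only.

```python
from itertools import combinations

def pair_score(x, y, match=1, mismatch=0, gap_penalty=-1):
    """Score two rows of an MSA, charging one constant penalty per gap run."""
    total = 0
    gap_in = None  # which row ('x' or 'y') the current gap run belongs to
    for cx, cy in zip(x, y):
        if cx == "-" and cy == "-":
            continue  # both rows gapped: column is ignored for this pair
        if cx == "-" or cy == "-":
            side = "x" if cx == "-" else "y"
            if side != gap_in:  # a new gap run starts here
                total += gap_penalty
                gap_in = side
        else:
            total += match if cx == cy else mismatch
            gap_in = None
    return total

def sp_score(msa):
    """Sum of pair scores over every pair of rows in the alignment."""
    return sum(pair_score(x, y) for x, y in combinations(msa, 2))

toy_msa = ["MS--GL", "MT--NL", "MSA-NI", "MTARNL"]
total_sp = sp_score(toy_msa)
```

With four rows there are six pairs; their scores here are 2, 1, 1, 1, 3, and 2, giving an SP score of 10.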

17 Progressive MSA: Do pair-wise alignments. Develop an evolutionary tree. The most closely related sequences are aligned first, then more distant ones are added. Genetic distance: the number of mismatched positions divided by the total number of matched positions (gaps not considered).
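The distance measure and the choice of which sequences to align first can be sketched as follows. Two simplifying assumptions: "matched positions" is read as columns where both sequences have a residue, and the sequences are taken as already aligned to a common length (a progressive aligner would align each pair separately first). The sequence names and data are hypothetical.

```python
from itertools import combinations

def genetic_distance(x, y):
    """Mismatched positions divided by compared positions, skipping gap columns."""
    compared = mismatched = 0
    for cx, cy in zip(x, y):
        if cx == "-" or cy == "-":
            continue  # gaps are not considered
        compared += 1
        if cx != cy:
            mismatched += 1
    return mismatched / compared if compared else 0.0

def closest_pair(aligned):
    """Return the names of the two sequences with the smallest distance."""
    return min(
        combinations(sorted(aligned), 2),
        key=lambda pair: genetic_distance(aligned[pair[0]], aligned[pair[1]]),
    )

seqs = {
    "seq1": "MS--GL",
    "seq2": "MT--NL",
    "seq3": "MSA-NI",
    "seq4": "MTARNL",
}
first_to_align = closest_pair(seqs)
```

In this toy data seq2 and seq4 agree at every compared position, so they would be aligned first and the others added afterward.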

18 Example. Domain: a segment of a protein that can fold to a 3D structure independent of other segments of the protein. CARD domain: Caspase recruitment domains (CARDs) are modules of 90-100 amino acids involved in apoptosis signaling pathways. http://www.mshri.on.ca/pawson/card.html

19

20

21 Previous tree was rooted. These are unrooted trees.

22 Gaps: ClustalW attempts to place gaps between conserved domains. In known sequences, gaps are preferentially found between secondary structure elements (alpha helices, beta strands).

23 [Figure: trees over A, B, and C drawn with the branches in different orders.] These are equivalent trees.

24 Problem with Progressive Alignment: Errors made in early alignments are propagated throughout the MSA

25 Profiles & Gaps: From an MSA, a conserved region is identified and a scoring matrix (profile) is constructed for that region. Each position has a score associated with an amino acid substitution or gap. Blocks are also extracted from an MSA, but no gaps are permitted.

26 Block Server http://blocks.fhcrc.org/blocks/blocks_search.html Results

27 Hidden Markov Models: A probabilistic model of a multiple sequence alignment. No indel penalties are needed. Experimentally derived information can be incorporated. Parameters are adjusted to represent observed variation. Requires at least 20 sequences.

28 The bottom row of states are the main states (M); these model the columns of the alignment. The second row of diamond-shaped states are the insert states (I); these are used to model the highly variable regions in the alignment. The top row of circles are the delete states (D); these are silent or null states because they do not match any residues and simply allow main states to be skipped. [Figure: profile HMM with begin state B, main states M1-M6, insert states I0-I6, delete states D1-D6, and end state E.]

29 The Evolution of a Sequence: Over long periods of time a sequence will acquire random mutations. These mutations may result in a new amino acid at a given position, the deletion of an amino acid, or the introduction of a new one. Over VERY long periods of time two sequences may diverge so much that their relationship cannot be seen through the direct comparison of their sequences.

30 Hidden Markov Models: Pair-wise methods rely on direct comparisons between two sequences. In order to overcome the differences in the sequences, a third sequence is introduced, which serves as an intermediate. A high-scoring hit between the first and third sequences, as well as between the second and third sequences, implies a relationship between the first and second sequences (a transitive relationship).

31 Introducing the HMM The intermediate sequence is kind of like a missing link. The intermediate sequence does not have to be a real sequence. The intermediate sequence becomes the HMM.

32 Introducing the HMM: The HMM is a mix of all the sequences that went into its making. The score of a sequence against the HMM shows how well the HMM serves as an intermediate for that sequence, that is, how likely the sequence is to be related to all the other sequences that the HMM represents.

33 Match states with no indels. [Figure: chain of states B, M1, M2, M3, M4, E.] Aligned sequences: MSGL and MTNL. Each arrow indicates a transition probability, in this case 1 for each step.

34 Match states with no indels. [Figure: chain of states B, M1, M2, M3, M4, E.] Aligned sequences: MSGL and MTNL. Each position also has a probability for each residue: M = 1, S = 0.5, T = 0.5.

35 [Figure: chain of states B, M1, M2, M3, M4, E.] Aligned sequences: MSGL and MTNL. M = 1, S = 0.5, T = 0.5. Typically one wants to incorporate a small probability for all other amino acids.
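The residue probabilities for the two-sequence example, and the effect of adding a small probability for every other amino acid, can be sketched like this. The uniform pseudocount of 0.05 is an assumed value chosen for illustration, and the function names are hypothetical; transitions are taken as 1 at every step, as in the no-indel model.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probabilities(column, pseudocount=0.0):
    """Residue probabilities for one alignment column, with optional smoothing."""
    counts = Counter(column)
    total = len(column) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}

def log_probability(seq, emissions):
    """Log probability of seq under a match-only model (all transitions are 1)."""
    return sum(math.log(e[aa]) for aa, e in zip(seq, emissions))

alignment = ["MSGL", "MTNL"]
columns = ["".join(col) for col in zip(*alignment)]

raw = [emission_probabilities(c) for c in columns]
smoothed = [emission_probabilities(c, pseudocount=0.05) for c in columns]

score_member = log_probability("MSGL", smoothed)
score_variant = log_probability("MANL", smoothed)
```

Without smoothing, any residue not seen in a column has probability 0, so a sequence such as MANL could never be scored; with the pseudocount it simply scores lower than MSGL.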

36 Permit insertion states. [Figure: states B, M1-M4, E with insert states I0-I4.] Aligned sequences: MS.GL, MT.NL, MSANI. Transition probabilities may not be 1.

37 Permit insertion states. [Figure: states B, M1-M4, E with insert states I0-I4.] Aligned sequences: MS..GL, MT..NL, MSA.NI, MTARNL.

38 [Figure: states B, M1-M6, E with insert states I0-I6 and delete states D1-D6.] Aligned sequences: MS..GL--, MT..NLAG, MSA.NIAG, MTARNLAG. Delete states permit incorporation of the last two sites of sequence 1.

39 The bottom row of states are the main states (M); these model the columns of the alignment. The second row of diamond-shaped states are the insert states (I); these are used to model the highly variable regions in the alignment. The top row of circles are the delete states (D); these are silent or null states because they do not match any residues and simply allow main states to be skipped. [Figure: profile HMM with begin state B, main states M1-M6, insert states I0-I6, delete states D1-D6, and end state E.]

40 Dirichlet Mixtures: Additional information to expand the potential amino acids at individual sites, based on the observed frequency of amino acids seen in certain chemical environments: aromatic, acidic, basic, neutral, polar.

