Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics.


1 Bayesian Classification of Protein Data Thomas Huber huber@maths.uq.edu.au Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics The University of Queensland

2 Today’s talk Protein score functions from mining protein data –Bayesian classification A toy example A protein scoring function for fold recognition Where are score/energy functions useful? –A few examples

3 Why do we care about Protein Structures/Prediction? Academic curiosity? –Understanding how nature works Urgency of prediction –≈10^4 structures are determined, insignificant compared to all proteins –sequencing = fast & cheap –structure determination = hard & expensive [Figure legend: Transistors in Intel processors; TrEMBL sequences (computer annotated); SwissProt sequences (annotated); structures in PDB]

4

5 Three basic choices in (molecular) modelling Representation –Which degrees of freedom are treated explicitly Scoring –Which scoring function (force field) Searching –Which method to search or sample conformational space

6 Protein Scoring Functions from Mining Protein Data Classification Theory –Find a set of classes and their descriptors (a classification) for n data, q attributes (shape, amino acid type, etc.) Theory of finite mixtures Class: attribute probability distribution of all members
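The finite-mixture idea on this slide can be written out as a formula. This is only a sketch in assumed notation: the symbols \pi_j (class weights), \theta_j (class parameters) and m (number of classes) do not appear on the slide; X_i and c_j follow the notation of the next slide.

```latex
P(X_i \mid \theta, \pi, m) \;=\; \sum_{j=1}^{m} \pi_j \, P(X_i \mid X_i \in c_j, \theta_j),
\qquad \sum_{j=1}^{m} \pi_j = 1
```

Each class contributes its own attribute distribution, weighted by how much of the data it is expected to hold, so a datum is never assigned to exactly one class but carries a membership probability for every class.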

7 Bayesian approach Simplifications –Stating a simplified model –Assume attributes are independently distributed P(X_i ∈ c_j | S) requires class description –Expectation Maximization (EM)
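The two simplifications on this slide (independent attributes, EM for the class description) can be sketched in a few lines of Python. This is a minimal illustration, not AutoClass itself: it fits plain maximum-likelihood parameters, handles only real-valued attributes with one Gaussian each, and ignores the periodicity of angles. The function name, the `init_means` argument and the `MIN_STD` floor are assumptions made for the example.

```python
import math

MIN_STD = 1e-3  # assumed floor that keeps a class from collapsing onto one point


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_independent_gaussians(data, init_means, n_iter=50):
    """EM for a mixture whose attributes are independent Gaussians within each class."""
    n, n_attr, n_classes = len(data), len(data[0]), len(init_means)

    # Start every class with the overall spread of each attribute.
    spread = []
    for k in range(n_attr):
        mu = sum(x[k] for x in data) / n
        var = sum((x[k] - mu) ** 2 for x in data) / n
        spread.append(max(math.sqrt(var), MIN_STD))

    means = [list(m) for m in init_means]
    stds = [list(spread) for _ in range(n_classes)]
    weights = [1.0 / n_classes] * n_classes

    for _ in range(n_iter):
        # E-step: membership probabilities P(X_i in c_j | current parameters).
        # Independence lets the class likelihood factor into a product over attributes.
        memberships = []
        for x in data:
            joint = [
                weights[j]
                * math.prod(gaussian_pdf(x[k], means[j][k], stds[j][k]) for k in range(n_attr))
                for j in range(n_classes)
            ]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            memberships.append([p / total for p in joint])

        # M-step: re-estimate weights, means and spreads from the soft memberships.
        for j in range(n_classes):
            nj = sum(r[j] for r in memberships) or 1e-300
            weights[j] = nj / n
            for k in range(n_attr):
                mu = sum(r[j] * x[k] for r, x in zip(memberships, data)) / nj
                var = sum(r[j] * (x[k] - mu) ** 2 for r, x in zip(memberships, data)) / nj
                means[j][k] = mu
                stds[j][k] = max(math.sqrt(var), MIN_STD)

    return weights, means, stds
```

Calling `em_independent_gaussians(points, [(-50.0, -30.0), (-110.0, 140.0)])` on a list of two-dimensional points returns the class weights, per-attribute means and per-attribute standard deviations after 50 iterations.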

8 How many classes Again Bayes’ rule P(m) favours smaller number of classes –No over-fitting of data (like with maximum likelihood methods)
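Applying Bayes' rule to the number of classes m can be written as follows. The notation is assumed (X for the full data set, \theta for all class parameters); only P(m) appears on the slide.

```latex
P(m \mid X) \;=\; \frac{P(X \mid m)\, P(m)}{\sum_{m'} P(X \mid m')\, P(m')},
\qquad
P(X \mid m) \;=\; \int P(X \mid \theta, m)\, P(\theta \mid m)\, d\theta
```

A maximum likelihood fit only looks at the best-fitting \theta for each m, and that fit cannot get worse as classes are added. Here the prior P(m) weighs against a large number of classes, and P(X \mid m) averages over all parameter values rather than taking the best one, which is one way to read the slide's claim that over-fitting is avoided.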

9 A Toy Example Dihedral preference of Valine Four interesting degrees of freedom –φ-, ψ-dihedral angle –Adjacent amino acid types Data: 893 non-redundant proteins –12074 four-dimensional data points [Figure labels: φ, ψ, i-1, i+1]

10 Valine Data Classification AutoClass classification –Model: Gaussian distribution for φ/ψ, discrete probabilities for amino acids –Total of 50 tries with #classes ∈ [2:11] –Each try refined until fully converged → Best classification has 5 classes

11 Amino Acid Attribute vectors of α-helix Classes Log-Preferences

12 Re-invention of the Wheel Textbook secondary structure pattern –Helices are likely on outside of proteins –I, I+3 and I+4 hydrophobic interface From C.-I. Branden and J. Tooze, Introduction to Protein Structure

13 Fragment-based Protein Scoring Find classification for fragments of size 7 residues –237566 fragments (1494 non-redundant protein chains) –28 descriptors: 7 amino acid types, 14 φ-/ψ-dihedral angles, 7 numbers of neighbours of each amino acid ≈200 CPU hours on National Facility computers → 325 classes (modelling the probability distribution of native fragments) Use this classification to evaluate likelihood of a fragment sequence-structure match Total score = Σ fragment scores
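The last line of this slide, total score = sum of fragment scores, can be sketched as a sliding window over a sequence threaded onto a structure. This is an illustration under assumptions: the slide does not say whether fragments overlap (overlapping windows are assumed here), and `fragment_score` is a hypothetical callable standing in for the likelihood given by the 325-class model. The names `fragments` and `total_score` are also made up for the example.

```python
from typing import Callable, List, Sequence

FRAGMENT_LENGTH = 7  # fragment size stated on the slide


def fragments(items: Sequence, length: int = FRAGMENT_LENGTH) -> List[Sequence]:
    """Return every overlapping window of `length` consecutive items."""
    return [items[i:i + length] for i in range(len(items) - length + 1)]


def total_score(
    sequence: str,
    sites: Sequence,
    fragment_score: Callable[[Sequence, Sequence], float],
) -> float:
    """Sum the fragment scores of a sequence placed residue-by-residue on structure sites.

    `sites` holds one structural descriptor per residue position (for example
    dihedral angles and a neighbour count); its exact content is left open here.
    """
    if len(sequence) != len(sites):
        raise ValueError("sequence and structure must have the same length")
    return sum(
        fragment_score(seq_frag, site_frag)
        for seq_frag, site_frag in zip(fragments(sequence), fragments(sites))
    )
```

A chain of N residues yields N - 6 overlapping fragments, so each interior residue contributes to up to seven fragment scores under this assumption.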

14 Fold Recognition = Computer Matchmaking Structure Disco

15 Does it work? Discrimination (TIM 1amk_) Generalisation

16 Sequence-Structure Matching The search problem Gapped alignment = combinatorial nightmare

17 Why is Fold Recognition better than Sequence Comparison? Comparison is done in structure space not in sequence space

18 Finding Remote Homologues with sausage 572 sequence-structure pairs Structures are similar (FSSP): > 70% structurally aligned, < 20% sequence identity

19 RNA-dependent RNA Polymerases

20 A Real Case Example RNA-dependent RNA polymerases Dengue virus Bacteriophage φ6

21 Is this Yet Another Profile Method? Yes, but a much more general profile method –Profile is not residue based (like profile-like threading force fields) –Profiles not for protein families (like in HMMs or Ψ-Blast) –BUT local sequence profiles for optimally chosen classes of fragments Local profiles can be arbitrarily assembled –Extreme flexibility Sequence-structure alignment (=assembling best profile matches) –Deterministic, using dynamic programming
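The deterministic dynamic programming step can be illustrated with a generic global alignment in the style of Needleman and Wunsch. This is a sketch under assumptions and not the algorithm used in sausage: it takes a precomputed table `match[i][j]` (the score for placing residue i on structure site j) and a single linear `gap` penalty, both of which are hypothetical inputs chosen for the example.

```python
from typing import List


def best_alignment_score(match: List[List[float]], gap: float = -1.0) -> float:
    """Best score of a gapped global alignment of residues (rows) to sites (columns)."""
    n_res = len(match)
    n_sites = len(match[0]) if n_res else 0

    # best[i][j] = best score aligning the first i residues with the first j sites.
    best = [[0.0] * (n_sites + 1) for _ in range(n_res + 1)]
    for i in range(1, n_res + 1):
        best[i][0] = i * gap  # leading residues left unaligned
    for j in range(1, n_sites + 1):
        best[0][j] = j * gap  # leading sites left empty

    for i in range(1, n_res + 1):
        for j in range(1, n_sites + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + match[i - 1][j - 1],  # residue i placed on site j
                best[i - 1][j] + gap,                      # residue i skipped
                best[i][j - 1] + gap,                      # site j skipped
            )
    return best[n_res][n_sites]
```

The table has (residues + 1) × (sites + 1) cells and each is filled once, so the optimum over all gapped alignments is found in time proportional to the product of the two lengths instead of by enumerating the combinatorial number of alignments.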

22 People sausage –Andrew Torda (RSC) –Oliver Martin (RSC) GlnB/GlnK, RdR polymerases –Subhash Vasudevan (JCU) Sausage and Cassandra freely available http://rsc.anu.edu.au/~torda huber@maths.uq.edu.au

