Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics.

Presentation transcript:

1 Bayesian Classification of Protein Data
Thomas Huber (huber@maths.uq.edu.au)
Computational Biology and Bioinformatics Environment (ComBinE)
Department of Mathematics, The University of Queensland

2 Today’s talk
Protein score functions from mining protein data
–Bayesian classification
   A toy example
   A protein scoring function for fold recognition
Where are score/energy functions useful?
–A few examples

3 Why do we care about Protein Structures/Prediction?
Academic curiosity?
–Understanding how nature works
Urgency of prediction
–~10^4 structures have been determined: insignificant compared to all proteins
–sequencing = fast & cheap
–structure determination = hard & expensive
[Figure legend: Transistors in Intel processors; TrEMBL sequences (computer annotated); SwissProt sequences (annotated); structures in PDB]

4

5 Three basic choices in (molecular) modelling
Representation
–Which degrees of freedom are treated explicitly
Scoring
–Which scoring function (force field)
Searching
–Which method to search or sample conformational space

6 Protein Scoring Functions from Mining Protein Data
Classification Theory
–Find a set of classes and their descriptors (a classification) for n data points, each with q attributes (shape, amino acid type, etc.)
Theory of finite mixtures
Class = attribute probability distribution of all its members
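For orientation, a finite mixture in its standard textbook form can be written as below. The symbols m (number of classes), \pi_j (class weights) and \theta_j (class parameters) are assumed notation and are not taken from the slide.

```latex
P(X_i \mid \theta, \pi, S)
  = \sum_{j=1}^{m} \pi_j \, P(X_i \mid X_i \in c_j, \theta_j, S),
\qquad \sum_{j=1}^{m} \pi_j = 1
```

Each class c_j is thus described by the probability distribution of the attributes of its members, weighted by how common the class is.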

7 Bayesian approach
Simplifications
–Stating a simplified model
–Assume attributes are independently distributed
P(X_i ∈ c_j | S) requires class description
–Expectation Maximization (EM)
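A minimal sketch of the EM idea named above, reduced to a single Gaussian attribute and a fixed number of classes. All function names and the toy data are hypothetical, and the periodicity of real dihedral angles is ignored; this is not the procedure used in the talk, only an illustration of the alternating E- and M-steps.

```python
import math
import random


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(data, means, sds, weights, n_iter=50):
    """Run n_iter EM iterations for a one-dimensional Gaussian mixture."""
    means, sds, weights = list(means), list(sds), list(weights)
    m = len(means)
    for _ in range(n_iter):
        # E-step: membership probability of every data point in every class.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], sds[j]) for j in range(m)]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([p / total for p in joint])
        # M-step: re-estimate each class description from its weighted members.
        for j in range(m):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # empty class: keep its previous parameters
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            sds[j] = max(math.sqrt(var), 1e-3)
    return means, sds, weights


def make_toy_angles(seed=0):
    """Hypothetical data: 200 angles near -60 degrees and 100 near 120 degrees."""
    rng = random.Random(seed)
    return ([rng.gauss(-60.0, 10.0) for _ in range(200)]
            + [rng.gauss(120.0, 10.0) for _ in range(100)])
```

Calling `em_gaussian_mixture(data, [min(data), max(data)], [30.0, 30.0], [0.5, 0.5])` on the toy data starts one class at each extreme and lets the iterations refine the class descriptions.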

8 How many classes?
Again Bayes’ rule
P(m) favours a smaller number of classes
–No over-fitting of the data (as occurs with maximum likelihood methods)
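Written out, Bayes' rule for the number of classes m takes the following general form. Reading P(m) as a prior over m and X as the full data set is an assumption about the slide's notation.

```latex
P(m \mid X, S) = \frac{P(X \mid m, S) \, P(m \mid S)}{P(X \mid S)}
  \;\propto\; P(X \mid m, S) \, P(m \mid S)
```

A larger m can always fit the data at least as well, so a term that favours small m is what keeps the preferred classification from growing without bound.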

9 A Toy Example
Dihedral preference of Valine
Four interesting degrees of freedom
–φ-, ψ-dihedral angles
–Adjacent amino acid types
Data: 893 non-redundant proteins
–12074 four-dimensional data points
[Figure labels: φ, ψ, i-1, i+1]

10 Valine Data Classification
AutoClass classification
–Model: Gaussian distribution for φ/ψ, discrete probabilities for amino acids
–Total of 50 tries with #classes ∈ [2:11]
–Each try refined until fully converged
→ Best classification has 5 classes
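The model above (Gaussian terms for φ/ψ, discrete probabilities for the neighbouring amino acids, attributes treated as independent) can be sketched as follows. The two classes, their parameters and all names are invented for illustration and are not the classes found in the talk; angle periodicity is again ignored.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_likelihood(point, cls):
    """Likelihood of one (phi, psi, previous residue, next residue) point in one class.

    Attributes are assumed independent, so the four terms are simply multiplied.
    """
    phi, psi, prev_aa, next_aa = point
    return (gaussian_pdf(phi, *cls["phi"])
            * gaussian_pdf(psi, *cls["psi"])
            * cls["prev"].get(prev_aa, 1e-6)   # small floor for unseen residue types
            * cls["next"].get(next_aa, 1e-6))


def class_posteriors(point, classes, weights):
    """Posterior probability of each class for one data point (Bayes' rule)."""
    joint = [w * class_likelihood(point, c) for w, c in zip(weights, classes)]
    total = sum(joint)
    return [p / total for p in joint]


# Hypothetical class descriptions: (mean, sd) in degrees plus residue probabilities.
HELIX_LIKE = {"phi": (-60.0, 15.0), "psi": (-45.0, 15.0),
              "prev": {"A": 0.4, "L": 0.4, "G": 0.2},
              "next": {"A": 0.4, "L": 0.4, "G": 0.2}}
STRAND_LIKE = {"phi": (-120.0, 25.0), "psi": (130.0, 25.0),
               "prev": {"V": 0.5, "I": 0.3, "G": 0.2},
               "next": {"V": 0.5, "I": 0.3, "G": 0.2}}
```

For example, `class_posteriors((-62.0, -40.0, "A", "L"), [HELIX_LIKE, STRAND_LIKE], [0.5, 0.5])` assigns nearly all of the probability to the helix-like class.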

11 Amino Acid Attribute vectors of α-helix Classes
[Figure axis: Log-Preferences]

12 Re-invention of the Wheel
Textbook secondary structure pattern
–Helices are likely on the outside of proteins
–i, i+3 and i+4 form a hydrophobic interface
From C.-I. Branden and J. Tooze, Introduction to Protein Structure

13 Fragment-based Protein Scoring
Find classification for fragments of size 7 residues
–237566 fragments (1494 non-redundant protein chains)
–28 descriptors
   7 amino acid types
   14 φ-/ψ-dihedral angles
   7 neighbour counts (one per amino acid)
→ 200 CPU hours on National Facility computers
→ 325 classes (modelling the probability distribution of native fragments)
Use this classification to evaluate the likelihood of a fragment sequence-structure match
Total score = Σ fragment scores
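The final line of the slide, a total score summed over fragments, can be sketched as a sliding window of 7 residues. The window length comes from the slide; the function names and the idea of passing the fragment scorer in as a callable are assumptions made for illustration.

```python
FRAGMENT_LENGTH = 7  # fragment size stated on the slide


def fragments(residues, size=FRAGMENT_LENGTH):
    """All overlapping windows of `size` consecutive residues, in chain order."""
    return [residues[i:i + size] for i in range(len(residues) - size + 1)]


def total_score(residues, fragment_score):
    """Sum a caller-supplied fragment score over every overlapping fragment.

    `fragment_score` is a hypothetical callable; in the approach described it
    would be derived from the fragment classification, for instance a
    log-likelihood under the class mixture.
    """
    return sum(fragment_score(frag) for frag in fragments(residues))
```

A chain of n residues yields n - 6 fragments, and a chain shorter than 7 residues yields none, so its total score is 0.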

14 Fold Recognition = Computer Matchmaking Structure Disco

15 Does it work?
Discrimination (TIM 1amk_)
Generalisation
[Figure panel labels: 1 2 3 4 5; 1 2 5 3 4]

16 Sequence-Structure Matching The search problem Gapped alignment = combinatorial nightmare

17 Why is Fold Recognition better than Sequence Comparison? Comparison is done in structure space not in sequence space

18 Finding Remote Homologues with sausage
572 sequence-structure pairs
Structures are similar (FSSP)
> 70% structurally aligned
< 20% sequence identity

19 RNA-dependent RNA Polymerases

20 A Real Case Example
RNA-dependent RNA polymerases
Dengue virus
Bacteriophage φ6

21 Is this Yet Another Profile Method?
Yes, but a much more general profile method
–Profile is not residue based (as in profile-like threading force fields)
–Profiles not for protein families (as in HMMs or Ψ-Blast)
–BUT local sequence profiles for optimally chosen classes of fragments
Local profiles can be arbitrarily assembled
–Extreme flexibility
Sequence-structure alignment (= assembling best profile matches)
–Deterministic, using dynamic programming
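As a generic illustration of deterministic alignment by dynamic programming, the sketch below fills a global alignment table over a sequence-by-structure score matrix with a linear gap penalty. It is a textbook Needleman-Wunsch-style recursion, not the algorithm implemented in sausage; the function name, the score matrix layout and the gap value are all assumptions.

```python
def best_alignment_score(score, gap=-1.0):
    """Best global alignment score for a score matrix with a linear gap penalty.

    score[i][j] is the (hypothetical) score for placing sequence residue i
    on structure position j; higher is better.
    """
    n = len(score)
    m = len(score[0]) if n else 0
    # best[i][j]: best score aligning the first i residues to the first j positions.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # match residue i to position j
                best[i - 1][j] + gap,                      # residue i left unaligned
                best[i][j - 1] + gap,                      # position j left unaligned
            )
    return best[n][m]
```

The table has (n + 1) × (m + 1) cells, each filled in constant time, which is what makes the otherwise combinatorial search over gapped alignments tractable.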

22 People
sausage
–Andrew Torda (RSC)
–Oliver Martin (RSC)
GlnB/GlnK, RdR polymerases
–Subhash Vasudevan (JCU)
Sausage and Cassandra freely available: http://rsc.anu.edu.au/~torda
huber@maths.uq.edu.au

