Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics.

Bayesian Classification of Protein Data Thomas Huber huber@maths.uq.edu.au Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics The University of Queensland huber@maths.uq.edu.au

Today’s talk Protein score functions from mining protein data –Bayesian classification A toy example A protein scoring function for fold recognition Where are score/energy functions useful? –A few examples

Why do we care about Protein Structures/Prediction? Academic curiosity? –Understanding how nature works Urgency of prediction –  10 4 structures are determined insignificant compared to all proteins –sequencing = fast & cheap –structure determination = hard & expensive Transistors in Intel processors TrEMBL sequences (computer annotated) SwissProt sequences (annotated) structures in PDB

Three basic choices in (molecular) modelling Representation –Which degrees of freedom are treated explicitly Scoring –Which scoring function (force field) Searching –Which method to search or sample conformational space

Protein Scoring Functions from Mining Protein Data Classification Theory –Find a set of classes and their descriptors (a classification) for n data q attributes (shape, amino acid type, etc.) Theory of finite mixtures Class  attribute probability distribution of all members

Bayesian approach Simplifications –Stating a simplified model –Assume attributes are independently distributed P(X i  c j |S) requires class description –Expectation Maximization (EM)

How many classes Again Bayes’ rule P(m) favours smaller number of classes –No over-fitting of data (like with maximum likelihood methods)

A Toy Example Dihedral preference of Valine Four interesting degrees of freedom –  -,  -dihedral angle –Adjacent amino acid types Data:893 non-redundant proteins –12074 four-dimensional data points   i-1i+1

Valine Data Classification AutoClass classification –Model: Gaussian distribution for  / , discrete probabilities for amino acids –Total of 50 tries with #classes  [2:11] –Each try refined until fully converged  Best classification has 5 classes

Amino Acid Attribute vectors of  -helix Classes Log-Preferences

Re-invention of the Wheel Textbook secondary structure pattern –Helices are likely on outside of proteins –I, I+3 and I+4 hydrophobic interface From C.-I. Branden and J. Tooze, Introduction to Protein Structure

Fragment-based Protein Scoring Find classification for fragments of size 7 residues –237566 fragments (1494 non-redundant protein chains) –28 descriptors 7 amino acid type 14  -/  -dihedral angles 7 number of neighbours of each amino acid  200 CPU hours on National Facility computers  325 classes (modelling the probability distribution of native fragments) Use this classification to evaluate likelihood of a fragment sequence- structure match Total score =  fragment scores

Fold Recognition = Computer Matchmaking Structure Disco

Does it work? Discrimination (TIM 1amk_) Generalisation 1 2 3 4 5 1 2 5 3 4

Sequence-Structure Matching The search problem Gapped alignment = combinatorial nightmare

Why is Fold Recognition better than Sequence Comparison? Comparison is done in structure space not in sequence space

Finding Remote Homologues with sausage 572 sequence-structure pairs Structures are similar (FSSP) > 70% structurally aligned < 20% sequence identity

RNA-dependent RNA Polymerases

A Real Case Example RNA-dependent RNA polymerases Dengue virus Bacteriophage  6

Is this Yet Another Profile Method? Yes, but a much more general profile method –Profile is not residue based (like profile-like threading force fields) –Profiles not for protein families (like in HMMs or  -Blast) –BUT local sequence profiles for optimally chosen classes of fragments Local profiles can be arbitrarily assembled –Extreme flexibility Sequence-structure alignment (=assembling best profile matches) –Deterministic, using dynamic programming

People sausage –Andrew Torda (RSC) –Oliver Martin (RSC) GlnB/GlnK, RdR polymerases –Subhash Vasudevan (JCU) Sausage and Cassandra freely available http://rsc.anu.edu.au/~torda huber@maths.uq.edu.au

Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics.

Similar presentations

Presentation on theme: "Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics.

Similar presentations

Presentation on theme: "Bayesian Classification of Protein Data Thomas Huber Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics."— Presentation transcript:

Similar presentations

About project

Feedback