Automated Exploration of Bioinformatics Spaces Simon Colton Computational Bioinformatics Laboratory.

Automated Exploration of Bioinformatics Spaces Simon Colton Computational Bioinformatics Laboratory

Purpose of the Talk  To make you aware of another tool which may have some potential for use in the Metalog project  To get feedback on this potential  To briefly describe two other projects

The Substructure Server  Old-style approach to using machine learning (ML) for predictive toxicology –What do the positives have in common that the negatives do not? –For chemicals, possibly using ILP is like using a sledgehammer to crack a nut  Substructures are often the answer (e.g., mutagenesis) –Substructure server looks explicitly for substructures  Vehicle for me to understand ML in predictive toxicology and server-client technology –May even be of some use one day

Substructure Server Development  Team –Simon Colton  Prolog machine learning routine (FIND-S) –Saravanan Anandathiyagar  Server technology –Laurence Darby  Distributing the process over our linux farm –Gives roughly 5 times speed up –A.N.Other masters student (TBA)  Front end (Babel)  Back end (Molgen, etc.)

Old-Style Predictive Toxicology  Reason 1: –Using only chemistry, attributes etc.  Not using biochemical pathways  Reason 2: –Using predictive machine learning  Not using descriptive machine learning

Predictive Induction in Bioinformatics  Interesting problem found –Interesting from a biochemistry perspective –Interesting from a computer science perspective  Packaged as prediction/classification –Turned into positives and negatives –Much work done to shoe-horn into a prediction task  Reason(s) learned why positives are positive –Almost guaranteed that any answer found will be interesting, because the problem is interesting

Generating Hypotheses  Predictive machine learning produces hypotheses of the form: –A  Toxic –Toxic  C –B  Toxic –D  ¬Toxic –etc.  With any luck, A, B or C will be interesting in their own right –And enter the biochemistry literature!

But what if…  There was an interesting relationship –Between a concept and a subset of the positives. Isn’t this interesting?  Examples: A  Toxic & B C  ¬Toxic & D & E

Predictive versus Descriptive Learning  Predictive learning –You know what you are looking for –You just don’t know what it looks like  Descriptive learning –You don’t know what you are looking for –But you want to find something interesting  Eventually: –You don’t even know you are looking for something

Descriptive Induction  Not as goal directed as predictive induction  Same background information given –Perhaps no categorisation into pos & neg  A theory is produced which contains: –Examples –Concepts which categorise/describe sets of examples –Hypotheses which relate concepts –Explanations which explain the hypotheses  For instance: –Acid + Base  Salt + Water  Tools are supplied so that –The user can extract interesting parts of the theory

The HR System in 3 Slides  Concept formation –Starts with background info like Progol –Builds new concepts from old ones  Using one of 15 production rules  (composition, instantiation, counting, matching, etc.)  Unary or binary  Many settings for how concept formation occurs –Derives examples & definition of concepts  Heuristic search (if user specifies) –Uses a best first search  20+ measures of interestingness for concepts/conjectures  Chooses to build new concepts from best old ones

The HR System in 3 Slides  Conjecture Making –“Proper” induction! –Notices patterns in examples for concepts  Newly formed concept has no examples –Makes a non-existence conjecture  Two concepts have exactly the same examples –Makes an equivalence conjecture  One concept’s examples are subset of another –Makes an implication conjecture –Extracts simpler hypotheses from empirical ones –Able to make “near-conjectures”  Patterns don’t have to be exact  User specifies a tolerance level

The HR System in 3 Slides  Generating explanations –User supplies a set of axioms –HR appeals to a third party theorem prover  And a third party model generator (otter/mace) –To attempt to prove/disprove  That the hypothesis follows from the axioms  Sometimes, explanations are interesting –In domains such as group theory  Explanations are proofs of theorems  Sometimes, explanations show that a hypothesis is dull –Anything provable by the theorem prover is trivial

Extreme(!) Theory Formation  All my best examples are from maths  Given only one concept: –How to divide two integers  HR finds the conjecture –Odd refactorable numbers are squares  Invented concepts: –Odd, square, refactorable, (even, tau, …)  Made concept of odd refactorables –Noticed the examples are a subset of the examples for square numbers  No proof supplied (I proved this one)

What HR Can Deliver  HR generates hypotheses like Progol –But there are too many –Require filters to prune dull ones  Some concepts might be interesting aside from their relation to toxicity  HR points out interesting examples –E.g., a molecule has the only occurrence of a particular sub-molecule

Interesting New Angle  Anomaly detection  First experiments in analysis of Bach chorale melodies –Which ones were different to the rest  Not necessarily breaking rules  Could be: something occurring more often –“Parsimony outlier” measure of interestingness  Hope to try this with metabolic pathways –Give me 30 pathways  I’ll give you reasons why each is unique –Give me an invented pathway  I’ll show you possible reasons it’s wrong…

What I need  Objects of interest –Pathways  Background concepts –Ways to describe the pathways  Axioms –What we know is true about pathways  Measures of interestingness –Essential to separate the wheat from chaff –Evolve over time as we use HR together

Future for my Work  Form theories about biochemical data  Domain of interest –Pathways  Technical problems –Enabling HR to work with probabilistic information (not yet possible) –Enabling HR to work with larger datasets –Understanding pathways!

The Amaze Database  Bioinformatics MSc. Project –Organised by Marek Sergot  Challenge –To resurrect the Amaze database  Of biochemical pathways –EBI originally, now Université libre de Bruxelles –To get hold of data, put into a database, put a front- end onto this, etc. –And write translation routines  So that we can get at the information  This is a resource we should use –Please let me know your requirements

Automated Exploration of Bioinformatics Spaces Simon Colton Computational Bioinformatics Laboratory.

Similar presentations

Presentation on theme: "Automated Exploration of Bioinformatics Spaces Simon Colton Computational Bioinformatics Laboratory."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Automated Exploration of Bioinformatics Spaces Simon Colton Computational Bioinformatics Laboratory.

Similar presentations

Presentation on theme: "Automated Exploration of Bioinformatics Spaces Simon Colton Computational Bioinformatics Laboratory."— Presentation transcript:

Similar presentations

About project

Feedback