Functional Genomic Hypothesis Generation and Experimentation by a Robot Scientist
  King et al., Nature
  Presented by Monica C. Sleumer
  February 5, 2004
Scientific Discovery
  “Branch of AI devoted to developing algorithms for acquiring scientific knowledge”
  Current applications:
    – Analysis of mass-spec data
    – Discovering structure-activity relationships for compounds
    – Making semantic connections in published literature
    – Predicting mechanisms for chemical reactions
    – Revising taxonomies to accommodate new data
  Connect to laboratory instrumentation
Accomplishment
  Automated entire scientific process
  Robotic system that uses AI to “carry out cycles of scientific experimentation”:
    – Originates hypotheses
    – Designs experiments
    – Performs the experiments
    – Interprets the results
Application: Functional genomics
  Function unknown for 30% of yeast genes
  Complete laboratory automation possible
  Goal: connect genes to their function
  Using:
    – Logical model of aromatic amino acid synthesis pathway
    – 8 deletion mutants
    – 9 metabolites
    – Auxotrophic growth experiments
Aromatic Amino Acid Pathway
Classical vs Robot Science
  Classical method:
    – Scientific expertise and imagination used to form hypotheses
    – Consequences of hypotheses tested by experiment
  Robot Scientist:
    – Hypotheses formed by abduction
    – Tested by deduction
Deduction and Abduction
  Deduction
    – Rule: P → Q, Fact: ~Q, Infer: ~P
    – E.g. If a cell grows on minimal medium, then it can synthesise tryptophan
    – Fact: Cell cannot synthesise tryptophan
    – ∴ Cell cannot grow on minimal medium
  Abduction
    – Rule: P → Q, Fact: ~P, Hypothesize: ~Q
    – E.g. If a cell grows on minimal medium, then it can synthesise tryptophan
    – Fact: Cell cannot grow on minimal medium
    – ∴ Cell cannot synthesise tryptophan
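The two inference patterns on this slide can be sketched in a few lines of Python. This is only an illustration. The function names `deduce_p` and `abduce_not_q` and the boolean encoding are assumptions made here, not anything taken from the paper. The point it shows is that deduction returns a certain conclusion, while abduction returns only a candidate explanation that still needs testing.

```python
from typing import Optional

# Rule: P -> Q, where (as on the slide)
#   P = "cell grows on minimal medium"
#   Q = "cell can synthesise tryptophan"

def deduce_p(q: Optional[bool]) -> Optional[bool]:
    """Deduction (modus tollens): from P -> Q and the fact ~Q, infer ~P.

    Returns the truth value of P when the rule determines it, else None.
    """
    if q is False:
        return False  # ~Q forces ~P
    return None       # Q true or unknown: the rule alone does not fix P


def abduce_not_q(p: Optional[bool]) -> Optional[str]:
    """Abduction: from P -> Q and the observation ~P, hypothesise ~Q.

    The result is a candidate explanation, not a logical consequence.
    """
    if p is False:
        return "hypothesis: cell cannot synthesise tryptophan"
    return None       # growth observed: nothing to explain


print(deduce_p(False))      # P is determined to be False
print(abduce_not_q(False))  # a hypothesis to be tested by experiment
```

Returning `None` for the undetermined cases keeps the asymmetry visible: neither function claims more than the rule supports.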
Implementation
  Software:
    – Background knowledge
    – Logical inference engine
    – Hypothesis generation code
    – Experiment selection code
    – LIMS code
  Hardware:
    – Liquid-handling robot
    – Plate reader
    – CPU to do the scientific reasoning
  No human intellectual input into:
    – Experimental design
    – Data interpretation
Robot Scientist
Logical Process
  Prolog used to model data
  Metabolic pathway represented as a directed graph
  Deduction: a knockout mutant will grow if and only if a path can be found from the given metabolites to the three required amino acids
  Abduction: if a knockout mutant does not grow on the given metabolites, hypothesize which enzyme is missing
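The graph-based reasoning on this slide can be sketched as a reachability check. The original system used Prolog; the Python below is a stand-in, and the pathway in it is a toy. The enzyme names `E1` to `E4`, the metabolite names, and the single-substrate reactions are all hypothetical and do not represent the real aromatic amino acid pathway.

```python
from collections import deque

# Toy pathway (hypothetical): each reaction is (enzyme, substrate, product).
REACTIONS = [
    ("E1", "start", "m1"),
    ("E2", "m1", "m2"),
    ("E3", "m2", "aa1"),
    ("E4", "m2", "aa2"),
]
TARGETS = {"aa1", "aa2"}  # end products the cell needs in order to grow


def reachable(medium, missing_enzyme=None):
    """Breadth-first search using only reactions whose enzyme is present."""
    seen = set(medium)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for enzyme, substrate, product in REACTIONS:
            if enzyme == missing_enzyme or substrate != node:
                continue
            if product not in seen:
                seen.add(product)
                queue.append(product)
    return seen


def predicts_growth(medium, missing_enzyme=None):
    """Deduction: growth iff every target is reachable from the medium."""
    return TARGETS <= reachable(medium, missing_enzyme)


def abduce_missing_enzyme(medium, grew):
    """Abduction: enzymes whose absence is consistent with the observation."""
    enzymes = sorted({enzyme for enzyme, _, _ in REACTIONS})
    return [e for e in enzymes if predicts_growth(medium, e) == grew]


# No growth on the bare medium is explained by any single missing enzyme.
print(abduce_missing_enzyme({"start"}, grew=False))
# Supplementing with m2 bypasses E1 and E2, narrowing the hypotheses.
print(abduce_missing_enzyme({"start", "m2"}, grew=False))
```

In this toy setting, adding an intermediate metabolite to the medium is what discriminates between hypotheses, which is the role the auxotrophic growth experiments play in the talk.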
Machine Learning
  Improves performance based on prior experience
  Each hypothesis has
    – Cost of testing
    – Probability of being correct
  Goals
    – Find out which gene goes with which enzyme
    – Use the fewest possible resources
Experiment Choosing
  3 ways:
    – Intelligent: “ASE”
    – Cheapest Experiment: Naïve
    – Random Experiment
  Performance:
    – Accuracy: # of correct predictions made
    – Cost and number of experiments required
  Both real experiments and simulations
  Comparison to human
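The three selection strategies can be contrasted with a small sketch. The slides do not give the cost function behind ASE, so the "informed" chooser below substitutes a simple heuristic (outcome entropy per unit cost) and should not be read as the paper's method. The hypotheses, probabilities, costs, and predictions are all made-up illustrative values.

```python
import math
import random

# Hypothetical inputs: prior probability of each hypothesis, and for each
# candidate experiment its cost and the growth outcome each hypothesis predicts.
PRIORS = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
EXPERIMENTS = {
    "exp_a": {"cost": 1.0, "predicts": {"h1": True, "h2": True, "h3": True}},
    "exp_b": {"cost": 2.0, "predicts": {"h1": True, "h2": False, "h3": False}},
    "exp_c": {"cost": 5.0, "predicts": {"h1": True, "h2": False, "h3": True}},
}


def outcome_entropy(name):
    """Entropy (bits) of the grow / no-grow outcome under the priors."""
    predicts = EXPERIMENTS[name]["predicts"]
    p = sum(PRIORS[h] for h, grows in predicts.items() if grows)
    if p <= 0.0 or p >= 1.0:
        return 0.0  # outcome is certain, so the experiment teaches nothing
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def choose_random(rng):
    """Random strategy: pick any experiment."""
    return rng.choice(sorted(EXPERIMENTS))


def choose_naive():
    """Naive strategy: pick the cheapest experiment."""
    return min(sorted(EXPERIMENTS), key=lambda n: EXPERIMENTS[n]["cost"])


def choose_informed():
    """Stand-in for an intelligent strategy: most information per unit cost."""
    return max(
        sorted(EXPERIMENTS),
        key=lambda n: outcome_entropy(n) / EXPERIMENTS[n]["cost"],
    )


print(choose_naive())     # cheapest, although every hypothesis predicts growth
print(choose_informed())  # splits the probability mass evenly at moderate cost
print(choose_random(random.Random(0)))
```

With these toy numbers the cheapest experiment cannot distinguish any hypotheses, which illustrates why a cost-only strategy can waste resources even though each individual experiment is cheap.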
Accuracy of the Experiment Choosers
  [Figure: two panels comparing ASE, Naive, and Random]
Results of Computer Simulations
  [Figure: two panels, No noise and Noise; legend entries ASE, Naive, Random]
Conclusions
  Scientific process can be automated
  Experiment selection strategies have significant impact on cost
  ASE outperforms, in terms of cost:
    – Naïve by 3-fold
    – Random by 100-fold
  Performance is competitive with human
  Cost-effectiveness of science can be improved
Future Work
  Extend system to uncover function of other metabolic genes
  Would need to:
    – Extend model to entire biochemical pathway in KEGG
    – Become more robust in terms of possible errors in KEGG
    – Include prediction of previously unknown enzymes
Criticisms
  Downplays how little of the pathway was actually tested
  Not clear how deletion mutants were chosen
  No example of an experiment cycle
  Too large a jump from theory to results
  Results graphs too crowded
Discussion Questions
  Would computer-generated experiments and results be accepted?
  How much would we have to understand about a computer-generated discovery process?
  Compare this system to currently common method of:
    – Large-scale generation of data
    – Extraction of knowledge by data-mining systems
  What other aspects of genome analysis could scientific discovery be applied to?