Robots and Automatic Genome Annotation
Ross D. King
Department of Computer Science
University of Wales, Aberystwyth
Talk Plan
• Data mining based gene function prediction
• The Robot Scientist
• Automating annotation and experimentation
Data Mining Prediction
• We have developed a method for predicting the functional class of gene products based on data mining.
• The idea is to learn a reliable predictive function from examples of genes whose products have a known function.
• This function is then applied to genes whose functional class is unknown.
• Applied to: E. coli, M. tuberculosis, S. cerevisiae, A. thaliana.
• We call this approach Data Mining Prediction (DMP).
Classification schemes (MIPS/GO)
Hierarchy of classes:
1,0,0,0 "METABOLISM"
  1,1,0,0 "amino acid metabolism"
    1,1,1,0 "amino acid biosynthesis"
    1,1,4,0 "regulation of amino acid metabolism"
    1,1,7,0 "amino acid transport"
    1,1,10,0 "amino acid degradation (catabolism)"
    1,1,99,0 "other amino acid metabolism activities"
  1,2,0,0 "nitrogen and sulfur metabolism"
  1,3,0,0 "nucleotide metabolism"
  1,4,0,0 "phosphate metabolism"
  1,5,0,0 "C-compound and carbohydrate metabolism"
  1,6,0,0 "lipid, fatty-acid and isoprenoid metabolism"
  1,7,0,0 "metabolism of vitamins, cofactors, and prosthetic groups"
  1,20,0,0 "secondary metabolism"
  ...
... and ORFs may have multiple functions too!
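The class codes above can be handled as tuples, which makes "predict at level 2" or "predict at level 3" a simple prefix operation. The following is a minimal sketch, not the DMP implementation; the function names and the example ORF with two functions are assumptions made for illustration.

```python
def parse_class(code):
    """Turn a class code such as '1,1,7,0' into a tuple of its non-zero levels."""
    parts = [int(p) for p in code.split(",")]
    # Trailing zeros mean "no further subdivision below this level".
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def ancestors(cls):
    """Return every more general class above cls, followed by cls itself."""
    return [cls[:i] for i in range(1, len(cls) + 1)]


def labels_at_level(orf_classes, level):
    """Project an ORF's (possibly multiple) classes onto one hierarchy level."""
    return {c[:level] for c in orf_classes if len(c) >= level}


# Hypothetical ORF annotated with two functions (the multi-label case).
orf_classes = [parse_class("1,1,7,0"), parse_class("1,5,0,0")]
level_1 = labels_at_level(orf_classes, 1)
level_2 = labels_at_level(orf_classes, 2)
level_3 = labels_at_level(orf_classes, 3)
print(level_1, level_2, level_3)
```

Treating each level separately is one simple way to cope with a class hierarchy and multiple labels at once: an ORF contributes one label set per level.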
Sequence Data
478 attributes in total

field             description                                     type
aa_rat_X          % of amino acid X in the protein                real
seq_len           length of the protein sequence                  int
aa_rat_pair_X_Y   % of the amino acids X and Y consecutively      real
mol_wt            molecular weight of the protein                 int
theo_pI           theoretical pI (isoelectric point)              real
atomic_comp_X     atomic composition of X (C,H,N,O,S)             real
aliphatic_index   aliphatic index                                 real
hydro             grand average of hydropathy                     real
strand            the DNA strand                                  'w' or 'c'
position          the number of exons (no. of start positions)    int
cai               codon adaptation index                          real
motifs            number of PROSITE motifs                        int
tmSpans           number of transmembrane spans                   int
chromosome        chromosome number                               1..16,mit
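As an illustration of how the composition attributes might be derived, here is a minimal sketch that computes seq_len, aa_rat_X and aa_rat_pair_X_Y from a protein sequence. It covers only those three attribute families; taking the pair percentage relative to the number of adjacent positions is an assumption, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_attributes(seq):
    """Compute length, single-residue and adjacent-pair percentages (needs len >= 2)."""
    seq = seq.upper()
    n = len(seq)
    attrs = {"seq_len": n}

    # Percentage of each amino acid in the protein.
    single = Counter(seq)
    for aa in AMINO_ACIDS:
        attrs[f"aa_rat_{aa}"] = 100.0 * single[aa] / n

    # Percentage of each ordered pair of consecutive amino acids
    # (assumption: relative to the n - 1 adjacent positions).
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            attrs[f"aa_rat_pair_{a}_{b}"] = 100.0 * pairs[a + b] / (n - 1)
    return attrs


# Start of the YAL001C sequence as shown in the talk.
fragment = "mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk"
attrs = sequence_attributes(fragment)
print(attrs["seq_len"], round(attrs["aa_rat_K"], 1))
```

This yields 1 + 20 + 400 = 421 attributes; the remaining attributes in the table (molecular weight, pI, codon adaptation index and so on) would need other tools.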
Homology data
YAL001C: mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdk....
[Diagram: the sequence is searched with PSI-BLAST against the sequence database NRDB.]

gene      organism           score
tfc       baker's yeast      0.0
sfc3      fission yeast      1.0e-18
wsv442    white spot virus   2.1
cg9463    fruit fly          2.9
f1l3      Arabidopsis        3.0

We look up the associated information from SwissProt:
sfc3: keyword(membrane) length(358) dbref(prosite) dbref(embl)
Predicted Secondary Structure Data
mvltiypdelvqivsdkiasnkgkitlnqlwdisgkyfdlsdkkvk...
cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb...
We record length and relative positions of the secondary structure elements.
This is relational data.
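Recording the length and relative position of each element amounts to run-length encoding the prediction string. A minimal sketch follows, assuming a, b and c stand for alpha, beta and coil; the function names are hypothetical and this is not the code used in DMP.

```python
from itertools import groupby


def structure_elements(ss):
    """Run-length encode a structure string into (type, start, length) tuples."""
    elements = []
    start = 0
    for kind, run in groupby(ss):
        length = len(list(run))
        elements.append((kind, start, length))
        start += length
    return elements


def followed_by(elements, first, second):
    """Return adjacent element pairs such as (coil, alpha): a simple relational fact."""
    return [(x, y) for x, y in zip(elements, elements[1:])
            if x[0] == first and y[0] == second]


# Prediction string as shown in the talk.
predicted = "cbbbbccaaaaaaaaaaaacccccbbbbaaaaaacccbbccccccb"
elements = structure_elements(predicted)
for coil, alpha in followed_by(elements, "c", "a"):
    print(f"coil of length {coil[2]} followed by alpha of length {alpha[2]}")
```

Facts of the form "coil of length L1 followed by alpha of length L2" vary in number from protein to protein, which is why this data does not fit a single fixed-width table.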
Expression Data
Spellman et al (1998), Roth et al (1998), DeRisi et al (1997), Eisen et al (1998), Gasch et al (2000, 2001), Chu et al (1998)
Microarray experiments to measure expression changes in yeast under a variety of conditions, including cell cycle, heat shock, diauxic shift.
Short time series data, numerical-valued

ORF        a 0     a 7     a 14    a 21
YBR166C    0.33   -0.17    0.04   -0.07
YOR357C   -0.64   -0.38   -0.32   -0.29
YLR292C   -0.23    0.19   -0.36    0.14
YGL112C   -0.69   -0.89   -0.74   -0.56
...
Phenotype Data
Data from knockout gene growth experiments
Many missing data
Data taken from 3 sources (TRIPLES, MIPS, EUROFAN)
s = sensitive (less growth)
w = wild-type (no observable effect)
r = resistant (more growth)
n = no data
[Table: deleted ORF (YAL001C, YAL019W, YAL021C, YAL029C, ...) against growth medium (calcofluor white, sorbitol, benomyl, H2O2, ...); each cell holds one of s, w, r, n.]
What are the Machine Learning Issues?
• Large volume of data
• Missing data
• Accurate results required
• Intelligible results required
• Class hierarchy
• Multiple labels
• Relational data
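Data Mining Prediction (DMP)
[Flow diagram: the entire database is split 2/3 : 1/3 into data for rule creation and test data. The data for rule creation is split 2/3 : 1/3 into training data and validation data. Rule generation (PolyFARM, C4.5) on the training data gives all rules; the validation data is used to select the best rules; rule accuracy is measured on the test data to give the results.]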
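The nested 2/3 : 1/3 partition can be sketched as below. This is an illustration only: the ORF identifiers are hypothetical, the random shuffling is an assumption, and which third plays which role is read from the diagram labels rather than from the DMP code.

```python
import random


def split(items, fraction, rng):
    """Shuffle a copy of items and cut it into (first part, remainder)."""
    items = list(items)
    rng.shuffle(items)
    cut = round(len(items) * fraction)
    return items[:cut], items[cut:]


rng = random.Random(0)                        # fixed seed for repeatability
orfs = [f"ORF{i:04d}" for i in range(900)]    # hypothetical identifiers

# Outer split: data for rule creation versus held-out test data.
rule_creation, test = split(orfs, 2 / 3, rng)
# Inner split: training data for rule generation, validation data for rule selection.
training, validation = split(rule_creation, 2 / 3, rng)

print(len(training), len(validation), len(test))
```

Selecting rules on the validation data and reporting accuracy only on the test data keeps the reported accuracy independent of rule selection.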
Application to Bacterial Genomes
• Successful for both M. tuberculosis and E. coli.
• Of the ORFs with no assigned function, >40% were predicted to have a function at one or more levels of the class hierarchy.
• It was found that many of the predictive rules were more general than is possible using sequence homology.
References
King et al. (2000) KDD 2000
King et al. (2000) Yeast (Comparative and Functional Genomics)
King et al. (2001) Bioinformatics
Summary Results (Bacteria)
• Using voting (2 or more rules agree on a prediction)
  – Level 2: 128 ORFs predicted - 87.5% accuracy
  – Level 3: 23 ORFs predicted - 91.3% accuracy
• All predictions
  – Level 2: 335 ORFs predicted - 64.5% accuracy
  – Level 3: 204 ORFs predicted - 44.6% accuracy
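The voting criterion (a prediction is kept only when 2 or more rules agree) can be sketched as follows. The rule identifiers and class names are hypothetical, and counting each rule at most once per class is an assumption.

```python
from collections import Counter


def vote(rule_predictions, min_rules=2):
    """Keep the classes predicted by at least min_rules distinct rules.

    rule_predictions is a list of (rule_id, predicted_class) pairs for one ORF.
    """
    distinct = set(rule_predictions)              # one vote per (rule, class) pair
    counts = Counter(cls for _, cls in distinct)
    return {cls for cls, n in counts.items() if n >= min_rules}


# Hypothetical rules firing for one ORF of unknown function.
fired = [
    ("rule_17", "transport"),
    ("rule_42", "transport"),
    ("rule_42", "transport"),   # duplicate firing of the same rule is ignored
    ("rule_08", "energy"),
]
accepted = vote(fired)
print(accepted)
```

Voting trades coverage for accuracy, which matches the pattern in the figures above: fewer ORFs receive a prediction, but the predictions are more often correct.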
Example Rule (level 2 E. coli)
If
  the ORF is not predicted to have a β-strand of length 3
  and a homologous protein from class Chytridiomycetes was found
Then
  its functional class is "Cell processes, Transport/binding proteins"
12/13 (86%) correct on Test Set - probability of this result occurring by chance is estimated at 4 × 10^-7.
24 ORFs of unknown function are predicted by the rule.
16 ORFs now with putative or confirmed function - 93.8% accurate predictions
Experimental Confirmation
• The original bacterial ORF predictions were made over three years ago.
• In the intervening time many more ORFs have been sequenced, making traditional homology-based prediction methods more accurate and sensitive, and the function of some ORFs has been determined by wet biology.
• The E. coli genome has recently been re-annotated by Monica Riley's group.
Wet Biology Confirmation
• A number of predictions have been confirmed or falsified by new wet experimental data.
• These new data are biased towards hard classes. Despite this the results are still good:
  – Level 2: 23 predictions - 47.8% accuracy
  – Level 3: 23 predictions - 43.4% accuracy
This is very much better than random as there are many classes.
Results (Yeast)
• Many rules from each data type
• Rules at each level of hierarchy
• Some classes are much easier to predict than others (for example "protein synthesis" at 71-93%, "energy" at 20-47%)
• Good levels of accuracy on held-out test data
• Many predictions for ORFs of unknown function (some function at some level is predicted for 96% of the ORFs of unknown function)
• Some rules explainable by biology -> scientific knowledge discovery
Clare & King (2003) Bioinformatics suppl. 2, 42-49
Extension to Arabidopsis Genome
• Collaborative project with the Institute of Grassland and Environmental Research and the University of Nottingham.
• Large increase in data: 6,000 -> 25,000 ORFs. Large amount of micro-array data from the Nottingham Arabidopsis Stock Centre.
• 250 million Prolog facts, 200,000 attributes, file sizes of almost 2Gb.
• 7,964 gene function predictions with an expected accuracy >70%, 2,974 with an expected accuracy >90%.
• We are currently growing 14 knockout varieties of Arabidopsis to test a sample of these predictions.
The Robot Scientist Concept
[Diagram: a cycle linking Background Knowledge, Machine Learning, Analysis, Consistent Hypothesis, Experiment(s) selection, Robot, Experiment(s), Results, and Final Theory.]
The robot scientist project aims to develop a computer system that is capable of originating its own experiments, physically doing them, interpreting the results, and then repeating the cycle.
Motivation: Technological
• In many areas of science our ability to generate data is outstripping our ability to analyse the data.
• One scientific area where this is true is functional genomics, where data is now being generated on an industrial scale.
• The analysis of scientific data needs to become as industrialised as its generation.
The Application Domain
• Functional genomics
• In yeast (S. cerevisiae) ~30% of the 6,000 genes still have no known function.
• EUROFAN 2 has knocked out each of the 6,000 genes in mutant strains.
• Task: to determine the function of a gene by auxotrophic growth experiments comparing mutants and wild type.
Logical Cell Model
• We have built a logical model of the known metabolic pathways (coded in Prolog), taken from KEGG and other bioinformatic sources. This is essentially a directed graph, with metabolites as nodes and enzymes as arcs.
• If a path can be found from cell inputs (metabolites in the growth medium) to all the cell outputs (essential compounds), then the cell can grow.
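The growth test is a reachability check on the graph. The original model is coded in Prolog; the following is only a Python sketch of the same idea on a toy pathway. The collapsed reaction list and the gene names GENE_A to GENE_E are hypothetical; YDR060C is placed on chorismate → prephenate as in the talk's example hypothesis.

```python
def reachable(medium, reactions, knocked_out=frozenset()):
    """Return every metabolite reachable from the growth medium.

    reactions is a list of (substrate, product, gene) arcs; arcs whose gene
    is knocked out are ignored.
    """
    known = set(medium)
    changed = True
    while changed:
        changed = False
        for substrate, product, gene in reactions:
            if gene in knocked_out:
                continue
            if substrate in known and product not in known:
                known.add(product)
                changed = True
    return known


def can_grow(medium, reactions, essential, knocked_out=frozenset()):
    """The cell grows if all essential compounds are reachable."""
    return set(essential) <= reachable(medium, reactions, knocked_out)


# Toy, heavily collapsed version of the aromatic amino acid pathway.
REACTIONS = [
    ("glycerate-2-phosphate", "chorismate", "GENE_A"),
    ("chorismate", "prephenate", "YDR060C"),
    ("chorismate", "anthranilate", "GENE_B"),
    ("prephenate", "tyrosine", "GENE_C"),
    ("prephenate", "phenylalanine", "GENE_D"),
    ("anthranilate", "tryptophan", "GENE_E"),
]
ESSENTIAL = {"tyrosine", "phenylalanine", "tryptophan"}
BASIC = {"glycerate-2-phosphate"}

print(can_grow(BASIC, REACTIONS, ESSENTIAL))                        # wild type
print(can_grow(BASIC, REACTIONS, ESSENTIAL, {"YDR060C"}))           # knockout
print(can_grow(BASIC | {"prephenate"}, REACTIONS, ESSENTIAL, {"YDR060C"}))
```

In this toy model the knockout fails to grow on the basic medium and recovers when prephenate is added, which is the pattern an auxotrophy experiment looks for.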
AAA Model System
• We started using the aromatic amino acid (AAA) pathway in yeast as a model system to prove the principle of the Robot Scientist.
• 9 metabolites can be used off the shelf
• 15 knockout mutants from EUROFAN
• The mutant can grow iff all three aromatic amino acids can be synthesised (tyrosine, phenylalanine, tryptophan). Based on a pathway from glycerate-2-phosphate.
Experimental Methodology
• Experiments consist of making particular growth media (adding metabolites to a basic defined medium) and testing whether the mutants can grow.
• A mutant is auxotrophic if it cannot grow on a defined medium that the wild type can grow on.
• By observing the pattern of chemicals that recover growth, the function of the knocked-out gene can be inferred.
Inferring Hypotheses
• In the philosophy of science it has often been argued that only humans can make the leaps of imagination necessary to form hypotheses.
• We use Abductive Logic Programming to infer missing arcs/labels in our metabolic graph. With these inferred arcs/labels we can explain (deductively) all the experimental results.
Reiser et al. (2001) ETAI 5, 233-244
The Form of the Hypotheses
• The form of the hypotheses we can infer is currently quite simple. Each hypothesis binds a particular gene to an enzyme that catalyses a reaction.
  – A correct hypothesis would be that YDR060C codes for the enzyme for the reaction chorismate → prephenate.
  – An incorrect hypothesis would be that it codes for the enzyme for the reaction chorismate → anthranilate.
• We have also demonstrated how more complex abductive hypotheses could be formed.
A Discriminating Experiment
• Hypothesis 1: YDR060C codes for the enzyme for the reaction chorismate → prephenate.
• Hypothesis 2: YDR060C codes for the enzyme for the reaction chorismate → anthranilate.
• These can be distinguished by growing the knockout YDR060C on prephenate or anthranilate.
• Note that these two experiments will have differing monetary cost.
Inferring Experiments
Given a set of hypotheses, we wish to infer an experiment that will efficiently discriminate between them.
Assume:
• Every experiment has an associated cost.
• Each hypothesis has a probability of being correct.
The task:
• To choose a series of experiments which minimises the expected cost of eliminating all but one hypothesis.
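To make the setting concrete, here is a small sketch of cost-aware experiment selection. It does not reproduce the expected-cost function used by the Robot Scientist; it swaps in a simple greedy stand-in that picks the experiment with the lowest cost per expected bit of information. The prior probabilities, costs and experiment names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (bits) of a set of weights, normalised to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)


def expected_information(priors, outcomes):
    """Expected entropy reduction of an experiment with a yes/no growth result.

    priors maps hypothesis -> probability; outcomes maps hypothesis -> the
    growth result (True/False) that the hypothesis predicts.
    """
    total = sum(priors.values())
    before = entropy(list(priors.values()))
    after = 0.0
    for result in (True, False):
        group = [p for h, p in priors.items() if outcomes[h] == result]
        if group:
            after += (sum(group) / total) * entropy(group)
    return before - after


def choose_experiment(priors, experiments):
    """Greedy stand-in: lowest cost per expected bit; skips uninformative experiments."""
    best, best_score = None, math.inf
    for name, (cost, outcomes) in experiments.items():
        gain = expected_information(priors, outcomes)
        if gain <= 0:
            continue
        score = cost / gain
        if score < best_score:
            best, best_score = name, score
    return best


# H1: YDR060C codes for chorismate -> prephenate; H2: chorismate -> anthranilate.
priors = {"H1": 0.5, "H2": 0.5}
experiments = {
    # name: (hypothetical cost, growth predicted under each hypothesis)
    "add prephenate": (5.0, {"H1": True, "H2": False}),
    "add anthranilate": (1.0, {"H1": False, "H2": True}),
    "basic medium only": (0.5, {"H1": False, "H2": False}),
}
chosen = choose_experiment(priors, experiments)
print(chosen)
```

Both discriminating experiments separate the two hypotheses completely, so the cheaper one is chosen; the cheapest experiment overall is skipped because both hypotheses predict the same result for it.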
Comparison of different experimental strategies
• ASE - Expected cost minimization.
• Naïve - Choose cheapest experiment.
• Random - Randomly choose experiments.
The cost of a series of experiments is a function of the time taken and money spent. Time is Money.
Closing the Loop
• We have physically implemented all aspects of the Robot Scientist system.
• To the best of our knowledge this is the first active learning system that both explicitly forms hypotheses and experiments, and physically does real experiments.
Accuracy v Time At the end of the 5th iteration: ASE 80.1%, Naïve 74.0%, Random 72.2%. ASE was significantly more accurate than either Naïve (p < 0.05) or Random (p < 0.07) using a paired t-test.
Accuracy v Money
Given a spend of £10^2.26: ASE 79.5%, Naïve 73.9%, Random 57.4%. ASE was significantly more accurate than either Naïve (p < 0.05) or Random (p < 0.001).
Time and Money
• Cost is a positive function of time & money. ASE dominates for both, therefore ASE dominates for any reasonable cost function.
• For example: to achieve an accuracy of ~70%, ASE requires fewer trial iterations, and a hundredth of the price, of Random; and almost half the number of iterations, and a third of the price, of Naïve.
King et al. (2004) Nature 427, 247-252.
Human Comparisons
• We were interested to compare the performance of the Robot Scientist with that of humans.
• We adapted the simulator to allow humans to choose experiments and interpret the results of cycles of experimentation.
• Compared nine graduate computer scientists and biologists.
• No significant difference between the best humans and the Robot.
New Biological Knowledge
• So far with the Robot Scientist we have only shown that we can automatically rediscover known biological knowledge.
• We wish to extend this result to the discovery of new biological knowledge.
• To do this we need to combine the robot scientist with conventional genome annotation bioinformatics, and DMP.
Robotic Annotation
• One way of thinking about genome annotation is as a hypothesis formation process.
• Hypothesis formation is perhaps the hardest part of automating science.
• Our idea is to incorporate bioinformatic annotation methods with genome annotation.
• The bioinformatic methods will generate the hypotheses which the robot scientist will experimentally test.
Genome Scale Model of Yeast Metabolism
• We have extended our model of aromatic amino acid metabolism to cover most of what is known about yeast metabolism.
• Includes 1,166 ORFs (940 known, 226 inferred)
• Growth if path from growth medium to defined end-points.
• 83% accuracy (based on 914 strain/medium predictions)
The Model is Incomplete
• It is not possible to find a path from the inputs (growth medium) to all the end-point metabolites using only reactions encoded by known genes.
• This suggests automated strategies for determining the identity of the missing genes - new biological knowledge.
• One strategy is based on the EC enzyme class of the missing reactions: identify genes that code for this EC class in other organisms, then find homologous genes in yeast.
• The predictions can be tested automatically by robot.
Confirmation of DMP Yeast Predictions
• The yeast gene YBR147W is currently of unknown function.
• It is predicted to have a function in metabolism by 2 DMP rules with expected accuracies of >80%.
• It is predicted to have a function in amino-acid metabolism by two rules with expected accuracies of 50% and 60% respectively.
• Using our robot scientist auxotrophic methodology we have recovered growth of the knockout with: aspartic acid, tyrosine, leucine, valine, phenylalanine, cystine, arginine.
Conclusions
• Machine learning can be used to accurately predict gene function.
• Simple forms of scientific reasoning and experimentation can be fully automated.
• To develop robotic systems capable of generating new biological knowledge will require a synthesis of traditional genome annotation techniques, machine learning, and a Robot Scientist-like methodology.
The Three Objects of the Intellect The True The Beautiful The Beneficial
Acknowledgements
DMP
• Andreas Karwath, Aberystwyth
• Amanda Clare, Aberystwyth
• Paul Wise, Aberystwyth
• Luc Dehaspe, Leuven
Robot Scientist
• Ken Whelan, Aberystwyth
• Philip Reiser, Aberystwyth
• Ffion Jones, Aberystwyth
• Ugis Sarkans, Aberystwyth (EBI)
• Douglas Kell, Manchester (Aberystwyth)
• Steve Oliver, Manchester
• Stephen Muggleton, Imperial College (York)
• Chris Bryant, Robert Gordons (York)
• David Page, Wisconsin
BBSRC, EPSRC
PharmDM - Commercial Support
Relational vs Propositional
Propositional: single table, fixed number of columns/attributes

orf       time0   time7   time14
yal001c   0.34    0.52    0.48
yal002w   0.76    0.82    0.89
yal003w   0.77    0.46    0.78
yal004c   0.38    0.50    0.49

Relational: multiple tables, multiple values

orf       SwissProtID   e-val
yal001c   p03415        2e-4
yal001c   p08640        8e-58
yal002w   p32583        6e-52
yal002w   p08775        3e-42

SwissProtID   keyword
p03415        apoptosis
p03415        repeat
p03415        zinc
p08640        membrane
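The difference shows up directly in how the data is held and queried. A minimal sketch using the values from this slide follows; the e-value threshold and the function name are assumptions made for illustration.

```python
# Propositional: one row per ORF, a fixed number of columns.
expression = {
    "yal001c": [0.34, 0.52, 0.48],
    "yal002w": [0.76, 0.82, 0.89],
    "yal003w": [0.77, 0.46, 0.78],
    "yal004c": [0.38, 0.50, 0.49],
}

# Relational: several tables, any number of rows per ORF.
homology = [
    ("yal001c", "p03415", 2e-4),
    ("yal001c", "p08640", 8e-58),
    ("yal002w", "p32583", 6e-52),
    ("yal002w", "p08775", 3e-42),
]
keywords = [
    ("p03415", "apoptosis"),
    ("p03415", "repeat"),
    ("p03415", "zinc"),
    ("p08640", "membrane"),
]


def keywords_for(orf, max_evalue=1e-3):
    """Join the two relational tables: keywords of the ORF's homologues."""
    hits = {sp for o, sp, e in homology if o == orf and e <= max_evalue}
    return {kw for sp, kw in keywords if sp in hits}


print(keywords_for("yal001c"))
print(keywords_for("yal002w"))
```

The number of keywords returned varies from ORF to ORF, so it cannot be stored in a fixed set of columns without some form of aggregation.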
Expression Data Rule
If
  in the micro-array experiment (sorbitol incubation) the ORF expression is > -0.25
  and in the micro-array experiment (nitrogen depletion) the ORF expression is <= -1.29
  and in the micro-array experiment (YPD stationary phase) the ORF expression is > -1.06
then the function of this ORF is "pheromone response, mating type determination, sex-specific proteins"
Accuracy on training data: 11/12 (92%)
Accuracy on the test data: 3/4 (75%)
21 predictions made
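A rule of this kind is a conjunction of threshold tests, so it can be written as a simple predicate. In the sketch below the thresholds are those of the rule; the dictionary keys and the two example expression profiles are hypothetical.

```python
PREDICTED_CLASS = "pheromone response, mating type determination, sex-specific proteins"


def pheromone_rule(expr):
    """Return True if the ORF's expression values satisfy all three conditions."""
    return (
        expr["sorbitol incubation"] > -0.25
        and expr["nitrogen depletion"] <= -1.29
        and expr["YPD stationary phase"] > -1.06
    )


# Hypothetical expression profiles for two ORFs.
orf_a = {"sorbitol incubation": 0.10, "nitrogen depletion": -1.50, "YPD stationary phase": -0.30}
orf_b = {"sorbitol incubation": 0.10, "nitrogen depletion": -1.00, "YPD stationary phase": -0.30}

for name, expr in (("orf_a", orf_a), ("orf_b", orf_b)):
    if pheromone_rule(expr):
        print(name, "->", PREDICTED_CLASS)
    else:
        print(name, "-> rule does not fire")
```

Because the rule is a readable list of conditions, a biologist can inspect it directly, which is the sense in which intelligible results are required.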
Structure Rule
80% accurate on test data
Most matching ORFs belong to the Mitochondrial Carrier Family
These have 6 long transmembrane alpha-helices of about 20-30 amino acids
Why do we notice alpha-helices of length 10-14?
If
  true: coil (of length 3) followed by alpha (10 <= length < 14)
  and true: coil (of length 1 or 2) followed by alpha (10 <= length < 14)
  and true: coil (of length 3) followed by alpha (3 <= length < 6)
  and false: coil followed by beta followed by coil (c-b-c)
  and false: coil (6 <= length < 10) followed by alpha (of length 1 or 2)
then the function of this ORF is "mitochondrial transport"
Types of Logic
Deduction
  Rule: If a cell grows, then it can synthesise tryptophan.
  Fact: Cell cannot synthesise tryptophan.
  Therefore: Cell cannot grow.
  Given the rule P → Q, and the fact ¬Q, infer the fact ¬P (modus tollens)
Abduction
  Rule: If a cell grows, then it can synthesise tryptophan.
  Fact: Cell cannot grow.
  Therefore: Cell cannot synthesise tryptophan.
  Given the rule P → Q, and the fact ¬P, infer the fact ¬Q