Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and Center for the Study of Language and Information Stanford University,

Computational Discovery of Communicable Scientific Knowledge
Pat Langley
Institute for the Study of Learning and Expertise, Palo Alto, California
and Center for the Study of Language and Information, Stanford University, Stanford, California
http://www.isle.org/~langley
langley@isle.org
Thanks to S. Bay, V. Brooks, S. Klooster, A. Pohorille, C. Potter, K. Saito, J. Shrager, M. Schwabacher, and A. Torregrosa.

Motivations for Computational Discovery
Humans strive to discover new knowledge from experience so that they can:
- better predict and control future events
- understand both previous and future events
- communicate that understanding to others
Computational techniques should let us automate and/or assist this discovery process.
Recent research on computer-aided discovery has focused on some of these issues but downplayed others.

The Data Mining Paradigm
One computational discovery paradigm, known as data mining or KDD, can be best characterized as:
- emphasizing the availability of vast amounts of data;
- drawing on heuristic search methods to find regularities in these data;
- using formalisms like decision trees, association rules, and Bayes nets to describe those regularities.
Thus, most KDD researchers favor their own formalisms over those used by scientists and engineers.
As a result, their discoveries are seldom very communicable to members of those communities.

Myths About Understandability
Within the data mining paradigm, one quite popular myth is that:
- decision trees and rules are inherently understandable because logical formalisms are easier to interpret than other notations.
However, Kononenko found that doctors felt that naïve Bayesian classifiers were easier to interpret than decision trees.
Conclusion: Any formalism's understandability depends on the interpreter's familiarity with that formalism.

Myths About Understandability
Another popular myth in the data mining community is that:
- connectionist methods produce results that are opaque because the set of weights they learn cannot be easily interpreted.
However, Saito and Nakano (1997) have shown that one can use such methods to discover explicit numeric equations.
Conclusion: Understandability depends on the resulting formalism, not on the search method used to discover knowledge.

Computational Scientific Discovery
An older paradigm, computational scientific discovery, can be characterized as:
- drawing on heuristic search to find regularities in scientific data, either historical or novel;
- using formalisms like numeric laws, structural models, and reaction pathways to describe regularities.
Thus, researchers in this framework favor representations used by scientists and engineers.
As a result, their systems' discoveries are more communicable to members of those communities.

Time Line for Research on Computational Scientific Discovery
[Figure: timeline from 1979 to 2000 placing discovery systems by year. Legend: numeric laws, qualitative laws, structural models, process models. Systems shown: Bacon.1–Bacon.5, Abacus, Coper, Fahrenheit, E*, Tetrad, IDS_N, Hume, ARC, DST, GP_N, LaGrange, SDS, SSF, RF5, LaGramge, Dalton, Stahl, RL, Progol, Gell-Mann, BR-3, Mendel, Pauli, Stahlp, Revolver, Dendral, AM, Glauber, NGlauber, IDS_Q, Live, IE, Coast, Phineas, AbE, Kekada, Mechem, CDP, Astra, GP_M, HR, BR-4.]

Successes of Computational Scientific Discovery
Over the past decade, systems of this type have helped discover new knowledge in many scientific fields:
- stellar taxonomies from infrared spectra (Cheeseman et al., 1989)
- qualitative chemical factors in mutagenesis (King et al., 1996)
- quantitative laws of metallic behavior (Sleeman et al., 1997)
- qualitative conjectures in number theory (Colton et al., 2000)
- temporal laws of ecological behavior (Todorovski et al., 2000)
- reaction pathways in catalytic chemistry (Valdes-Perez, 1994, 1997)
Each of these has led to publications in the refereed literature of the relevant scientific field (see Langley, 2000).

The Developer's Role in Computational Discovery
[Figure: diagram of the developer's activities, labeled problem formulation, representation engineering, data manipulation, algorithm manipulation, filtering and interpretation, and algorithm invocation.]

Themes of the Research
We aim to extend previous approaches to computational scientific discovery by:
- generating explanations that involve hidden objects/variables
- revising existing models rather than starting from scratch
- drawing on domain knowledge to constrain the search process
- developing interactive discovery tools for use by scientists
Two promising fields in which to pursue this research agenda are Earth science and molecular biology.
As in earlier work, the notation for discovered knowledge will be the same as that used by domain scientists.

Some Interesting Questions in Earth Science
- What environmental variables determine the production of carbon and the generation of various gases?
- What functional forms relate these predictive variables to the ones they influence?
- How do extreme values of these variables affect behavior of the ecosystem?
- Are the Earth ecosystem parameters constant or have values changed in recent years?

The Task of Ecological Model Revision
Given: A model of Earth's ecosystem (CASA) stated as equations that involve observable and hidden variables.
Given: Observations about numeric variables (rainfall, sunlight, temperature, NPPc) as they change over space and time.
Given: Inferred values for global parameters and intrinsic properties associated with discrete variables (e.g., ground cover).
Find: A revised ecosystem model with altered equations and/or parametric values that fits the data better.

The NPPc Portion of CASA
$\text{NPPc} = \sum_{\text{month}} \max(E \cdot \text{IPAR},\ 0)$
$E = 0.56 \cdot T1 \cdot T2 \cdot W$
$T1 = 0.8 + 0.02 \cdot \text{Topt} - 0.0005 \cdot \text{Topt}^2$
$T2 = 1.18 / [(1 + e^{0.2 \cdot (\text{Topt} - \text{Tempc} - 10)}) \cdot (1 + e^{0.3 \cdot (\text{Tempc} - \text{Topt} - 10)})]$
$W = 0.5 + 0.5 \cdot \text{EET} / \text{PET}$
$\text{PET} = 1.6 \cdot (10 \cdot \text{Tempc} / \text{AHI})^{A} \cdot \text{PET-TW-M}$ if $\text{Tempc} > 0$
$\text{PET} = 0$ if $\text{Tempc} < 0$
$A = 0.00000068 \cdot \text{AHI}^3 - 0.000077 \cdot \text{AHI}^2 + 0.018 \cdot \text{AHI} + 0.49$
$\text{IPAR} = 0.5 \cdot \text{FPAR-FAS} \cdot \text{Monthly-Solar} \cdot \text{Sol-Conver}$
$\text{FPAR-FAS} = \min[(\text{SR-FAS} - 1.08) / \text{SR}(\text{UMD-VEG}),\ 0.95]$
$\text{SR-FAS} = (\text{Mon-FAS-NDVI} + 1000) / (\text{Mon-FAS-NDVI} - 1000)$
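The equations above can be read as a small chain of functions. The following is a minimal sketch, not the CASA implementation: it covers only the E, T1, T2, W, PET, and A terms, reads "month max" as a sum over months, assumes PET is positive wherever W is evaluated, and uses made-up input values. The function and argument names are illustrative.

```python
import math

def t1(topt):
    # Temperature stress term T1.
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    # Temperature stress term T2.
    return 1.18 / ((1 + math.exp(0.2 * (topt - tempc - 10)))
                   * (1 + math.exp(0.3 * (tempc - topt - 10))))

def a_exponent(ahi):
    # Exponent A used in the PET equation.
    return 0.00000068 * ahi ** 3 - 0.000077 * ahi ** 2 + 0.018 * ahi + 0.49

def pet(tempc, ahi, pet_tw_m):
    # The slide does not define PET at exactly Tempc = 0; this sketch returns 0 there.
    if tempc > 0:
        return 1.6 * (10 * tempc / ahi) ** a_exponent(ahi) * pet_tw_m
    return 0.0

def w(eet, pet_value):
    # Water stress term W; assumes pet_value > 0 (W is undefined when PET = 0).
    return 0.5 + 0.5 * eet / pet_value

def efficiency(topt, tempc, eet, pet_value):
    # Photosynthetic efficiency E.
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet_value)

def nppc(months):
    # Sum max(E * IPAR, 0) over the monthly records.
    return sum(
        max(efficiency(m["topt"], m["tempc"], m["eet"], m["pet"]) * m["ipar"], 0.0)
        for m in months
    )

# Made-up inputs for one grid cell: twelve identical months.
example_months = [dict(topt=20.0, tempc=20.0, eet=1.0, pet=1.0, ipar=100.0)] * 12
example_nppc = nppc(example_months)
```

With these inputs T1 and W both equal 1, so E is governed by T2 alone.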

The NPPc Portion of CASA
[Figure: dependency graph of the NPPc submodel, with nodes NPPc, E, IPAR, e_max, T1, T2, W, EET, PET, Tempc, Topt, NDVI, SOLAR, AHI, A, PET TWM, SR, FPAR, and VEG.]

Improving the NPPc Portion of CASA
One way to improve the NPPc model's fit to observed data is to:
1. Transform the model into a multilayer neural network that makes the same predictions.
2. Identify portions of the model that are candidates for revision.
3. Use an error-driven connectionist learning algorithm to revise those portions of the model.
4. Transform the revised multilayer network back into numeric equations using the improved components.
This approach is similar to Towell's (1991) method for revising qualitative models.

The RF6 Discovery Algorithm
Saito and Nakano (2000) describe RF6, a discovery system that:
1. Creates a multilayer neural network that links predictive with predicted variables using additive and product units.
2. Invokes the BPQ algorithm to search through the weight space defined by this network.
3. Transforms the resulting network into a polynomial equation of the form $y = \sum_i c_i \prod_j x_j^{d_{ij}}$.
They have shown this approach can discover an impressive class of numeric equations from noisy data.
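To make the link between such a network and the equation form concrete, here is a minimal sketch of a forward pass in which each hidden product unit computes prod_j x_j^d_ij as exp(sum_j d_ij · ln x_j) and an additive output unit sums the results weighted by c_i. This illustrates the representation only, not RF6 or BPQ; the function name and example values are assumptions, and all inputs must be positive for the logarithm to be defined.

```python
import math

def product_unit_network(x, coeffs, exponents):
    """Evaluate y = sum_i c_i * prod_j x_j ** d_ij for positive inputs x."""
    y = 0.0
    for c_i, d_i in zip(coeffs, exponents):
        # Product unit: a weighted sum in log space, then exponentiation.
        hidden = math.exp(sum(d_ij * math.log(x_j) for d_ij, x_j in zip(d_i, x)))
        # Additive output unit.
        y += c_i * hidden
    return y

# Illustrative weights encoding y = 1.5 * x1^2 * x2 - 0.5 * x2^0.5
example_y = product_unit_network(
    x=[2.0, 3.0],
    coeffs=[1.5, -0.5],
    exponents=[[2.0, 1.0], [0.0, 0.5]],
)
```

Because the exponents are ordinary weights in log space, reading them back out of a trained network gives the equation directly.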

Three Facets of Model Revision
We have adapted RF6 to revise an existing quantitative model in three distinct ways:
- Altering the value of parameters in a specified equation;
- Changing the associated values for an intrinsic property; and
- Replacing the equation for a term with another expression.
Rather than initializing weights randomly, the system starts with weights based on parameters in the original model.
We have applied this strategy to revise six different portions of the NPPc submodel.
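The idea of starting from the original model's parameters rather than random weights can be illustrated with a toy example. The sketch below uses plain gradient descent on squared error, not the BPQ algorithm, to revise a single coefficient in y = c · x. The data are synthetic, and the starting value 0.56 is borrowed from the E equation purely for illustration.

```python
def revise_coefficient(c_initial, xs, ys, learning_rate=0.1, steps=200):
    """Error-driven revision of c in y = c * x, starting from the model's own value."""
    c = c_initial
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to c.
        grad = 2.0 / n * sum((c * x - y) * x for x, y in zip(xs, ys))
        c -= learning_rate * grad
    return c

# Synthetic observations generated with c = 0.52 (an assumed value).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.52 * x for x in xs]
revised_c = revise_coefficient(0.56, xs, ys)
```

Starting near a plausible value keeps the search close to the original model, which makes the revised parameter easy to compare with the one it replaces.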

Altering Parameters in the NPPc Model
Initial model:
$T2 = 1.18 / [(1 + e^{0.2 \cdot (\text{Topt} - \text{Tempc} - 10)}) \cdot (1 + e^{0.3 \cdot (\text{Tempc} - \text{Topt} - 10)})]$
Cross-validated RMSE = 467.910
Behavior: Gaussian-like function of temperature difference.
Revised model:
$T2 = 1.80 / [(1 + e^{0.05 \cdot (\text{Topt} - \text{Tempc} - 10.8)}) \cdot (1 + e^{0.3 \cdot (\text{Tempc} - \text{Topt} - 90.33)})]$
Cross-validated RMSE = 461.466 [one percent reduction]
Behavior: nearly flat function in actual range of temperature difference.
Conclusion: The T2 temperature stress term contributes little to the overall predictive ability of the NPPc submodel.
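The change in shape can be checked by evaluating both versions of T2 over a range of temperature differences. The sketch below does this for Topt − Tempc between −20 and 20, a range chosen here for illustration (the slides do not state the actual range), and compares the ratio of the largest to the smallest value for each version.

```python
import math

def t2(diff, num, k1, s1, k2, s2):
    """T2 as a function of diff = Topt - Tempc, with the equation's five constants."""
    return num / ((1 + math.exp(k1 * (diff - s1)))
                  * (1 + math.exp(k2 * (-diff - s2))))

INITIAL = dict(num=1.18, k1=0.2, s1=10.0, k2=0.3, s2=10.0)
REVISED = dict(num=1.80, k1=0.05, s1=10.8, k2=0.3, s2=90.33)

def spread(params, diffs):
    # Ratio of the largest to the smallest T2 value over the given differences.
    values = [t2(d, **params) for d in diffs]
    return max(values) / min(values)

diffs = list(range(-20, 21, 5))  # assumed range of Topt - Tempc
initial_spread = spread(INITIAL, diffs)
revised_spread = spread(REVISED, diffs)
```

Over this assumed range the revised term varies far less than the initial one, which is consistent with the slide's description of a much flatter function.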

Revising Intrinsic Values in the Model
The NPPc submodel includes one intrinsic property, SR, associated with the variable for vegetation type, UMD-VEG.
The corresponding RF6 network includes one hidden node for SR and one dummy input variable for each vegetation type.

Veg type   A     B     C     D     E     F     G     H     I     J     K
Initial    3.06  4.35  4.35  4.05  5.09  3.06  4.05  4.05  4.05  5.09  4.05
Revised    2.57  4.77  2.20  3.99  3.70  3.46  2.34  0.34  2.72  3.46  1.60

RMSE = 467.910 for the original model; RMSE = 448.376 for the revised model, an improvement of four percent.
Observation: Nearly all intrinsic values are lower in the revised model.
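The dummy-variable encoding can be sketched as a weighted sum over a one-hot vector, so that the weight attached to each vegetation type is exactly its intrinsic SR value. This is an illustration of the encoding under that assumption, not the RF6 network itself; the values are those in the table above, and the helper names are made up.

```python
VEG_TYPES = list("ABCDEFGHIJK")
INITIAL_SR = [3.06, 4.35, 4.35, 4.05, 5.09, 3.06, 4.05, 4.05, 4.05, 5.09, 4.05]
REVISED_SR = [2.57, 4.77, 2.20, 3.99, 3.70, 3.46, 2.34, 0.34, 2.72, 3.46, 1.60]

def one_hot(veg_type):
    # One dummy input per vegetation type; exactly one of them is 1.
    return [1.0 if v == veg_type else 0.0 for v in VEG_TYPES]

def sr_node(veg_type, weights):
    # Hidden node for SR: weighted sum of the dummy inputs.
    return sum(wt * x for wt, x in zip(weights, one_hot(veg_type)))

# Vegetation types whose SR value decreased after revision.
lowered = [v for v, before, after in zip(VEG_TYPES, INITIAL_SR, REVISED_SR)
           if after < before]
```

Revising the weights on the dummy inputs is therefore the same as revising the table of intrinsic values; here `lowered` contains every type except B and F.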

Revising Equations in the NPPc Model
Initial model:
$E = 0.56 \cdot T1 \cdot T2 \cdot W$
Cross-validated RMSE = 467.910
Behavior: Each stress term decreases the photosynthetic efficiency E.
Revised model:
$E = 0.521 \cdot T1^{0.00} \cdot T2^{0.03} \cdot W^{0.00}$
Cross-validated RMSE = 446.270 [five percent reduction]
Behavior: T1 and W have no effect on E and T2 has only a minor effect.
Conclusion: The stress terms are not useful to the NPPc model, most likely because of recent improvements in NDVI measures.
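The reported reductions follow directly from the RMSE figures on the three revision slides. The short sketch below recomputes them and evaluates the revised E equation; the dictionary labels and function names are illustrative.

```python
BASELINE_RMSE = 467.910
REVISED_RMSE = {
    "T2 parameters": 461.466,
    "SR intrinsic values": 448.376,
    "E equation": 446.270,
}

def percent_reduction(before, after):
    # Relative drop in error, as a percentage of the original error.
    return 100.0 * (before - after) / before

reductions = {name: percent_reduction(BASELINE_RMSE, rmse)
              for name, rmse in REVISED_RMSE.items()}

def revised_e(t1, t2, w):
    # Revised equation: the zero exponents remove T1 and W entirely.
    return 0.521 * t1 ** 0.00 * t2 ** 0.03 * w ** 0.00
```

Rounded to whole percentages, the three reductions come out to one, four, and five percent, matching the slides.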

Future Work on Ecological Model Revision
- Apply the revision method to other parts of NPPc submodel and other static parts of CASA model.
- Extend the revision method to improve parts of CASA that involve difference equations.
- Develop software for visualizing both spatial and temporal anomalies, as well as relating them to the model.
- Implement an interactive system that lets scientists direct high-level search for improved ecosystem models.

Visualizing an Improved Model
One way to visualize a model involves plotting its rules spatially.
Our Earth science collaborators found this useful, as regions often correspond to recognizable ecological zones.

Some Interesting Biological Questions
- How do organisms acclimate to increased temperature or ultraviolet radiation?
- Why do we observe bleaching of plant cells under high light conditions?
- What differences in biological processes exist between a mutant organism and the original?
- What are the effects on an organism's biological processes when one of its important genes is removed?

Modeling Microarray Results on Photosynthesis
Given: Qualitative knowledge about reactions and regulations for Cyanobacteria in a high light situation.
Given: Knowledge about the genes in Cyanobacteria relevant to the photosynthetic process.
Given: Observed expression levels, over time, of the organism's genes in the presence of high ultraviolet light.
Find: A revised model with altered reactions and regulations that explains the expression levels and bleaching.

A Model of Photosynthesis Regulation
How do plants modify their photosynthetic apparatus in high light?
[Figure: regulatory network with signed (+/-) links among Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB.]

Collecting Data on Photosynthetic Processes
[Figure: experimental procedure, labeled Continuous Culture (Chemostat), Equilibrium Period, Stress (e.g., High Light), Adaptation Period, Sampling, mRNA/cDNA, Microarray, and Trace, with a plot of Health of Culture against Time. Image sources: /wwwscience.murdoch.edu.au/teach, www.affymetrix.com/]

Microarray Data on Photosynthetic Regulation

Revising a Model of Gene Regulation
Our approach carries out heuristic search through the model space, guided by candidates' abilities to explain the data:
- Starting state: Initial model proposed by the biologist
- Operators: Add a link, delete a link, determine sign on a link
- Control: Greedy search for N steps to determine link structure; exhaustive search to determine best signs on links
- Evaluation: Agreement with predicted relations among partial correlations, similar to those used in Tetrad
To reduce variance, the system repeats this process using bootstrap sampling and only makes changes that occur in 75% of the models.
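The search scheme described above can be sketched in a few dozen lines. This is an illustration under loud assumptions, not the authors' system: the evaluation function is a simplified stand-in that compares each link's sign with the sign of a pairwise Pearson correlation (the slides describe relations among partial correlations instead), the 0.3 threshold is arbitrary, the variable names light, g1, and g2 are hypothetical, and the data are synthetic.

```python
import itertools
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def observed_sign(data, a, b, threshold=0.3):
    # Stand-in evidence: +1, -1, or 0 (no clear relation) from pairwise correlation.
    r = pearson([row[a] for row in data], [row[b] for row in data])
    return 0 if abs(r) < threshold else (1 if r > 0 else -1)

def score(signed_links, data, pairs):
    # Number of candidate pairs whose predicted sign (0 if unlinked) matches the data.
    return sum(signed_links.get(p, 0) == observed_sign(data, *p) for p in pairs)

def best_signs(links, data, pairs):
    # Exhaustive search over sign assignments for a fixed link structure.
    links = sorted(links)
    options = [dict(zip(links, signs))
               for signs in itertools.product((1, -1), repeat=len(links))]
    return max(options, key=lambda m: score(m, data, pairs))

def greedy_revise(initial_links, data, pairs, n_steps=3):
    links = set(initial_links)
    current = best_signs(links, data, pairs)
    for _ in range(n_steps):
        # Operators: toggling a pair adds the link if absent, deletes it if present.
        candidates = [best_signs(links ^ {p}, data, pairs) for p in pairs]
        best = max(candidates, key=lambda m: score(m, data, pairs))
        if score(best, data, pairs) <= score(current, data, pairs):
            break  # no single change improves the match
        current, links = best, set(best)
    return current

def bootstrap_revise(initial_links, data, pairs, n_boot=20, keep=0.75, seed=0):
    # Keep only link additions or deletions proposed in at least `keep` of the runs.
    rng = random.Random(seed)
    votes = {}
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        revised = greedy_revise(initial_links, sample, pairs)
        for change in set(revised) ^ set(initial_links):
            votes[change] = votes.get(change, 0) + 1
    return {change for change, count in votes.items() if count / n_boot >= keep}

# Synthetic data: g1 rises with light, g2 alternates independently of it.
data = [{"light": i, "g1": 2 * i + i % 2, "g2": (-1) ** i} for i in range(10)]
pairs = [("light", "g1"), ("light", "g2")]
revised_model = greedy_revise({("light", "g2")}, data, pairs)
stable_changes = bootstrap_revise({("light", "g2")}, data, pairs)
```

On the full synthetic data set the greedy search adds a positive link from light to g1 and drops the unsupported link to g2. The bootstrap step then reports only those changes that recur across resampled data sets.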

Greedy Search Through a Space of Models
[Figure: search tree that starts from the initial model and expands Revisions 1.1 to 1.4, then Revisions 2.1 to 2.4, then Revisions 3.1 to 3.4.]

A Revised Model of Photosynthesis Regulation
[Figure: the regulatory network with signed (+/-) links among Light, DFR, NBLA, NBLR, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with two links marked ×.]
Changes to the model improve its match to the expression data.
Similar changes adapt the model to expression data from mutants.

Future Work on Biological Modeling
- Add more knowledge about biochemical pathways and use to interpret other microarray data (e.g., rat metabolism, cancer).
- Introduce taxonomic knowledge to limit the search process and improve final models.
- Expand modeling formalism to support biological mechanisms in addition to abstract processes.
- Implement an interactive system that lets scientists direct high-level search for improved biological process models.

Concluding Remarks
In summary, unlike work in the data mining paradigm, our research on computational discovery:
- attempts to move beyond description and prediction to both explanation and understanding;
- uses domain knowledge to initialize search and to characterize differences from revised model;
- presents the new knowledge in some communicable notation that is familiar to domain experts.
This approach seems especially appropriate for manipulating and understanding complex scientific and engineering data.

In Memoriam
Earlier this year, computational scientific discovery lost two of its founding fathers:
- Herbert A. Simon (1916 – 2001)
- Jan M. Zytkow (1945 – 2001)
Both contributed to the field in many ways: posing new problems, inventing methods, training students, and organizing meetings.
Moreover, both were interdisciplinary researchers who contributed to computer science, psychology, philosophy, and statistics.
Herb Simon and Jan Zytkow were excellent role models that we should aim to emulate.

A Closing Quotation
"We would like to imagine that the great discoverers, the scientists whose behavior we are trying to understand, would be pleased with this interpretation of their activity as normal (albeit high-quality) human thinking... But science is concerned with the way the world is, not with how we would like it to be. So we must continue to try new experiments, to be guided by new evidence, in a heuristic search that is never finished but always fascinating."
Herbert A. Simon, Envoi to Scientific Discovery, 1987.

Visualizing Errors in the Model
We can easily plot an improved model's errors in spatial terms.
Such displays can help suggest causes for prediction errors and thus ways to further improve the model.

Related Research on Discovery
Our approach to computational scientific discovery borrows ideas from earlier work on:
- equation discovery (Langley et al., 1983; Zytkow et al., 1990; Washio & Motoda, 1998; Todorovski & Dzeroski, 1997);
- revision of qualitative models (Ourston & Mooney, 1990; Towell, 1991);
- revision of quantitative models (Glymour et al., 1987; Chown & Dietterich, 2000).
However, our work combines these ideas in novel ways to produce a discovery system with new functionality.
