
Pat Langley Institute for the Study of Learning and Expertise Palo Alto, California and Center for the Study of Language and Information Stanford University,


1 Challenges in the Computational Modeling of Gene Regulation
Pat Langley, Institute for the Study of Learning and Expertise, Palo Alto, California, and Center for the Study of Language and Information, Stanford University, Stanford, California
Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, A. Pohorille, J. Shrager, and H. Spencer.

2 Themes in Computational Biology
Three distinctive themes in computational biology have been:
- manual development of knowledge bases about biological systems (e.g., E-Cell, EcoCyc, KEGG);
- automated analysis of available genomic and expression data (e.g., clustering, inferring regulatory networks);
- interactive tools for visualizing genomic and expression data (e.g., SpotFire).
However, each approach by itself is incomplete, and a complete solution must combine knowledge, data, and user interaction.

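3 Knowledge, Data, and the Biologist
[Figure: a discovery process takes domain knowledge and experimental data as input and produces an updated model, with the biologist in the loop. Both the initial and the updated model are causal diagrams over Light, DFR, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB; the updated model marks a removed link with ×.]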

4 Knowledge, Data, and the Biologist
- Domain knowledge: initial model of gene regulatory processes, gene ontologies, and biological constraints.
- Experimental data: observed expression levels from cDNA microarrays or from other sources.
- Updated model: revised model of gene regulatory processes that explains observed expression data.
[Figure: the discovery process connects domain knowledge, experimental data, the updated model, and the biologist.]

5 Reasons for Studying Cyanobacteria
Cyanobacteria are 3.5 billion years old and created Earth's early oxygen atmosphere.
Algae and cyanobacteria produce most of the oxygen we breathe and fix most greenhouse carbon dioxide.
Thus, together they form the base of the marine ecosystem.

6 Collecting Data on Photosynthetic Processes
[Figure: health of culture plotted against time for a continuous culture (chemostat). Labels: Equilibrium Period; Stress (e.g., High Light); Adaptation Period; Sampling mRNA/cDNA; Microarray Trace. Image source: /wwwscience.murdoch.edu.au/teach]

7 A Biologist's Depiction of Photosynthesis

8 Challenge 1: Representing Biological Models
To assist biologists in their modeling efforts, we must first encode candidate models; however, most biological models are:
- qualitative rather than quantitative;
- abstract in that they ignore many details;
- causal in that they describe chains of effects;
- involve processes that involve biological mechanisms.
We need some formal way to represent such models that can be interpreted computationally.

9 Some Representations of Biological Knowledge
taxonomies; differential equations; Boolean networks; Bayesian networks

10 An Abstract Qualitative Causal Model
How do plants modify their photosynthetic apparatus in high light?
[Figure: causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB.]
This model is qualitative but relates continuous variables, much like formalisms from qualitative physics (e.g., Forbus, 1984).

11 Challenge 2: Making Predictions from Models
To evaluate a regulatory model, it must make predictions about quantitative measures of gene expression. A qualitative model cannot predict numeric values but can predict:
- that some partial correlations will be zero;
- that some partial correlation products will be equal; and
- the signs of correlations between variables.
These predictions assume that each variable is a linear function of its causal parents, as in Glymour et al.'s (1987) Tetrad.
Some models must also include statements that certain regulatory pathways dominate others.
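For reference, the standard first-order partial correlation that such predictions refer to can be written as follows. The symbols X, Y, and Z are generic variables chosen here for illustration, not names taken from the slides.

```latex
\rho_{XZ \cdot Y}
  = \frac{\rho_{XZ} - \rho_{XY}\,\rho_{YZ}}
         {\sqrt{\left(1 - \rho_{XY}^{2}\right)\left(1 - \rho_{YZ}^{2}\right)}}
```

Under the linearity assumption, a chain X -> Y -> Z with independent noise gives \rho_{XZ} = \rho_{XY}\rho_{YZ}, so the numerator vanishes and the model predicts \rho_{XZ \cdot Y} = 0 whatever the strengths or signs of the two links.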

12 Implications of Three Causal Models
[Figure: three causal structures over X, Y, and Z, with implied partial correlations ρXZ.Y = 0, ρXZ.Y = 0, and ρXZ.Y ≠ 0, respectively.]
Note that these implications do not depend on the effects' signs.
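The arrows of the three structures are not present in the transcript. A minimal sketch, assuming they are the chain X -> Y -> Z, the common cause X <- Y -> Z, and the common effect X -> Y <- Z, simulates each one as a linear model and checks the partial correlation of X and Z given Y. The function names, the coefficient `b`, and the sample size are illustrative choices.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, zs, ys):
    """Partial correlation of X and Z with Y held fixed."""
    r_xz, r_xy, r_yz = pearson(xs, zs), pearson(xs, ys), pearson(ys, zs)
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2) * (1 - r_yz ** 2))


def simulate(structure, b=0.8, n=5000, seed=0):
    """Draw n samples of (X, Y, Z) from a linear model with unit Gaussian noise.

    The three structure names are assumptions about the slide's figure.
    """
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        e3 = rng.gauss(0.0, 1.0)
        if structure == "chain":          # X -> Y -> Z
            x = e1
            y = b * x + e2
            z = b * y + e3
        elif structure == "fork":         # X <- Y -> Z
            y = e1
            x = b * y + e2
            z = b * y + e3
        elif structure == "collider":     # X -> Y <- Z
            x = e1
            z = e2
            y = b * x + b * z + e3
        else:
            raise ValueError(f"unknown structure: {structure}")
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs
```

Calling `partial_corr(xs, zs, ys)` on the output of `simulate("chain")` or `simulate("fork")` should give a value near zero, while `simulate("collider")` should not; flipping the sign of `b` leaves that pattern unchanged, which is the point of the slide's closing note.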

13 Challenge 3: Encoding Background Knowledge
Analysis of biological data should take into account knowledge about the organism under study.
To constrain candidate models, we must encode knowledge about biological entities and processes. This background knowledge can take the form of:
- an initial qualitative model of gene regulation;
- genes that may be involved in the phenomena;
- a taxonomy of these relevant genes; and
- constraints on links between types of genes.

14 Some Constraints on Biological Models
We can start with an initial causal model proposed by biologists.
We can also forbid causal links between certain pairs of variables.
[Figure: the causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with two forbidden links marked ×.]
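One way such constraints could be encoded is as a lookup from variables to types plus sets of forbidden type pairs and forbidden variable pairs. This is only a sketch: the type labels, the specific prohibitions, and the function names below are hypothetical and are not taken from the slides.

```python
# Hypothetical coarse types; the slides do not give the actual taxonomy.
VARIABLE_TYPE = {
    "Light": "environment",
    "NBLR": "regulator",
    "NBLA": "regulator",
    "cpcB": "structural",
    "Health": "phenotype",
}

# Assumed type-level rules: nothing inside the cell causes the light level,
# and the phenotype is treated as a pure effect.
FORBIDDEN_TYPE_PAIRS = {
    ("regulator", "environment"),
    ("structural", "environment"),
    ("phenotype", "environment"),
    ("phenotype", "regulator"),
    ("phenotype", "structural"),
}

# Assumed prohibition on one specific pair of variables.
FORBIDDEN_LINKS = {("Light", "Health")}


def is_allowed(src, dst):
    """Return True if a causal link src -> dst violates no constraint."""
    if src == dst or (src, dst) in FORBIDDEN_LINKS:
        return False
    return (VARIABLE_TYPE[src], VARIABLE_TYPE[dst]) not in FORBIDDEN_TYPE_PAIRS


def candidate_links(existing):
    """List every permitted link that is not already in the model."""
    names = sorted(VARIABLE_TYPE)
    return [
        (s, d)
        for s in names
        for d in names
        if is_allowed(s, d) and (s, d) not in existing
    ]
```

A revision procedure would then draw its "add link" proposals only from `candidate_links(...)`, which keeps the search space small when data are scarce.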

15 Challenge 4: Revising Models Given Expression Data
To revise a regulatory model, we must develop an algorithm that searches through the space of models. This requires us to make design decisions about:
- the initial state from which to start search;
- the operators that generate new states;
- the evaluation function that selects among states;
- the overall control regime for the search; and
- the halting criterion for ending the search.
We have implemented a two-stage method to search the space of qualitative causal models of gene regulation.

16 Stage 1: Determining Model Structure
Our system carries out heuristic search through the space of causal model structures.
- Initial state: A preliminary model proposed by a biologist.
- Operators: Add a new link (constrained by variable types); delete an existing link.
- Evaluation: Agreement with predicted relations among partial correlations, similar to those used in Tetrad.
- Control: Greedy search to select the best structure on each round.
- Halting: Stop when there is no further improvement in the evaluation metric.
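The control regime described for Stage 1 can be sketched as a generic greedy loop over add-link and delete-link operators. The `score` argument stands in for the evaluation function; the slides score agreement with predicted partial-correlation relations, which is not reproduced here, and all names are illustrative.

```python
def neighbors(links, allowed):
    """Yield every structure one operator application away from `links`."""
    for link in links:                 # operator: delete an existing link
        yield links - {link}
    for link in allowed - links:       # operator: add a permitted new link
        yield links | {link}


def greedy_search(initial, allowed, score):
    """Hill-climb from `initial`, taking the best-scoring revision each round.

    `score` maps a frozenset of (source, target) links to a number, higher
    being better. The search halts when no single revision improves the score.
    """
    allowed = frozenset(allowed)
    current = frozenset(initial)
    current_score = score(current)
    while True:
        candidates = list(neighbors(current, allowed))
        if not candidates:
            return current
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            return current
        current, current_score = best, best_score
```

For a quick check, a toy score such as `lambda links: -len(links ^ target)` (distance to a known target structure) lets the loop be exercised without any expression data; it is a stand-in, not the evaluation used in the work.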

17 Stage 2: Adding Signs to the Model
Our system carries out a second search through the space of signed qualitative models.
- Initial state: The unsigned model structure generated in Stage 1.
- Operators: Associate a sign (+ or –) with a given link; label some pathways as dominant over others.
- Evaluation: Agreement with the signs of correlations computed from the data.
- Control: Exhaustive search for small models; greedy search for more complex models.
- Halting: Stop when each link has an associated sign.
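For small models, the exhaustive variant of Stage 2 can be sketched by enumerating every assignment of signs to links and counting how many observed correlation signs each assignment reproduces. This sketch makes simplifying assumptions that are not in the slides: the predicted sign between a cause and a descendant is the product of link signs along the directed paths, pairs related only through a common cause are ignored, and conflicting paths yield no prediction (which is where dominance labels would be needed).

```python
from itertools import product


def path_signs(signs, src, dst, seen=()):
    """Return the set of sign products over all directed paths src -> dst."""
    if src == dst:
        return {1}
    found = set()
    for (a, b), s in signs.items():
        if a == src and b not in seen:
            found |= {s * t for t in path_signs(signs, b, dst, seen + (src,))}
    return found


def predicted_sign(signs, src, dst):
    """Predicted correlation sign, or None if there is no path or paths conflict."""
    found = path_signs(signs, src, dst)
    return found.pop() if len(found) == 1 else None


def best_signing(links, observed):
    """Try all 2**len(links) signings; return the best one and its match count.

    `observed` maps (cause, descendant) pairs to +1 or -1, the sign of the
    correlation computed from data.
    """
    best, best_score = None, -1
    for combo in product((1, -1), repeat=len(links)):
        signs = dict(zip(links, combo))
        score = sum(
            predicted_sign(signs, a, b) == s for (a, b), s in observed.items()
        )
        if score > best_score:
            best, best_score = signs, score
    return best, best_score
```

The cost doubles with every added link, which is consistent with falling back to greedy search for more complex models.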

18 Expression Data on Photosynthetic Regulation Initial study produced four replications at each of five time steps.

19 A Revised Model of Photosynthesis Regulation
Changes to the model improve its match to the expression data.
[Figure: the revised causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with signed links (+ and -) and forbidden links marked ×.]
Similar changes adapt the model to expression data from mutants.

20 Challenge 5: Dealing with Small Data Sets
Microarray technology provides many measurements but it often gives very few samples. To reduce variance and avoid overfitting these data, our method:
- starts from an initial model rather than from scratch;
- incorporates biological constraints on model revisions;
- uses bootstrap sampling to generate 20 data sets, then runs the revision method 20 times and retains only changes that occur in at least 75% of the runs.
Experimental studies suggest that these strategies reduce variance and produce more robust models.
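The bootstrap step can be sketched as follows, using the 20 runs and the 75% threshold stated on the slide. The `revise` callback is a placeholder for the revision method and is assumed to return the collection of changes it proposes for one resampled data set; the function name and the seed parameter are illustrative.

```python
import random
from collections import Counter


def bootstrap_revisions(data, revise, n_runs=20, threshold=0.75, seed=0):
    """Keep only the revisions proposed in at least `threshold` of the runs.

    data:   list of samples (e.g., one expression profile per entry)
    revise: function taking a resampled data set and returning an iterable
            of hashable change descriptions (placeholder for the real method)
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_runs):
        # Resample with replacement to the original size.
        sample = [rng.choice(data) for _ in data]
        # Count each change at most once per run.
        counts.update(set(revise(sample)))
    return {change for change, c in counts.items() if c / n_runs >= threshold}
```

With the defaults, a change must appear in at least 15 of the 20 runs to survive, so revisions that depend on a few unusual samples tend to be filtered out.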

21 Experimental Studies with Synthetic Data
To evaluate our revision method, we used a target model to create synthetic data and systematically varied the initial model's distance from that target.
The number of incorrect revisions seems unaffected by distance.

22 Challenge 5: Dealing with Temporal Phenomena
Many biological processes occur over extended periods of time; to deal with such phenomena, we need methods that:
- represent biological models with time-delayed effects;
- utilize these time-delayed models to make predictions;
- evaluate alternative models in terms of their fit to data;
- carry out search through the space of alternative models.
We have extended our framework to handle qualitative causal models with time delays and we have done initial evaluations.

23 A Regulatory Model with Time Delays
We can handle temporal phenomena by adding time delays to links.
[Figure: the causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, with time delays of 6 and 15 attached to links.]
This model predicts the system's qualitative behavior over time.

24 Synthetic Data from Time-Delay Model
[Figure: synthetic time series for Light, NBLA, and Health.]

25 A Method for Revising Time-Delay Models
Generalize correlation and partial correlation to the frequency domain.

26 A Reconstructed Model with Time Delays
Our method reconstructs most of this model from synthetic data.
[Figure: the reconstructed causal diagram over Light, dspA, NBLR, NBLA, RR, Photo, PBS, Health, psbA1, psbA2, and cpcB, showing a recovered delay of 6 and one link marked ×.]
Determining the link delays from time series seems tractable, but this requires a high sampling rate.
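To illustrate why link delays are recoverable from densely sampled series, the sketch below estimates a delay by scanning lagged correlations in the time domain. This is a simpler stand-in chosen for illustration, not the frequency-domain generalization the slides describe, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def estimate_delay(cause, effect, max_lag):
    """Return (lag, correlation) for the lag with the strongest correlation.

    Pairs cause[t] with effect[t + lag] for each lag in 0..max_lag; both
    series must have the same length and be sampled at the same time steps.
    """
    n = len(cause)
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        r = pearson(cause[: n - lag], effect[lag:])
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r
```

The sign of the returned correlation also suggests the sign of the link. If the series is sampled more coarsely than the true delay, neighboring lags become indistinguishable, which matches the slide's caveat about sampling rate.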

27 Challenge 6: Interfacing with Biologists
We are developing an environment that lets its biologist users:
- specify qualitative causal models of biological systems;
- display and edit a model's structure and details graphically;
- incorporate knowledge and results from previous studies;
- evaluate the evidence in favor of specific hypotheses;
- propose revisions to the model in response to observations.
The environment will offer computational assistance in forming and evaluating models but let the biologist retain control.

28 An Interactive Environment for Biological Modeling

29 Additional Work on Biological Modeling
Our ongoing research on biological model revision has involved:
- developing other approaches to revising regulatory models, including Bayesian scoring and neural networks;
- introducing taxonomic knowledge about genes and biological processes to constrain the search process; and
- expanding the modeling formalism to represent biological mechanisms in addition to abstract processes.
Thus, we continue to explore ways to combine knowledge with data to aid the creation of biological models.

30 Additional Models and Data
We are also applying our biological modeling framework to:
- naturalistic data on photosynthesis regulation in Cyanobacteria in a setting that mimics the day/night cycle;
- testing if certain genes are targets of unobserved transcription factors, using time-series data on the yeast cell cycle;
- testing whether the transcription factor c-Jun is activated by anything other than Jnk2, using data on healthy lung tissue.
These efforts should further test the robustness of our approach and provide evidence of its generality.

31 Intellectual Influences
Our approach to computational biological discovery borrows ideas from many traditions:
- qualitative physics and simulation (e.g., Forbus, 1984);
- linear causal models and their inference (Glymour et al., 1987);
- computational scientific discovery (e.g., Langley et al., 1987);
- theory revision in machine learning (e.g., Towell, 1991);
- interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning, knowledge representation, and human-computer interaction.

32 Contributions of the Research
In summary, our work on computational biological modeling and discovery responds to six major challenges:
- representing biological models that are qualitative and abstract;
- making testable predictions from such qualitative causal models;
- encoding knowledge about biological entities and processes;
- utilizing knowledge and data to revise initial process models;
- making revision methods robust despite small amounts of data;
- developing interactive tools that let biologists remain in control.
Taken together, our six responses constitute a novel and promising approach to elucidating biological models.


34 Revising Qualitative Models of Gene Regulation
Pat Langley and Jeff Shrager, Institute for the Study of Learning and Expertise, Palo Alto, California, and Andrew Pohorille, Center for Computational Astrobiology, NASA Ames Research Center, Moffett Field, California
Thanks to S. Bay, L. Chrisman, A. Grossman, L. MacIntosh, and H. Spencer.

35 Greedy Search Through a Space of Models
[Figure: search tree that starts from the initial model and expands three rounds of candidate revisions: Revisions 1.1 to 1.4, Revisions 2.1 to 2.4, and Revisions 3.1 to 3.4.]

36 Synthetic Data from Time-Delay Model

