Genetic Regulatory Network Inference Russell Schwartz Department of Biological Sciences Carnegie Mellon University.

Why Study Network Inference?
- It can help us understand how to interpret and when to trust biological networks
- It is a model for many kinds of complex inference problems in systems biology and beyond
- It is a great example of a machine learning problem, a kind of computer science central to much work in biology
- Network inference is a good way of thinking about issues in data abstraction central to all computational thinking

Our Assumptions
- We will focus specifically on transcriptional regulatory networks, assuming no cycles
- We will assume, at least initially, that our data source is a set of microarray gene expression values
[Figure: example regulatory network (cI, Cro) with + and - edges; clustered gene expression heat map, axes: conditions, genes]
*Clustered gene expression data from NCBI Gene Expression Omnibus (GEO), entry GSE1037: M. H. Jones et al. (2004) Lancet 363(9411):775-81.*

Intuition Behind Network Inference
[Figure: expression heat map for genes 1-4 (axes: conditions, genes) next to several candidate networks over genes 1-4 with + and - edges]
- Correlated expression implies common regulation
- That intuition still leaves a lot of ambiguity

Why Is Intuition Not Enough?
- Models are ambiguous
- Data are noisy
- Data are sparse
[Figures: alternative networks over genes 1-4; expression heat maps]
*Clustered gene expression data from NCBI Gene Expression Omnibus (GEO), entry GSE1037: M. H. Jones et al. (2004) Lancet 363(9411):775-81.*

A Next Step Beyond Intuition: Assuming a Binary Input Matrix
- We will assume for the moment that genes only have two possible states: 0 (off) or 1 (on)
- We will also assume that we want to find directionality but not strength of regulatory interactions
Binary input matrix (one column per condition):
  gene 1: 1 1 0 0 1 1 1 0
  gene 2: 0 1 0 1 1 1 1 0
  gene 3: 0 0 1 0 0 0 0 1
  gene 4: 0 0 0 0 0 1 0 1
[Figure: network over genes 1-4]

Making it Even Simpler: Two Genes
Data (one column per condition):
  gene 1: 1 1 0 0 1 1 1 0
  gene 2: 0 1 0 1 1 1 1 0
Only three possible models to consider:
- model 1: “G1 regulates G2”
- model 2: “G2 regulates G1”
- model 3: “G1 and G2 are independent”

Judging a Model: Likelihood
- Complicated inference problems like this are commonly described in terms of probabilities
- We want to infer a model (which we will call M) using a data set (which we will call D)
- Problems like this are commonly posed in terms of maximizing a likelihood function: Pr{D|M}
- We read this as “probability of the data given the model,” i.e., the probability that a given model would generate a given data set
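Written as a formula, this is the usual maximum-likelihood objective. The hat on M and the symbol for the set of candidate models are notation assumed here for illustration; they do not appear on the slides.

```latex
\hat{M} = \arg\max_{M \in \mathcal{M}} \Pr\{D \mid M\}
```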

What is the Probability of a Microarray?
We can describe the probability of a microarray as the product of the probabilities of all of its individual measurements:
Pr{1 1 0 0 1 1 1 0} = Pr{1} x Pr{1} x Pr{0} x Pr{0} x Pr{1} x Pr{1} x Pr{1} x Pr{0}

What is the Probability of One Measurement on a Microarray?
We can estimate Pr{1} and Pr{0} by counting how often each individual value occurs:
- Pr{1} = 5/8
- Pr{0} = 3/8
Therefore:
Pr{1 1 0 0 1 1 1 0} = Pr{1} x Pr{1} x Pr{0} x Pr{0} x Pr{1} x Pr{1} x Pr{1} x Pr{0}
= 5/8 x 5/8 x 3/8 x 3/8 x 5/8 x 5/8 x 5/8 x 3/8 = 0.00503
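The counting estimate can be checked with a few lines of Python. This is a minimal sketch, not code from the presentation: the function name `row_probability` is made up for illustration, and the order of the values in `gene1` is an assumption read off the factor sequence 5/8 x 5/8 x 3/8 x 3/8 x 5/8 x 5/8 x 5/8 x 3/8. The result depends only on the counts (five 1s, three 0s), not on the order.

```python
from fractions import Fraction

def row_probability(row):
    """Probability of a binary row, treating each value as an independent draw."""
    n = len(row)
    # Estimate Pr{0} and Pr{1} by counting how often each value occurs.
    freq = {v: Fraction(row.count(v), n) for v in (0, 1)}
    prob = Fraction(1)
    for v in row:
        prob *= freq[v]
    return prob

# Assumed order of gene 1's values (five 1s, three 0s).
gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
p = row_probability(gene1)
print(p, float(p))
```

Exact fractions are used so that the value can be compared against the hand calculation without rounding noise; it rounds to 0.00503.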

Evaluating One Model
data D =
  gene 1: 1 1 0 0 1 1 1 0
  gene 2: 0 1 0 1 1 1 1 0
model M = “G1 and G2 are independent” [figure: genes 1 and 2 with no edge]
Pr{D|M} = Pr{G1} x Pr{G2} = 0.00503 x 0.00503 = 2.5 x 10^-5

Adding in Regulation
How do we evaluate output probabilities for a regulated gene?
We need the notion of conditional probability: evaluating the probability of gene 2’s output given that we know gene 1’s output:
  gene 1: 1 1 0 0 1 1 1 0
  gene 2: 0 1 0 1 1 1 1 0
[Figure: genes 1 and 2 joined by an edge]
Pr{G2=0 | G1=1} = 1/5
Pr{G2=1 | G1=1} = 4/5
Pr{G2=0 | G1=0} = 2/3
Pr{G2=1 | G1=0} = 1/3
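The four conditional probabilities come from counting within each state of gene 1. The sketch below illustrates that counting; the helper name `conditional_table` is hypothetical, and the per-condition order of the two rows is an assumption, chosen because it reproduces the values 1/5, 4/5, 2/3 and 1/3 quoted on this slide.

```python
from fractions import Fraction

# Assumed per-condition order of the two rows (see the note above).
gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
gene2 = [0, 1, 0, 1, 1, 1, 1, 0]

def conditional_table(child, parent):
    """Estimate Pr{child = c | parent = p} by counting co-occurrences.

    Assumes the parent takes both values 0 and 1 somewhere in the data.
    """
    table = {}
    for p in (0, 1):
        # Child values observed in the conditions where the parent equals p.
        matching = [c for c, q in zip(child, parent) if q == p]
        for c in (0, 1):
            table[(c, p)] = Fraction(matching.count(c), len(matching))
    return table

table = conditional_table(gene2, gene1)
for (c, p), prob in sorted(table.items()):
    print(f"Pr{{G2={c} | G1={p}}} = {prob}")
```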

Evaluating Another Model
data D =
  gene 1: 1 1 0 0 1 1 1 0
  gene 2: 0 1 0 1 1 1 1 0
model M = “G1 regulates G2” [figure]
Pr{D|M} = Pr{G1} x Pr{G2|G1}
= 0.00503 x (1/5 x 4/5 x 2/3 x 1/3 x 4/5 x 4/5 x 4/5 x 2/3) = 6.1 x 10^-5

Evaluating Another Model
data D =
  gene 1: 1 1 0 0 1 1 1 0
  gene 2: 0 1 0 1 1 1 1 0
model M = “G2 regulates G1” [figure]
Pr{D|M} = Pr{G1|G2} x Pr{G2}
= (1/3 x 4/5 x 2/3 x 1/5 x 4/5 x 4/5 x 4/5 x 2/3) x 0.00503 = 6.1 x 10^-5

Comparing the Models for Two Genes
Pr{D | “G1 regulates G2”} = 6.1 x 10^-5
Pr{D | “G2 regulates G1”} = 6.1 x 10^-5
Pr{D | “G1 and G2 are independent”} = 2.5 x 10^-5
Conclusion: Knowing the expression of gene 1 helps us predict the expression of gene 2 and vice versa; we can suggest there should be an edge between them but cannot decide the direction it should take
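The whole comparison can be reproduced in one short script. This is a sketch under the same assumption about the per-condition order of the two rows; the function names `marginal` and `conditional` are illustrative and not taken from the presentation.

```python
from fractions import Fraction

# Assumed per-condition order of the two rows.
gene1 = [1, 1, 0, 0, 1, 1, 1, 0]
gene2 = [0, 1, 0, 1, 1, 1, 1, 0]

def marginal(row):
    """Pr{row}, each value drawn independently at its observed frequency."""
    prob = Fraction(1)
    for v in row:
        prob *= Fraction(row.count(v), len(row))
    return prob

def conditional(child, parent):
    """Pr{child row | parent row}, using counts within each parent state."""
    prob = Fraction(1)
    for c, p in zip(child, parent):
        # Child values in the conditions that share this parent state.
        same_parent = [x for x, q in zip(child, parent) if q == p]
        prob *= Fraction(same_parent.count(c), len(same_parent))
    return prob

likelihoods = {
    "G1 regulates G2": marginal(gene1) * conditional(gene2, gene1),
    "G2 regulates G1": marginal(gene2) * conditional(gene1, gene2),
    "G1 and G2 are independent": marginal(gene1) * marginal(gene2),
}
for name, value in likelihoods.items():
    print(f"{name}: {float(value):.2e}")
```

The two directed models tie exactly because each is the same estimated joint distribution of the two genes, factored in a different order. That is why this kind of data can support an edge but not its direction.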

Generalizing to Many Genes
- The same basic concepts let us evaluate the plausibility of any regulatory model
- This is known as a Bayesian graphical model
[Figure: four-gene data matrix D and a network M over genes 1-4]
Pr{D|M} is a product with one term per gene: Pr{ } x Pr{ | } x Pr{ | , } x Pr{ | }
(one gene with no regulator, two genes with a single regulator each, and one gene with two regulators; the gene rows that fill each slot are shown as images on the slide)
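A minimal sketch of the general calculation is shown below. Several things here are assumptions rather than content from the slides: the function name `network_likelihood`, the dictionary layout, the rows for genes 3 and 4 (a best reading of the flattened matrix, so treat them as illustrative), and above all the network structure, which is a hypothetical example because the edges of the slide's network are not recoverable from the transcript.

```python
from fractions import Fraction

def network_likelihood(data, parents):
    """Pr{D | M} as a product of one term per gene and condition.

    data maps gene name -> list of 0/1 values (one per condition);
    parents maps gene name -> tuple of regulator names (empty if unregulated).
    """
    n = len(next(iter(data.values())))
    prob = Fraction(1)
    for gene, regs in parents.items():
        for i in range(n):
            # Conditions where every regulator has the same value as in condition i
            # (all conditions when the gene has no regulators).
            context = [j for j in range(n)
                       if all(data[r][j] == data[r][i] for r in regs)]
            hits = sum(1 for j in context if data[gene][j] == data[gene][i])
            prob *= Fraction(hits, len(context))
    return prob

data = {
    "g1": [1, 1, 0, 0, 1, 1, 1, 0],
    "g2": [0, 1, 0, 1, 1, 1, 1, 0],
    "g3": [0, 0, 1, 0, 0, 0, 0, 1],  # illustrative reading of the matrix
    "g4": [0, 0, 0, 0, 0, 1, 0, 1],  # illustrative reading of the matrix
}
# Hypothetical structure for illustration only: g1 -> g2, g1 -> g3, {g2, g3} -> g4.
model = {"g1": (), "g2": ("g1",), "g3": ("g1",), "g4": ("g2", "g3")}
print(float(network_likelihood(data, model)))
```

Restricting `parents` to genes 1 and 2 recovers the two-gene results, for example `network_likelihood(data, {"g1": (), "g2": ("g1",)})`.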

Adding Prior Knowledge
- We can also build in any prior knowledge we have about the proper model (e.g., from the literature)
- We can use that knowledge by simply multiplying each likelihood by our prior confidence in its validity:
Pr{D|M} x Pr{edge between genes 1 and 2} x Pr{edge between genes 1 and 3} x Pr{edge between genes 1 and 4} x Pr{edge between genes 3 and 2} x …
[Figure: four-gene data matrix D and network M; each prior term is drawn as a gene pair on the slide]
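As a small illustration, the sketch below applies this idea to the two-gene example, using the likelihoods 6.1 x 10^-5 and 2.5 x 10^-5 quoted earlier. The prior values 0.8 and 0.2 are made-up numbers, and the names `edge_prior` and `prior_weighted_score` are hypothetical. It also assumes a simplification: only edges present in a model contribute a prior factor, so a model with no edges is left unweighted.

```python
# Likelihoods Pr{D | M} quoted for the two-gene example.
likelihood = {
    "G1 regulates G2": 6.1e-5,
    "G2 regulates G1": 6.1e-5,
    "G1 and G2 are independent": 2.5e-5,
}
# Edges contained in each model, as (regulator, target) pairs.
edges = {
    "G1 regulates G2": [("G1", "G2")],
    "G2 regulates G1": [("G2", "G1")],
    "G1 and G2 are independent": [],
}
# Hypothetical prior confidence in each edge (made-up numbers).
edge_prior = {("G1", "G2"): 0.8, ("G2", "G1"): 0.2}

def prior_weighted_score(model):
    """Likelihood multiplied by the prior of every edge in the model."""
    score = likelihood[model]
    for edge in edges[model]:
        score *= edge_prior[edge]
    return score

ranked = sorted(likelihood, key=prior_weighted_score, reverse=True)
for model in ranked:
    print(f"{model}: {prior_weighted_score(model):.2e}")
```

With these made-up priors the tie between the two directed models is broken in favor of “G1 regulates G2”, which shows how outside knowledge can settle a direction that the expression data alone cannot.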

Adding in Other Data Types
- We can also incorporate other pieces of evidence in much the same way
- Example: suppose we have microarrays and TF binding site predictions:
Pr{D, ACGATCTCA… | M} = Pr{D | M} x Pr{ACGATCTCA… | M}
- Pr{D | M}: evaluate as before
- Pr{ACGATCTCA… | M}: evaluate by a binding site prediction method (e.g., PSSM)

Moving from Discrete to Real-Valued Data
We can also drop the need for discrete (on or off) data by making an assumption of how values vary in the absence of regulation, e.g., Gaussian:
[Figure: Gaussian curve with example measurements 1.5, -0.3, 0.4, -1.2]
Pr{1.5, -0.3, 0.4, -1.2} = [formula shown as an image on the slide]
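One way to evaluate such a term is sketched below, assuming independent measurements drawn from a Gaussian with mean 0 and standard deviation 1. Those parameter values and the function names are assumptions made for illustration; the slide's own formula is not preserved in the transcript. For real-valued data the quantities multiplied are densities rather than probabilities.

```python
import math

def gaussian_density(x, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def row_density(values, mean=0.0, sd=1.0):
    """Product of per-measurement densities, treating measurements as independent."""
    prob = 1.0
    for x in values:
        prob *= gaussian_density(x, mean, sd)
    return prob

# Example measurements from the slide; mean 0 and sd 1 are assumed.
values = [1.5, -0.3, 0.4, -1.2]
print(row_density(values))
```

In practice the mean and standard deviation would be estimated from the data, and log densities would be summed instead of multiplying raw densities, to avoid underflow on long rows.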

Finding the Best Model
- We now know how to compare different network models, but finding the best model is not easy; there are far too many possibilities to compare them all
- Algorithms for model inference are a more complex topic than we can cover here, but there are some general approaches to be aware of:
  - optimization: many specialized methods exist for finding the best model without trying everything; solving hard problems of this type is a core concern in computer science
  - sampling: there are also many specialized methods for randomly generating solutions likely to be “good” and seeing what model features are preserved across most solutions; this is a core concern of statisticians
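To get a feel for how quickly the number of candidate models grows, the sketch below counts the networks without cycles on n labeled genes using Robinson's recurrence for directed acyclic graphs. It is an illustration added here, not part of the presentation; the function name is made up, and the count ignores the + / - sign of each edge.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_acyclic_networks(n):
    """Number of directed acyclic graphs on n labeled genes (Robinson's recurrence)."""
    if n == 0:
        return 1
    total = 0
    for k in range(1, n + 1):
        # Inclusion-exclusion over the k genes chosen to have no incoming edges.
        sign = 1 if k % 2 == 1 else -1
        total += sign * comb(n, k) * 2 ** (k * (n - k)) * count_acyclic_networks(n - k)
    return total

for n in range(1, 9):
    print(n, count_acyclic_networks(n))
```

For two genes the count is 3, matching the three models of the two-gene example; by five genes it is already 29281, which is why exhaustive comparison is not practical for realistic networks.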

Network Inference in Practice
- The methods covered here are the key ideas behind how people really infer networks from complex data
- The practice is usually more complicated, though: many kinds of data sources, specialized prior probabilities, lots of algorithmic tricks needed to get good results
- If you really want to know the details, these topics are typically covered in a class on machine learning
