CGeMM – University of Louisville Mining gene-gene interactions from microarray data - Coefficient of Determination Marcel Brun – CGeMM - UofL.


1 CGeMM – University of Louisville Mining gene-gene interactions from microarray data - Coefficient of Determination Marcel Brun – CGeMM - UofL

2 Coefficient of Determination (CoD)
Goal: determination of the predictive genetic network.
[Figure: network diagram over Gene 1, Gene 2, Gene 3, Gene 4 and Gene 5]

3 Coefficient of Determination (CoD)
Assume gene G3 is biologically regulated by genes G1 and G2.
What is the relationship? We don’t know, but we can “guess” it using Boolean functions (functions operating on the set {0,1}).
[Figure: Gene G1, Gene G2, Gene G3]

4 Ternary Functions
If the expression of each gene is assumed to have 2 possible values (0 = inactive, 1 = active), we can use Boolean functions to “model” the relationship between the genes.
[Figure: G1 and G2 feed a function ψ whose output is G3]
[Table: one example of a Boolean function ψ(G1,G2), listed over all possible combinations of values for the pair {G1,G2}]
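To make "all possible combinations of values for the pair {G1,G2}" concrete, the short sketch below enumerates the four input configurations and every Boolean function defined on them. The names `configs` and `all_functions` are illustrative and do not come from the slides.

```python
from itertools import product

# The four input configurations for the pair (G1, G2).
configs = list(product([0, 1], repeat=2))

# A Boolean function assigns one output bit to each configuration,
# so there are 2**4 = 16 such functions in total.
all_functions = [dict(zip(configs, outputs)) for outputs in product([0, 1], repeat=4)]

print(configs)
print(len(all_functions))
```

Representing each function as a lookup table keyed by configuration keeps the later steps (error measurement, search for the best function) simple.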

5 Constant Functions
The behavior of gene G3 can also be predicted by a constant function. In this case G3 doesn’t depend on G1 and G2, so we can write “ψ = ψ_0 = c” to specify the function. (The subindex 0 in ψ_0 denotes the absence of predictors.)
[Figure: G1 and G2 feed a function ψ whose output is G3]
[Table: example of a constant function, ψ = c for every combination of G1 and G2]

6 Note
The Boolean function models the biological behaviour, NOT the observed data.
Each binary function mimics the biological behaviour with some degree of fitness. The quality of the fit can be quantified by an error measure.
There is always an optimal binary function that best fits the biological model.

7 Example
Activity of gene 1 (promoter) promotes the activation of gene 3, unless gene 2 (repressor) is active.
[Figure: Gene 1 and Gene 2 acting on Gene 3]
[Table: a possible Boolean function ψ(G1,G2) to represent this biological relationship]
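The truth table from this slide is not preserved in the transcript. One reading of the verbal description is ψ(G1,G2) = G1 AND NOT G2; the sketch below writes that out as an assumption, not as the table the author actually showed.

```python
def psi(g1: int, g2: int) -> int:
    """Assumed model: G3 is active when the promoter G1 is active and the repressor G2 is not."""
    return int(g1 == 1 and g2 == 0)

# Truth table over all configurations of (G1, G2).
truth_table = {(g1, g2): psi(g1, g2) for g1 in (0, 1) for g2 in (0, 1)}

for (g1, g2), g3 in truth_table.items():
    print(g1, g2, g3)
```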

8 Example

9 Error measure for ternary functions
How good is this function ψ at “modeling” the relationship between G1, G2 and G3?
The quality of the function ψ depends on the joint distribution of G1, G2 and G3, and is measured by its error ε[ψ].
In the same way, the constant function defined by ψ_0 = c has an error ε[ψ_0].
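The error formula itself is missing from the transcript. Later slides count prediction mistakes, which suggests that the error is the probability that the prediction differs from G3. The sketch below assumes that definition and uses a made-up joint distribution; the `joint` values, `psi` and `error` are all hypothetical.

```python
# Hypothetical joint distribution P(G1, G2, G3); the values are illustrative only.
joint = {
    (0, 0, 0): 0.25,   (0, 0, 1): 0.0,
    (0, 1, 0): 0.25,   (0, 1, 1): 0.0,
    (1, 0, 0): 0.0625, (1, 0, 1): 0.1875,
    (1, 1, 0): 0.1875, (1, 1, 1): 0.0625,
}

def psi(g1, g2):
    # Assumed example model: G1 activates G3 unless G2 is active.
    return int(g1 == 1 and g2 == 0)

def error(predict, joint):
    """Probability that predict(G1, G2) differs from G3 (assumed error measure)."""
    return sum(p for (g1, g2, g3), p in joint.items() if predict(g1, g2) != g3)

eps_psi = error(psi, joint)                     # error of the two-gene predictor
eps_const0 = error(lambda g1, g2: 0, joint)     # error of the constant predictor c = 0
print(eps_psi, eps_const0)
```

With this distribution the two-gene predictor has a smaller error than the constant predictor, which is the situation the CoD is designed to quantify.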

10 Optimal Function
Among all possible Boolean functions ψ, one of them has the minimal error as a predictor of G3 from G1 and G2. This function is called ψ_opt: ε[ψ_opt] ≤ ε[ψ] for any other Boolean function ψ.
If G1 and G2 are good predictors of G3, then the relationship between them will be “captured” by ψ_opt, and ε[ψ_opt] will be small.
The optimal constant predictor is called ψ_0-opt (there are only 2 possible constant predictors: 0 and 1). If G3 is almost constant, then ε[ψ_0-opt] will be small.
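When the joint distribution is known, the optimal function can be built one configuration at a time by choosing the more probable value of G3. The sketch below does this for a made-up distribution and checks the result against a brute-force search over all 16 Boolean functions. It assumes the error is the misprediction probability; all names and numbers are hypothetical.

```python
from itertools import product

# Hypothetical joint distribution P(G1, G2, G3); the values are illustrative only.
joint = {
    (0, 0, 0): 0.25,   (0, 0, 1): 0.0,
    (0, 1, 0): 0.25,   (0, 1, 1): 0.0,
    (1, 0, 0): 0.0625, (1, 0, 1): 0.1875,
    (1, 1, 0): 0.1875, (1, 1, 1): 0.0625,
}
configs = list(product([0, 1], repeat=2))

def error(table, joint):
    """Misprediction probability of a function given as a lookup table."""
    return sum(p for (g1, g2, g3), p in joint.items() if table[(g1, g2)] != g3)

# psi_opt picks, for each configuration, the more probable value of G3.
psi_opt = {(g1, g2): max((0, 1), key=lambda g3: joint[(g1, g2, g3)]) for g1, g2 in configs}

# Brute-force check over all 16 Boolean functions of (G1, G2).
all_tables = [dict(zip(configs, outs)) for outs in product([0, 1], repeat=4)]
best_error = min(error(t, joint) for t in all_tables)

# Optimal constant predictor: the more probable value of G3 overall.
eps_const = {c: sum(p for (_, _, g3), p in joint.items() if g3 != c) for c in (0, 1)}
c_opt = min(eps_const, key=eps_const.get)

print(psi_opt, error(psi_opt, joint), best_error, c_opt, eps_const[c_opt])
```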

11 Coefficient of determination
The Coefficient of Determination (CoD) of the pair of genes G1 and G2 as predictors of the gene G3 is the relative improvement in prediction obtained by using the optimal predictor ψ_opt instead of the optimal constant predictor ψ_0-opt.
The CoD depends ONLY on the joint distribution of G1, G2 and G3. (It depends on the biological behaviour and NOT on the observed data.)
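The formula on this slide is not preserved in the transcript. A form consistent with the verbal definition ("relative improvement") and with the worked numbers later in the deck would be the following, where the symbol θ for the CoD is an assumption:

```latex
\theta \;=\; \frac{\varepsilon[\psi_{0\text{-opt}}] \;-\; \varepsilon[\psi_{\text{opt}}]}{\varepsilon[\psi_{0\text{-opt}}]}
```

Because the constant functions are themselves Boolean functions of (G1, G2), ε[ψ_opt] ≤ ε[ψ_0-opt], so under this form θ lies between 0 and 1 whenever ε[ψ_0-opt] > 0. The expression is undefined when ε[ψ_0-opt] = 0, that is, when G3 is exactly constant.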

12 Estimation of the CoD for G1, G2 and G3
Microarray
Example of ternary expression matrix
Estimation of the optimal functions ψ_opt and ψ_0-opt for {G1,G2} as predictors of G3
Estimated CoD for {G1,G2} as predictors of G3
[Table: expression matrix with rows G1, G2, G3, … and columns Exp 1 to Exp 8]

13 Estimation of ε[ψ_opt] for G1, G2 and G3 from the data
Splitting of the ternary expression matrix for G1, G2 and G3 into training and test sets:
TRAIN: Exp 1, Exp 2, Exp 3, Exp 4
TEST: Exp 5, Exp 6, Exp 7, Exp 8
[Tables: rows G1, G2, G3 for the full matrix (Exp 1 to Exp 8), the TRAIN set and the TEST set]

14 Estimation of ε[ψ_opt] for G1, G2 and G3 from the data
Statistical inference of the optimal function ψ_opt from the TRAIN set (Exp 1 to Exp 4): for each configuration of (G1,G2), ψ_opt(G1,G2) is the most frequent value of G3 computed from the data. X denotes a non-observed configuration; in this example the configuration (0,0) is marked X.
Generalization is used to fill the non-observed configurations.
Estimation of the error ε[ψ_opt] from the TEST set (Exp 5 to Exp 8): 1 mistake on 4 → ε*[ψ_opt] = 0.25
[Tables: TRAIN and TEST sets with rows G1, G2, G3; table of ψ_opt(G1,G2)]
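The procedure on this slide can be sketched as follows. The expression values from the slide are not preserved, so the `train` and `test` samples here are hypothetical, chosen so that (0,0) is unobserved and the test error comes out at 1 mistake on 4. The slide does not say how non-observed configurations are filled; falling back to the most frequent training value of G3 is an assumption of this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical binary samples (G1, G2, G3); not the slide's actual data.
train = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 1, 1)]   # Exp 1 to Exp 4
test = [(1, 1, 0), (1, 1, 0), (1, 0, 1), (0, 1, 0)]    # Exp 5 to Exp 8

def fit_psi(train):
    """Return (table, fallback): most frequent G3 per observed (G1, G2),
    plus the most frequent G3 overall for unobserved configurations (assumed rule)."""
    per_config = defaultdict(Counter)
    overall = Counter()
    for g1, g2, g3 in train:
        per_config[(g1, g2)][g3] += 1
        overall[g3] += 1
    table = {cfg: counts.most_common(1)[0][0] for cfg, counts in per_config.items()}
    fallback = overall.most_common(1)[0][0]
    return table, fallback

def error_rate(table, fallback, test):
    """Fraction of test samples where the fitted function mispredicts G3."""
    mistakes = sum(table.get((g1, g2), fallback) != g3 for g1, g2, g3 in test)
    return mistakes / len(test)

table, fallback = fit_psi(train)
eps_opt = error_rate(table, fallback, test)
print(table, fallback, eps_opt)
```

Fitting on one half and scoring on the other keeps the error estimate from being flattered by the same samples that chose the function.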

15 Estimation of ε[ψ_0-opt] for G1, G2 and G3 from the data
Statistical inference of the optimal function ψ_0-opt from the TRAIN set (Exp 1 to Exp 4): compute the frequencies of the possible values of G3 on the training data and take the most frequent observed value, giving ψ_0-opt = 1 (use heuristic).
Estimation of the error ε[ψ_0-opt] from the TEST set (Exp 5 to Exp 8): 3 mistakes on 4 → ε*[ψ_0-opt] = 0.75
[Tables: TRAIN and TEST sets with rows G1, G2, G3; table of frequencies of G3]
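The constant predictor only needs the G3 row of the matrix. In the sketch below the `train_g3` and `test_g3` values are hypothetical, picked so that the outcome matches the slide (ψ_0-opt = 1 and 3 mistakes on 4); they are not the original data.

```python
from collections import Counter

# Hypothetical G3 values; not the slide's actual data.
train_g3 = [1, 1, 0, 1]   # Exp 1 to Exp 4
test_g3 = [0, 0, 1, 0]    # Exp 5 to Exp 8

# Optimal constant predictor: the most frequent value of G3 in the training set.
c_opt = Counter(train_g3).most_common(1)[0][0]

# Test error: fraction of test samples whose G3 differs from the constant.
eps_0 = sum(g3 != c_opt for g3 in test_g3) / len(test_g3)
print(c_opt, eps_0)
```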

16 Estimation of the CoD for G1, G2 and G3 from the data
ε*[ψ_0-opt] = 0.75
ε*[ψ_opt] = 0.25
The error is reduced by about 67%: θ* = (0.75 − 0.25) / 0.75 ≈ 0.67
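The arithmetic on this slide can be wrapped in a small helper. The function name `cod` is illustrative, and returning 0.0 when the constant predictor already has zero error is an assumed convention that the slides do not specify.

```python
def cod(eps_0: float, eps_opt: float) -> float:
    """Relative error reduction of the optimal predictor over the optimal constant.

    Returns 0.0 when eps_0 is 0 (assumed convention; the ratio is undefined there).
    """
    if eps_0 == 0:
        return 0.0
    return (eps_0 - eps_opt) / eps_0

theta_star = cod(0.75, 0.25)
print(round(theta_star, 4))
```

Note that an estimate computed from test data can be negative if the fitted function happens to do worse on the test set than the constant predictor.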

17 Estimation of the CoD for G1, G2 and G3
The previous process is repeated 1000 times, with different random splits of the data into training and test sets. The estimated value of the CoD is the average of the 1000 values of θ*.
If we want to know the predictive power of another pair of genes, say G4, G5, over G3, we must repeat the whole process:
G1, G2 → G3: θ_3,12
G4, G5 → G3: θ_3,45
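A minimal sketch of the repeated-split estimate is given below. Several details are assumptions rather than facts from the slides: an equal-size train/test split, falling back to the constant predictor for non-observed configurations, and counting a split as 0 when the constant predictor makes no test mistakes. The sample data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def split_cod(train, test):
    """CoD estimate for one train/test split of (G1, G2, G3) samples."""
    per_config = defaultdict(Counter)
    overall = Counter()
    for g1, g2, g3 in train:
        per_config[(g1, g2)][g3] += 1
        overall[g3] += 1
    constant = overall.most_common(1)[0][0]
    table = {cfg: c.most_common(1)[0][0] for cfg, c in per_config.items()}
    # Assumption: non-observed configurations fall back to the constant predictor.
    err_opt = sum(table.get((g1, g2), constant) != g3 for g1, g2, g3 in test) / len(test)
    err_0 = sum(constant != g3 for _, _, g3 in test) / len(test)
    # Assumption: a split where the constant predictor is already perfect contributes 0.
    return 0.0 if err_0 == 0 else (err_0 - err_opt) / err_0

def estimate_cod(samples, n_splits=1000, seed=0):
    """Average the per-split CoD over random half/half splits."""
    rng = random.Random(seed)
    half = len(samples) // 2
    values = []
    for _ in range(n_splits):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        values.append(split_cod(shuffled[:half], shuffled[half:]))
    return sum(values) / len(values)

# Hypothetical data: G3 = G1 AND NOT G2, with 10 samples per configuration.
samples = [(g1, g2, int(g1 and not g2)) for g1 in (0, 1) for g2 in (0, 1)] * 10
theta_hat = estimate_cod(samples)
print(theta_hat)
```

Averaging over many splits reduces the dependence of the estimate on any single partition, which matters when only a handful of experiments are available.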

18 Methodology
Compute the CoD for all sets of 1, 2 and 3 predictors for each target gene. Quality of prediction: CoD.
Example for target Gene 1:
1 predictor: Gene 2; Gene 3; Gene 4; Gene 5
2 predictors: Gene 2 & 3; Gene 2 & 4; Gene 2 & 5; Gene 3 & 4; Gene 3 & 5; Gene 4 & 5
3 predictors: Gene 2,3,4; Gene 2,3,5; Gene 2,4,5; Gene 3,4,5
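The candidate predictor sets for one target can be generated with `itertools.combinations`. The function name `predictor_sets` and the integer gene labels are illustrative choices.

```python
from itertools import combinations

genes = [1, 2, 3, 4, 5]

def predictor_sets(target, genes, max_size=3):
    """All sets of 1 to max_size predictor genes, excluding the target itself."""
    others = [g for g in genes if g != target]
    return [combo for k in range(1, max_size + 1) for combo in combinations(others, k)]

sets_for_gene1 = predictor_sets(1, genes)
for s in sets_for_gene1:
    print(s)
```

For five genes this yields 4 single predictors, 6 pairs and 4 triples per target, matching the list on the slide.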

19 Results
Most probable predictor sets for each gene.
[Figure: target Gene 1 with predictor sets Gene 2, Gene 2 & 3, and Gene 2,3,4, annotated with their CoD values θ_2, θ_23 and θ_234]

20 Result
Determination of the predictive genetic network.
[Figure: network diagram over Gene 1, Gene 2, Gene 3, Gene 4 and Gene 5]

21 Discussion
The CoD can be applied to ternary data, to more general discrete data, and to continuous data, by restricting the family of functions (linear, neural network, etc.).
This technique is a “feature selection” technique that analyzes all the possibilities. Existing algorithms can be applied to optimize the search, at the expense of the quality of the result (e.g. genetic algorithms, sub-optimal search).
Intrinsic dependencies between the variables can hide true relationships: Bayesian networks?

22 Conclusions
CoD is a useful tool in the determination of the predictive genetic network.
Computationally expensive: feasible only for predictor sets of up to 3 genes and for gene sets of moderate size.
Does not give information about the functions, but they can be estimated easily from the data.
[Figure: Data, Biology, Model, Exp, CoD]
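To give a rough sense of the cost, the sketch below counts how many (target, predictor set) pairs an exhaustive search visits when every gene is treated as a target. The function name and the gene counts used are illustrative; each pair would additionally need the repeated train/test estimation.

```python
from math import comb

def n_cod_evaluations(n_genes: int, max_predictors: int = 3) -> int:
    """Number of (target, predictor set) pairs when every gene is a target."""
    per_target = sum(comb(n_genes - 1, k) for k in range(1, max_predictors + 1))
    return n_genes * per_target

for n in (5, 100):
    print(n, n_cod_evaluations(n))
```

The count grows roughly with the fourth power of the number of genes for predictor sets of size up to 3, which is why larger sets quickly become impractical.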

