# Mining gene-gene interactions from microarray data: Coefficient of Determination

Marcel Brun, CGeMM, University of Louisville (UofL)

## Presentation transcript


Goal: determination of the predictive genetic network, using the Coefficient of Determination (CoD).

[Figure: two rows of nodes, each labelled Gene 1, Gene 2, Gene 3, Gene 4, Gene 5.]

Coefficient of Determination (CoD): Assume gene G3 is biologically regulated by genes G1 and G2. What is the relationship? We don't know, but we can "guess" it using Boolean functions (functions operating on the set {0,1}).

[Diagram: Gene G1, Gene G2, Gene G3.]

Boolean functions: If the expression of each gene is assumed to take 2 possible values (0 = inactive, 1 = active), we can use Boolean functions to "model" the relationship between the genes.

One example of a Boolean function ψ, listed over all possible combinations of values for the pair {G1, G2}:

| G1 | G2 | ψ(G1, G2) |
|----|----|-----------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

[Diagram: G1 and G2 feed ψ, which predicts G3.]
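As a small illustration, the truth table on this slide can be written as a lookup table. This is only a sketch: the names `PSI`, `psi` and `psi_formula` are made up for the example, and the logical formula is simply one way of reading the table.

```python
from itertools import product

# Truth table from the slide: the output is 1 only when G1 = 1 and G2 = 0.
PSI = {(0, 0): 0, (0, 1): 0, (1, 0): 1, (1, 1): 0}

def psi(g1: int, g2: int) -> int:
    """Predict G3 from the pair (G1, G2) by table lookup."""
    return PSI[(g1, g2)]

# All possible combinations of values for the pair {G1, G2}.
all_inputs = list(product((0, 1), repeat=2))

def psi_formula(g1: int, g2: int) -> int:
    """The same table read as a logical formula: G1 AND NOT G2."""
    return int(g1 == 1 and g2 == 0)

# Check that the formula and the table agree on every input.
agrees = all(psi(a, b) == psi_formula(a, b) for a, b in all_inputs)
```

A two-input Boolean function is fully described by its four table entries, so a dictionary is enough to represent it.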

Constant functions: The behavior of gene G3 can also be predicted by a constant function. In this case G3 doesn't depend on G1 and G2, so we can write "ψ = ψ0 = c" to specify the function (the subindex 0 in ψ0 denotes the absence of predictors).

Example of a constant function:

| G1 | G2 | ψ = c |
|----|----|-------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |

Note: The Boolean function is a model of the biology, NOT of the observed data. Each binary function mimics the biological behaviour with some degree of fit. The quality of this fit can be quantified with an error measure. There is always an optimal binary function, the one that best fits the biological model.

Example: Activity of gene 1 (promoter) promotes the activation of gene 3, unless gene 2 (repressor) is active.

A possible Boolean function to represent this biological relationship:

| G1 | G2 | ψ(G1, G2) |
|----|----|-----------|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |


Error measure for Boolean functions: How good is this function ψ as a "model" of the relationship between G1, G2 and G3? The quality of ψ, measured by its error ε[ψ], depends on the joint distribution of G1, G2 and G3. In the same way, an error ε[ψ0] is defined for the constant function ψ0 = c.
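The formulas from this slide are missing from the transcript. One reading that agrees with the later slides, where the error is estimated as the fraction of wrong predictions on a test set, is the probability of misprediction under the joint distribution. The symbols ψ (predictor) and ε (error) are assumed names here:

```latex
\varepsilon[\psi] = P\bigl(\psi(G_1, G_2) \neq G_3\bigr),
\qquad
\varepsilon[\psi_0] = P\bigl(G_3 \neq c\bigr)
```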

Optimal function: Among all possible Boolean functions ψ, one has the minimal error as a predictor of G3 from G1 and G2. This function is called ψopt, and ε[ψopt] ≤ ε[ψ] for any other Boolean function ψ. If G1 and G2 are good predictors of G3, then the relationship between them will be "captured" by ψopt and ε[ψopt] will be small.

The optimal constant predictor is called ψ0-opt (there are only 2 possible constant predictors: 0 and 1). If G3 is almost constant, then ε[ψ0-opt] will be small.
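The search for the optimal function can be sketched by brute force, since there are only 16 Boolean functions of two variables. The joint distribution `JOINT` below is invented purely for illustration (it is not from the slides), and the error is assumed to be the probability of a wrong prediction.

```python
from itertools import product

INPUTS = list(product((0, 1), repeat=2))  # all (G1, G2) configurations

# Hypothetical joint distribution P(G1, G2, G3); made-up numbers that sum to 1.
JOINT = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.05,
    (0, 1, 0): 0.20, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20,
    (1, 1, 0): 0.20, (1, 1, 1): 0.05,
}

def error(table):
    """Probability that the function given by `table` mispredicts G3."""
    return sum(p for (g1, g2, g3), p in JOINT.items() if table[(g1, g2)] != g3)

# Each Boolean function of two variables is one assignment of outputs to the 4 inputs.
all_functions = [dict(zip(INPUTS, outputs)) for outputs in product((0, 1), repeat=4)]

psi_opt = min(all_functions, key=error)
eps_opt = error(psi_opt)

# Only two constant predictors exist: always 0 and always 1.
constants = [{x: c for x in INPUTS} for c in (0, 1)]
psi0_opt = min(constants, key=error)
eps0_opt = error(psi0_opt)
```

Because the constant functions are among the 16 candidates, the optimal function can never have a larger error than the optimal constant.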

Coefficient of Determination: The Coefficient of Determination (CoD) of the pair of genes G1 and G2 as predictors of gene G3 is the relative improvement in prediction obtained by using the optimal predictor ψopt instead of the optimal constant predictor ψ0-opt. The CoD depends ONLY on the joint distribution of G1, G2 and G3 (it depends on the biological behaviour and NOT on the observed data).
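The defining formula is missing from the transcript. Written out, "relative improvement" over the constant predictor would read as follows, which also agrees with the worked example later in the deck (errors of 0.75 and 0.25 giving a reduction of two thirds). The symbol θ for the CoD is an assumed name:

```latex
\theta = \frac{\varepsilon[\psi_{0\text{-opt}}] - \varepsilon[\psi_{\text{opt}}]}
              {\varepsilon[\psi_{0\text{-opt}}]}
```

Under this reading θ = 1 when the predictors determine the target exactly, and θ = 0 when they bring no improvement over the best constant.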

Estimation of the CoD for G1, G2 and G3: Example of a binary expression matrix obtained from microarray experiments:

| | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 | Exp 7 | Exp 8 |
|----|---|---|---|---|---|---|---|---|
| G1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| G2 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| G3 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| … | … | … | … | … | … | … | … | … |

From this matrix we estimate the optimal functions ψopt and ψ0-opt for {G1, G2} as predictors of G3, and from them the CoD for {G1, G2} as predictors of G3.

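Estimation of ε[ψopt] for G1, G2 and G3 from the data: Binary expression matrix for G1, G2 and G3:

| | Exp 1 | Exp 2 | Exp 3 | Exp 4 | Exp 5 | Exp 6 | Exp 7 | Exp 8 |
|----|---|---|---|---|---|---|---|---|
| G1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| G2 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| G3 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |

The matrix is split into a training set and a test set:

| TRAIN | Exp 1 | Exp 2 | Exp 3 | Exp 4 |
|----|---|---|---|---|
| G1 | 1 | 1 | 0 | 1 |
| G2 | 0 | 1 | 1 | 1 |
| G3 | 0 | 1 | 0 | 1 |

| TEST | Exp 5 | Exp 6 | Exp 7 | Exp 8 |
|----|---|---|---|---|
| G1 | 1 | 1 | 1 | 1 |
| G2 | 0 | 1 | 0 | 1 |
| G3 | 0 | 1 | 0 | 0 |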

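Estimation of ε[ψopt] for G1, G2 and G3 from the data: Statistical inference of the optimal function ψopt from the training set:

| TRAIN | Exp 1 | Exp 2 | Exp 3 | Exp 4 |
|----|---|---|---|---|
| G1 | 1 | 1 | 0 | 1 |
| G2 | 0 | 1 | 1 | 1 |
| G3 | 0 | 1 | 0 | 1 |

For each configuration of (G1, G2), the most frequent value of G3 is computed from the data (X denotes a non-observed configuration). Generalization then fills in the non-observed configurations:

| G1 | G2 | Most frequent G3 in training data | ψopt(G1, G2) after generalization |
|----|----|---|---|
| 0 | 0 | X | 0 |
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 |

Estimation of the error ε[ψopt] from the test set:

| TEST | Exp 5 | Exp 6 | Exp 7 | Exp 8 |
|----|---|---|---|---|
| G1 | 1 | 1 | 1 | 1 |
| G2 | 0 | 1 | 0 | 1 |
| G3 | 0 | 1 | 0 | 0 |

1 mistake in 4 (Exp 8), so ε*[ψopt] = 0.25.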

Estimation of ε[ψ0-opt] for G1, G2 and G3 from the data: Statistical inference of the optimal constant function ψ0-opt from the training set:

| TRAIN | Exp 1 | Exp 2 | Exp 3 | Exp 4 |
|----|---|---|---|---|
| G1 | 1 | 1 | 0 | 1 |
| G2 | 0 | 1 | 1 | 1 |
| G3 | 0 | 1 | 0 | 1 |

Frequencies of the possible values of G3 in the training data:

| G3 | Frequency |
|----|-----------|
| 0 | 2 |
| 1 | 2 |

ψ0-opt is the most frequent observed value of G3. Here the two values are tied, so a heuristic is used: ψ0-opt = 1.

Estimation of the error ε[ψ0-opt] from the test set:

| TEST | Exp 5 | Exp 6 | Exp 7 | Exp 8 |
|----|---|---|---|---|
| G1 | 1 | 1 | 1 | 1 |
| G2 | 0 | 1 | 0 | 1 |
| G3 | 0 | 1 | 0 | 0 |

3 mistakes in 4, so ε*[ψ0-opt] = 0.75.
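The constant predictor can be estimated the same way. In this sketch the tie-breaking rule is an explicit parameter, since the slide only says that a heuristic is used and shows the result 1; the name `fit_constant` is invented for the example.

```python
from collections import Counter

train_g3 = [0, 1, 0, 1]  # G3 in Exp 1 to 4 (training set)
test_g3 = [0, 1, 0, 0]   # G3 in Exp 5 to 8 (test set)

def fit_constant(values, tie_break=1):
    """Most frequent value of G3; `tie_break` is returned when 0 and 1 tie."""
    counts = Counter(values)
    if counts[0] == counts[1]:
        return tie_break  # assumed heuristic, matching the value on the slide
    return 0 if counts[0] > counts[1] else 1

psi0_opt = fit_constant(train_g3)
# Fraction of test experiments where the constant differs from G3.
eps0_opt = sum(g3 != psi0_opt for g3 in test_g3) / len(test_g3)
```

With only four training experiments the tie-break matters: choosing 0 instead of 1 would give a test error of 0.25 on this split.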

Estimation of the CoD for G1, G2 and G3 from the data: With ε*[ψ0-opt] = 0.75 and ε*[ψopt] = 0.25, the error is reduced by two thirds (about 66.7%).
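Spelled out, and taking the CoD estimate to be the relative error reduction (θ* is an assumed symbol for that estimate), the arithmetic for this split is:

```latex
\theta^{*} = \frac{\varepsilon^{*}[\psi_{0\text{-opt}}] - \varepsilon^{*}[\psi_{\text{opt}}]}
                  {\varepsilon^{*}[\psi_{0\text{-opt}}]}
           = \frac{0.75 - 0.25}{0.75} = \frac{2}{3} \approx 0.667
```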

Estimation of the CoD for G1, G2 and G3: The previous process is repeated 1000 times, with different random splits of the data into training and test sets. The estimated value of the CoD is the average of the 1000 values of θ*.

If we want to know the predictive power of another pair of genes, say G4, G5, over G3, we must repeat the whole process. Each combination has its own CoD: one θ for G1, G2 → G3 (target 3, predictors 1 and 2), and another θ for G4, G5 → G3 (target 3, predictors 4 and 5).
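A sketch of the repeated-split procedure follows. Several details are assumptions, because the slides do not specify them: half of the experiments go to the training set, ties are broken towards 1, non-observed configurations predict 0, and splits where the constant predictor makes no mistake (so the ratio is undefined) are skipped. All function names are invented.

```python
import random
from collections import Counter

def majority(values, tie_break=1):
    """Most frequent binary value; ties go to `tie_break` (assumed rule)."""
    counts = Counter(values)
    if counts[0] == counts[1]:
        return tie_break
    return 0 if counts[0] > counts[1] else 1

def cod_one_split(rows, train_idx, test_idx):
    """rows: list of (predictor_tuple, target). Returns one CoD estimate or None."""
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    # Constant predictor: most frequent target value in the training set.
    c = majority([y for _, y in train])
    # Table predictor: majority target per observed configuration.
    by_config = {}
    for x, y in train:
        by_config.setdefault(x, []).append(y)
    table = {x: majority(ys) for x, ys in by_config.items()}
    # Non-observed configurations predict 0 (assumed generalization rule).
    eps_opt = sum(table.get(x, 0) != y for x, y in test) / len(test)
    eps0 = sum(c != y for _, y in test) / len(test)
    if eps0 == 0:
        return None  # ratio undefined for this split; skipped by assumption
    return (eps0 - eps_opt) / eps0

def estimate_cod(rows, n_splits=1000, train_fraction=0.5, seed=0):
    """Average the per-split CoD estimates over random train/test splits."""
    rng = random.Random(seed)
    n_train = int(len(rows) * train_fraction)
    estimates = []
    for _ in range(n_splits):
        idx = list(range(len(rows)))
        rng.shuffle(idx)
        value = cod_one_split(rows, idx[:n_train], idx[n_train:])
        if value is not None:
            estimates.append(value)
    return sum(estimates) / len(estimates) if estimates else float("nan")

# The eight experiments from the slides: ((G1, G2), G3).
rows = [((1, 0), 0), ((1, 1), 1), ((0, 1), 0), ((1, 1), 1),
        ((1, 0), 0), ((1, 1), 1), ((1, 0), 0), ((1, 1), 0)]
estimate = estimate_cod(rows)
```

Individual splits can give negative values when the table predictor does worse on the test set than the constant, which is one reason to average over many splits.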

Methodology: Compute the CoD for all sets of 1, 2 and 3 predictors for each target gene. For example, for the target Gene 1:

1 predictor: Gene 2; Gene 3; Gene 4; Gene 5.

2 predictors: Gene 2 & 3; Gene 2 & 4; Gene 2 & 5; Gene 3 & 4; Gene 3 & 5; Gene 4 & 5.

3 predictors: Gene 2,3,4; Gene 2,3,5; Gene 2,4,5; Gene 3,4,5.

Quality of prediction: CoD.
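The enumeration of predictor sets on this slide can be sketched with `itertools.combinations`. The function name `predictor_sets` is invented, and excluding the target from its own predictors follows the slide's example, where Gene 1 is predicted from Genes 2 to 5.

```python
from itertools import combinations

genes = [1, 2, 3, 4, 5]

def predictor_sets(target, genes, max_size=3):
    """Yield all sets of 1..max_size predictors drawn from the non-target genes."""
    candidates = [g for g in genes if g != target]
    for size in range(1, max_size + 1):
        yield from combinations(candidates, size)

# 4 single genes + 6 pairs + 4 triples, as listed on the slide.
sets_for_gene1 = list(predictor_sets(1, genes))
```

A CoD would then be estimated for each of these sets, and the whole enumeration repeated for every target gene.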

Results: Most probable predictor sets for each gene.

[Figure: Gene 1 shown with the predictor sets Gene 2; Gene 2 and 3; Gene 2, 3, 4, each labelled with its CoD.]

Result: Determination of the predictive genetic network.

[Figure: two rows of nodes, each labelled Gene 1, Gene 2, Gene 3, Gene 4, Gene 5.]

Discussion: The CoD can be applied to ternary data, to more general discrete data and to continuous data, by restricting the family of functions (linear, neural network, etc.).

This technique is a "feature selection" technique that analyzes all the possibilities. Existing algorithms (e.g. genetic algorithms, sub-optimal search) can be applied to speed up the search, at the expense of the quality of the result.

Intrinsic dependencies between the variables can hide true relationships: Bayesian networks?

Conclusions: CoD is a useful tool for determining the predictive genetic network. It is computationally expensive: feasible only for sets of up to 3 predictors on moderately sized sets of 200-500 genes. It does not give information about the functions, but they can be estimated easily from the data.

[Diagram labels: Data, Biology, Model, Exp, CoD.]
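To give a feel for the cost, the number of (target, predictor set) combinations can be counted directly. This sketch assumes every gene is used as a target and that predictor sets of size 1 to 3 are drawn from the remaining genes; `count_cod_evaluations` is an invented name.

```python
from math import comb

def count_cod_evaluations(n_genes, max_size=3):
    """Number of (target, predictor set) pairs with 1..max_size predictors."""
    per_target = sum(comb(n_genes - 1, k) for k in range(1, max_size + 1))
    return n_genes * per_target

small = count_cod_evaluations(5)     # the 5-gene example: 5 targets x 14 sets
moderate = count_cod_evaluations(200)
large = count_cod_evaluations(500)
```

Each of these combinations is itself estimated over repeated random splits (1000 in the procedure described earlier), and the count grows roughly with the fourth power of the number of genes when sets of up to 3 predictors are used.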
