Interaction-based Learning in Genomics

1 Interaction-based Learning in Genomics
Shaw-Hwa Lo, Tian Zheng (Columbia University) & Herman Chernoff (Harvard University)

2 Other Collaborators: Iulian Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Adeline Lo

3 Partition-Retention
We have n observations on a dependent variable Y and many discrete-valued explanatory variables X1, X2, ..., XS. We wish:
1) to identify those explanatory variables that influence Y;
2) to predict Y, based on the findings in 1).
We assume that Y is influenced by a number of small groups of interacting variables (group sizes ~1 to 8, depending on sample sizes and effects).

4 Marginal Effects: Causal and Observable
1. If Xi has an effect on Y, we expect Y to be correlated with Xi or some function of Xi. In that case Xi has a causal and observable marginal effect.
2. A variable Xi unrelated to (independent of) Y should be uncorrelated with Y except for random variation. But if S (the number of variables) is large and n is moderate, some of the explanatory variables not influencing Y may have a substantial correlation (a marginal observable effect) with Y. They are impostors.
3. A group of important interacting influential variables may or may not have marginal observable effects (MOE). Therefore, methods that rely on the presence of strong observable marginal effects are unlikely to succeed if the MOE are weak.

5 Ex. 1: X1 and X2 are independent with P(Xi = 1) = P(Xi = −1) = 1/2, and Y = X1X2. Then E(Y | X1) = E(Y | X2) = 0, so Y is uncorrelated with X1 and with X2, although the pair determines Y.
Ex. 2: Y = X1X2 with P(Xi = 1) = 3/4 and P(Xi = −1) = 1/4. Here Y is correlated with X1 and X2, and the sample will clearly show marginal observable effects (which can be detected by a t-test).
In both examples, it is the interaction of X1 and X2 together that influences Y.
Conclusion: To detect interacting influential variables, it is desirable and sometimes necessary to consider interactive effects. Impostors may present observable marginal effects if S is large and n is moderate.
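The two examples can be checked by exact enumeration rather than simulation. The sketch below computes the population correlation between Y = X1X2 and X1 for a given P(Xi = 1) = p; the function name `corr_y_x1` is an illustrative choice, not something from the slides.

```python
from itertools import product
from math import sqrt

def corr_y_x1(p):
    """Exact correlation of Y = X1*X2 with X1 when P(Xi = 1) = p and P(Xi = -1) = 1 - p."""
    prob = {1: p, -1: 1 - p}
    e_y = e_x = e_yy = e_xx = e_xy = 0.0
    # Enumerate the four (x1, x2) outcomes and accumulate weighted moments.
    for x1, x2 in product((1, -1), repeat=2):
        w = prob[x1] * prob[x2]
        y = x1 * x2
        e_y += w * y
        e_x += w * x1
        e_yy += w * y * y
        e_xx += w * x1 * x1
        e_xy += w * x1 * y
    cov = e_xy - e_x * e_y
    return cov / sqrt((e_xx - e_x ** 2) * (e_yy - e_y ** 2))

balanced = corr_y_x1(0.5)     # first example: no marginal effect
unbalanced = corr_y_x1(0.75)  # second example: visible marginal effect
print(balanced, unbalanced)
```

With p = 1/2 the correlation vanishes, while with p = 3/4 it equals 1/sqrt(5), roughly 0.45, even though Y is the same function of X1 and X2 in both cases.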

6 An ideal analytical tool should have the abilities to:
1. handle an extremely large number of variables and their higher-order interactions;
2. detect “module effects”, the phenomenon where a module C (a cluster of variables) holds predictive power but becomes useless for prediction if any one variable is removed;
3. identify interaction effects, where the effect of one variable on a response depends on the values of other variables;
4. detect and utilize nonlinear and non-additive effects.

7 A score with four features
We need a sensible score that:
1. can be used to measure the influence of a group of variables;
2. supports an algorithm for removing noisy and non-informative variables as the dimension changes, meaning the score is measured on the same scale across different dimensions;
3. given a cluster of variables, can be used to test the significance of its influence;
4. ensures that a cluster with a high score (an influential cluster) automatically possesses predictive ability.

8 A Special Case of Influential Measure: Genotype-Trait Distortion
In case-control studies:
$$\mathrm{GTD} = \sum_{j} \left( \frac{n_j^d}{n^d} - \frac{n_j^c}{n^c} \right)^2,$$
where $n_j^d$ and $n_j^c$ are the counts of cases and controls in genotype (partition element) $j$, and $n^d$ and $n^c$ are the total numbers of cases and controls under study. A SNP has 3 genotypes (aa, ab, bb).
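A minimal sketch, assuming the distortion is computed as the sum over genotypes of the squared difference between the case and control genotype frequencies. The function name `gtd` and the genotype counts are hypothetical and chosen only for illustration.

```python
def gtd(case_counts, control_counts):
    """Sum over genotypes of (case frequency - control frequency)^2 (assumed form)."""
    n_cases = sum(case_counts)
    n_controls = sum(control_counts)
    return sum(
        (d / n_cases - c / n_controls) ** 2
        for d, c in zip(case_counts, control_counts)
    )

# Hypothetical counts for the three genotypes (aa, ab, bb) of a single SNP.
cases = [50, 30, 20]
controls = [25, 50, 25]
score = gtd(cases, controls)
print(round(score, 4))
```

When cases and controls have identical genotype frequencies the score is zero; it grows as the two frequency profiles move apart.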

9 A general form
Let Y be the disease status (1 for cases and 0 for controls). Then, for a genotype partition Π, the score we just discussed can be naturally defined as:
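A minimal sketch of such a score, assuming the unnormalized form I = Σ_j n_j² (Ȳ_j − Ȳ)² associated with Chernoff, Lo & Zheng (2009), where n_j is the number of observations in partition element j and Ȳ_j is their mean response. The function name `i_score`, the data layout (one tuple of discrete values per observation), and the toy data are assumptions made for illustration.

```python
from collections import defaultdict

def i_score(x_rows, y, subset):
    """Assumed form: sum over partition cells of n_j^2 * (mean_j - overall mean)^2."""
    y_bar = sum(y) / len(y)
    cells = defaultdict(list)
    # The variables in `subset` jointly define the partition cell of each observation.
    for row, y_i in zip(x_rows, y):
        cells[tuple(row[v] for v in subset)].append(y_i)
    return sum(len(c) ** 2 * (sum(c) / len(c) - y_bar) ** 2 for c in cells.values())

# Toy data: one SNP with genotypes coded 0, 1, 2; Y = 1 for cases, 0 for controls.
x_rows = [(0,), (0,), (1,), (1,), (2,), (2,), (2,), (2,)]
y = [1, 1, 1, 0, 0, 0, 0, 0]
score = i_score(x_rows, y, subset=[0])
print(score)
```

An empty subset puts every observation in a single cell whose mean equals the overall mean, so its score is zero.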

10 Theorem: Under the null hypothesis that none of the variables has an influence, the null distribution of $I_\Pi$, when $Y$ is normalized, is asymptotically a weighted sum of independent chi-square variables. This applies to both the null random-$Y$ and the null specified-$Y$ models, under the standard conditions for the applicability of the CLT. Case-control studies ($Y = 1$ or $0$) are a special case of specified-$Y$ models. (Chernoff, Lo & Zheng, 2009)

11 Example

12 General Setting
The main idea applies much more generally than to special genetic problems. A more general version is proposed to deal with the problem of detecting which of many potentially influential variables Xs have an effect on a dependent variable Y, using a sample of n observations on Z = (X, Y), where X = (X1, X2, ..., XS). In the background is the assumption that Y may be slightly or negligibly influenced by each of a few variables Xs individually, but may be profoundly influenced by the confluence of appropriate values within one or a few small groups of these variables. At this stage the object is not to measure the overall effects of the influential variables, but to discover them efficiently.

13 Example
We introduce the partition retention approach and related terminology and issues by considering a small artificial example. Suppose that an observed variable Y is normally distributed with mean X1X2 and variance 1, where X1 and X2 are two of S = 6 observed and potentially influential variables, each taking the values 0 or 1. Given the data on Y and X = (X1, ..., X6) for n = 200 subjects, the statistician, who does not know this model, wishes to infer which of the six explanatory variables are causally related to Y. In our computation the Xi are selected independently to be 1 with probabilities 0.7, 0.7, 0.5, 0.5, 0.5, 0.5.
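A small simulation of this setup, offered as a sketch only. It assumes the unnormalized score I = Σ_j n_j² (Ȳ_j − Ȳ)² computed over the cells of the partition induced by a subset of variables; the seed, the helper name `i_score`, and the choice of subsets to compare are arbitrary.

```python
import random
from collections import defaultdict

def i_score(x_rows, y, subset):
    """Assumed form: sum over partition cells of n_j^2 * (mean_j - overall mean)^2."""
    y_bar = sum(y) / len(y)
    cells = defaultdict(list)
    for row, y_i in zip(x_rows, y):
        cells[tuple(row[v] for v in subset)].append(y_i)
    return sum(len(c) ** 2 * (sum(c) / len(c) - y_bar) ** 2 for c in cells.values())

rng = random.Random(0)  # arbitrary seed
probs = [0.7, 0.7, 0.5, 0.5, 0.5, 0.5]

# n = 200 subjects, S = 6 binary variables; Y ~ Normal(mean = X1*X2, variance = 1).
x_rows = [tuple(int(rng.random() < p) for p in probs) for _ in range(200)]
y = [rng.gauss(row[0] * row[1], 1.0) for row in x_rows]

score_causal = i_score(x_rows, y, [0, 1])  # partition by the causal pair (X1, X2)
score_noise = i_score(x_rows, y, [2, 3])   # partition by a non-influential pair (X3, X4)
print(score_causal, score_noise)
```

The partition defined by the causal pair should score far higher than one defined by two non-influential variables, which is the contrast the partition retention approach exploits.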

14 Example

15 Example

16 Example

17 Example

18 Example

19 Example

20 A capable analytical tool should have the ability to surmount the following difficulties:
(a) handle an extremely large number of variables (SNPs and other variables in the hundreds of thousands or millions) in the data;
(b) detect the so-called “module effect”, the phenomenon where removing one variable from the current module renders it useless for prediction;
(c) identify interactions (often higher-order effects), where the effect of one variable on a response variable depends on the values of other variables in the same module;
(d) extract and utilize nonlinear (or non-additive) effects.

21 Let the response variable Y and the explanatory variables X (30 Xs, all independent) all be binary, taking values 0 or 1 with 50% chance each. We independently generate 200 observations, and Y is related to X via the model
$$Y = \begin{cases} X_1 + X_2 + X_3 \pmod 2 & \text{with prob. } 0.5 \\ X_4 + X_5 \pmod 2 & \text{with prob. } 0.5 \end{cases}$$
The task is to predict Y based on the information in X. We use 150 observations as the training set and 50 as the test set. This example has a 25% theoretical lower bound for the prediction error rate, since we do not know which of the two causal variable modules generates the response Y.
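The 25% lower bound can be verified by enumeration. The sketch below assumes the two modules are X1 + X2 + X3 (mod 2) and X4 + X5 (mod 2), each generating Y with probability 0.5; the function name `bayes_error` is illustrative. For each configuration of the five relevant variables, the best possible predictor errs with probability min(P(Y = 1 | X), P(Y = 0 | X)).

```python
from itertools import product

def bayes_error():
    """Smallest achievable error rate when Y comes from one of two modules at random."""
    total = 0.0
    for x in product((0, 1), repeat=5):
        m1 = (x[0] + x[1] + x[2]) % 2  # first module (assumed: X1 + X2 + X3 mod 2)
        m2 = (x[3] + x[4]) % 2         # second module (assumed: X4 + X5 mod 2)
        p_y1 = 0.5 * m1 + 0.5 * m2     # P(Y = 1 | X)
        total += min(p_y1, 1 - p_y1) / 32  # each of the 32 configurations is equally likely
    return total

lower_bound = bayes_error()
print(lower_bound)
```

The two modules agree for half of the configurations (error 0) and disagree for the other half (error 0.5), which gives the 25% bound.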

22 Table 1. Classification error rates for the toy example.
Method: LDA, SVM, RF, LogicFS, LL, LASSO, Elastic net, Proposed
Train error: .14 ± .03, .00 ± .00, .13 ± .02, .23 ± .05, .27 ± .06, .21 ± .01
Test error: .47 ± .02, .50 ± .01, .44 ± .04, .34 ± .04, .45 ± .03, .48 ± .04, .24 ± .03

23 Diagrams of conventional approach and the variable-module enabled approach.

24 Basic tool: the Backward dropping algorithm (BDA)
BDA is a “greedy” algorithm that seeks the variable subset maximizing the I-score through stepwise elimination of variables from an initial subset (k variables) sampled from the variable space (p variables), with k << p.

25 Training Set: Consider a training set of n observations (Y, X), where X is a p-dimensional vector of discrete variables. Typically p is very large (thousands).
Sampling randomly from Variable Space: Select an initial subset of k explanatory variables, k << p. Compute the I-score based on these k variables.
Drop Variables: Tentatively drop each variable in turn and recalculate the I-score with one variable less. Then permanently drop the variable whose removal results in the highest I-score.
Return Set: Continue with the next round of dropping on the remaining variables until only one variable is left. Keep the subset that yields the highest I-score over the whole dropping process. Refer to this subset as the return set.
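The dropping procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the unnormalized score I = Σ_j n_j² (Ȳ_j − Ȳ)², breaks ties by taking the first best candidate, and starts from a fixed initial subset instead of a random sample. The names `i_score` and `backward_dropping` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def i_score(x_rows, y, subset):
    """Assumed form: sum over partition cells of n_j^2 * (mean_j - overall mean)^2."""
    y_bar = sum(y) / len(y)
    cells = defaultdict(list)
    for row, y_i in zip(x_rows, y):
        cells[tuple(row[v] for v in subset)].append(y_i)
    return sum(len(c) ** 2 * (sum(c) / len(c) - y_bar) ** 2 for c in cells.values())

def backward_dropping(x_rows, y, initial):
    """Greedy elimination; returns the best-scoring subset seen and its score."""
    current = list(initial)
    best_subset, best_score = tuple(current), i_score(x_rows, y, current)
    while len(current) > 1:
        # Tentatively drop each variable, then keep the reduced subset with the highest score.
        candidates = [[v for v in current if v != d] for d in current]
        current = max(candidates, key=lambda s: i_score(x_rows, y, s))
        score = i_score(x_rows, y, current)
        if score > best_score:
            best_subset, best_score = tuple(current), score
    return best_subset, best_score

# Toy data: all 16 settings of four binary variables; Y depends only on the first two.
x_rows = list(product((0, 1), repeat=4))
y = [row[0] ^ row[1] for row in x_rows]
return_set, return_score = backward_dropping(x_rows, y, initial=[0, 1, 2, 3])
print(return_set, return_score)
```

On this toy data the two irrelevant variables are dropped first, the score peaks at the pair that determines Y, and the score collapses once either member of that pair is removed, which illustrates the module effect.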

26 Figure 5. Change of I-Score

27 Structural diagram of proposed methodology.

28 Classification based on van’t Veer’s Data (2002).
In applying the procedures described in the Discovery stage, we identified 18 influential modules with sizes ranging from 2 to 6. The purpose of the original study was to predict breast cancer relapse using gene expression data. The original data contain the expression levels of 24,187 genes for 97 patients: 46 relapse (distant metastasis < 5 years) and 51 non-relapse (no distant metastasis for ≥ 5 years). We used the 4,918 genes selected by Tibshirani and Efron (2002) for the classification task. Of the 97 patients, 78 were used as the training set (34 relapse and 44 non-relapse) and 19 (12 relapse and 7 non-relapse) as the test set. The best error rate (biased or not) reported in the literature on this particular test set is around 10% (2 errors). The proposed method yields a zero error rate on the test set.

29 The CV error rates of the van’t Veer data are typically around 30%
The proposed method yields an average error rate of 8% over 10 randomly selected CV test samples, roughly a 73% reduction in error rate ((30% − 8%)/30% ≈ 73%) compared with existing methods. We ran the CV experiment by randomly partitioning the 97 patients into a training sample of size 87 and a test sample of size 10, and repeated the experiment ten times.
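The resampling scheme described above (97 patients, 87 for training and 10 for testing, repeated ten times) can be sketched as follows. Only the index splitting is shown; the function name `repeated_holdout` and the seed are assumptions, and fitting and scoring a classifier on each split is left out.

```python
import random

def repeated_holdout(n_samples, test_size, n_repeats, seed=0):
    """Return a list of (train_indices, test_indices) pairs from independent random splits."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # The first `test_size` shuffled indices form the test sample; the rest are for training.
        splits.append((sorted(idx[test_size:]), sorted(idx[:test_size])))
    return splits

splits = repeated_holdout(n_samples=97, test_size=10, n_repeats=10)
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

The reported CV error would then be the average of the test error over the ten splits.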

30 In a case-control design with n cases and n controls, the last line of the equations, divided by $n^2$, converges to $\sum_{j \in \Pi} [p_j^d - p_j^c]^2$, where $p_j^w$ is the class probability (two classes: $w = d$ for cases and $w = c$ for controls). This expression is directly related to the correct prediction rate corresponding to the partition $\Pi$. Thus searching for clusters with larger I-scores has the automatic effect of seeking clusters with stronger predictive ability, a very desirable property.

31 Example Using Breast Cancer Data
Case-control sporadic breast cancer data from NCI Cancer Genetic Markers of Susceptibility (CGEMS).
2287 postmenopausal women: 1145 cases and 1142 controls.
18 genes with 304 SNPs selected from the literature:
Gene        Locus        SNPs
CASP8       2q33-q34     12
SLC22A18    11p15.5      16
TP53        17p13.1      6
BARD1       2q34-q35     27
ATM         11q22-q23    -
BRCA1       17q21        13
PIK3CA      3q26.3       8
KRAS2       12p12.1      28
PHB         -            10
ESR1        6q25         78
BRCA2       13q12.3      31
BRIP1       17q22-q24    19
RB1CC1      8q11         9
RAD51       15q15.1      4
PPM1D       17q23.2      2
TSG101      11p15        11
PALB2       16p12.1      7
CHEK2       22q12.1      -

32 Under the null estimated by permutations
P-values of the observed marginal effects, with the null distribution estimated by permutations.
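A generic permutation p-value can be sketched as below. The statistic used here (absolute difference in mean response between the two values of a binary variable) is only a stand-in for a marginal effect and is an assumption, as are the function names, the number of permutations, and the toy data.

```python
import random

def permutation_pvalue(statistic, x, y, n_perm=199, seed=0):
    """Share of label permutations whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # permuting Y breaks any association with X
        if statistic(x, y_perm) >= observed:
            hits += 1
    # Count the observed labelling itself so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)

def marginal_effect(x, y):
    """Absolute difference in mean response between x == 1 and x == 0."""
    ones = [y_i for x_i, y_i in zip(x, y) if x_i == 1]
    zeros = [y_i for x_i, y_i in zip(x, y) if x_i == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))

# Toy data with a perfect marginal association between x and y.
x = [0] * 20 + [1] * 20
y = list(x)
p_value = permutation_pvalue(marginal_effect, x, y)
print(p_value)
```

Any other score, including one computed on a group of variables, can be passed in place of `marginal_effect` as long as it takes the same two arguments.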

33 Quantile-Ratio Method
Results
Columns: Gene Pair; Mean-Ratio Method (Curve p-value, Rank p-value); Quantile-Ratio Method (Curve p-value, Rank p-value)
1 ESR1 – BRCA1 0.017 ≤ 0.001 0.013 0.001
2 BRCA1 – PHB 0.026 0.040 0.029 0.073
3 KRAS2 – BRCA1 0.002 0.006 0.004
4 SLC22A18 – BRCA1 0.032 0.072 0.019 0.079
5 RAD51 – BRCA1 0.052 0.090 0.005
6 RB1CC1 – SLC22A18 0.024
ESR1 – SLC22A18 0.033 0.016
7 CASP8 – KRAS2 0.043 0.038 0.009 0.008
8 CASP8 – SLC22A18 0.042 0.048 0.036
9 PIK3CA – BRCA1 0.030 0.021 0.012
10 PIK3CA – ESR1 0.047 0.014 0.049
11 PIK3CA – RB1CC1 0.051
12 PIK3CA – SLC22A18 0.025 0.044 0.053
13 BRCA1 – CHEK2 0.031
CASP8 – PIK3CA 0.007
14 BARD1 – BRCA1 0.057 0.022
15 BARD1 – ESR1 0.003 0.015
16 BARD1 – TP53 17 0.010
18 BARD1 – SLC22A18 0.056 0.063
CASP8 – ESR1 0.071 0.066
BARD1 – KRAS2 0.055
ESR1 – KRAS2 0.145 0.103
ESR1 – PPM1D 0.252 0.348

34 Two-way Interaction Networks
Pair-wise network based on 16 pairs of genes identified by the Mean-ratio method.
Pair-wise network based on 18 pairs of genes identified by the Quantile-ratio method.

35 Three-way Interaction Networks
3-way interaction network based on 10 genes identified by the Mean-ratio method.
3-way interaction network based on 8 genes identified by the Quantile-ratio method.

36 Pairwise Interaction
(M, R)-plane: observed data and permutation quantiles. 1-ESR1 BRCA1, 2-BRCA1 PHB, 3-KRAS2 BRCA1, 4-SLC22A18 BRCA1, 5-RAD51 BRCA1, 6-RB1CC1 SLC22A18, 7-CASP8 KRAS2, 8-CASP8 SLC22A18, 9-PIK3CA BRCA1, 10-PIK3CA ESR1, 11-PIK3CA RB1CC1, 12-PIK3CA SLC22A18, 13-BRCA1 CHEK2, 14-BARD1 BRCA1, 15-BARD1 ESR1, 16-BARD1 TP53.
(M, Q)-plane: observed data and permutation quantiles. 1-ESR1 BRCA1, 2-BRCA1 PHB, 3-KRAS2 BRCA1, 4-SLC22A18 BRCA1, 5-RAD51 BRCA1, 6-ESR1 SLC22A18, 7-RB1CC1 SLC22A18, 8-CASP8 KRAS2, 9-CASP8 SLC22A18, 10-PIK3CA BRCA1, 11-PIK3CA ESR1, 12-PIK3CA RB1CC1, 13-CASP8 PIK3CA, 14-BRCA1 CHEK2, 15-BARD1 BRCA1, 16-BARD1 ESR1, 17-BARD1 TP53, 18-BARD1 SLC22A18.

37 Remarks
One limitation of marginal approaches is that only a fraction of the information in the data is used.
The proposed approach aims to draw on more of the relevant information, improving prediction.
Additional scientific findings are likely if data already collected are suitably reanalyzed.
The proposed approach is particularly useful when a large number of dense markers becomes available.
Information about gene-gene interactions and their disease networks can be derived and constructed.

38 Collaborators
Herman Chernoff, Tian Zheng, Iulian Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Kjell Doksum

39 Key References
Lo SH, Zheng T (2002) Backward haplotype transmission association (BHTA) algorithm: a fast multiple-marker screening method. Human Heredity 53(4):
Lo SH, Zheng T (2004) A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. PNAS 101(28):
Lo SH, Chernoff H, Cong L, Ding Y, Zheng T (2008) Discovering interactions among BRCA1 and other candidate genes involved in sporadic breast cancer. PNAS 105:
Chernoff H, Lo SH, Zheng T (2009) Discovering influential variables: a method of partitions. Annals of Applied Statistics 3(4):
Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21):
