Interaction-based Learning in Genomics Shaw-Hwa Lo, Tian Zheng & Herman Chernoff Columbia University Harvard University
Other Collaborators: Iulian Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Adeline Lo
Partition-Retention We have n observations on a dependent variable Y and many discrete-valued explanatory variables X_1, X_2, ..., X_S. We wish: 1) to identify those explanatory variables which influence Y; 2) to predict Y, based on the findings of 1). We assume that Y is influenced by a number of small groups of interacting variables (group sizes of about 1 to 8, depending on sample sizes and effects).
Marginal Effects: Causal and Observable 1. If X_i has an effect on Y, we expect Y to be correlated with X_i or some function of X_i. In that case X_i has a causal and observable marginal effect. 2. A variable X_i unrelated to (independent of) Y should be uncorrelated with Y except for random variation. But if S (the number of variables) is large and n is moderate, some of the explanatory variables not influencing Y may have a substantial correlation (a marginal observable effect) with Y. They are impostors. 3. A group of important interacting influential variables may or may not have marginal observable effects (MOE). Therefore, methods that rely on the presence of strong observable marginal effects are unlikely to succeed if the MOE are weak.
Ex. 1. X_1 and X_2 are independent with P(X_i = 1) = P(X_i = −1) = 1/2, Y = X_1 X_2, and E(Y | X_1) = E(Y | X_2) = 0. Y is uncorrelated with X_1 and X_2 although the pair determines Y. Ex. 2. Y = X_1 X_2, with P(X_i = 1) = 3/4 and P(X_i = −1) = 1/4. Here Y is correlated with X_1 and X_2, and the sample will clearly show marginal observable effects (which can be detected by a t-test). That is, the interaction of both X_1 and X_2 is needed to have an influence on Y. Conclusion: To detect interacting influential variables, it is desirable and sometimes necessary to consider interactive effects. Impostors may present observable marginal effects if S is large and n is moderate.
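The contrast between the two examples can be checked by exact enumeration. The sketch below computes the population correlation between Y = X_1 X_2 and X_1 for a given P(X_i = 1); the function name corr_y_x1 is illustrative and not from the slides.

```python
from itertools import product
from math import sqrt

def corr_y_x1(p_plus):
    """Exact correlation of Y = X1*X2 with X1 when X1, X2 are independent
    and take the value +1 with probability p_plus, -1 otherwise."""
    probs = {1: p_plus, -1: 1 - p_plus}
    e_y = e_x = e_yx = e_y2 = e_x2 = 0.0
    # Enumerate the four (x1, x2) outcomes and accumulate moments.
    for x1, x2 in product((1, -1), repeat=2):
        w = probs[x1] * probs[x2]
        y = x1 * x2
        e_y += w * y
        e_x += w * x1
        e_yx += w * y * x1
        e_y2 += w * y * y
        e_x2 += w * x1 * x1
    cov = e_yx - e_y * e_x
    return cov / sqrt((e_y2 - e_y ** 2) * (e_x2 - e_x ** 2))

print(corr_y_x1(0.5))   # Ex. 1: no marginal correlation
print(corr_y_x1(0.75))  # Ex. 2: about 0.447, i.e. 1/sqrt(5)
```

In Ex. 1 the correlation vanishes because E(X_2) = 0, while in Ex. 2 E(X_2) = 1/2 leaves a visible marginal effect.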
An ideal analytical tool should be able to: 1. handle an extremely large number of variables and their higher-order interactions. 2. detect “module effects”, referring to the phenomenon where a module C (a cluster of variables) holds predictive power but becomes useless in prediction if any variable is removed. 3. identify interaction effects: the effect of one variable on a response depends on the values of other variables. 4. detect and utilize nonlinear and non-additive effects.
A score with four features We need a sensible score that: 1. can be used to measure the influence of a group of variables. 2. supports an algorithm for removing noisy and non-informative variables as the dimension changes, meaning the score is measured on the same scale in different dimensions. 3. given a cluster of variables, can be used to test the significance of its influence. 4. ensures that a cluster with a high score (influential) automatically possesses predictive ability.
A Special Case of Influential Measure: Genotype-Trait Distortion In case-control studies, the score is computed from the counts of cases and controls in each genotype (partition element) and the total numbers of cases and controls under study. A SNP has 3 genotypes (aa, ab, bb).
A general form Let Y be the disease status (1 for cases and 0 for controls). Then, for a genotype partition Π, the score we just discussed can be naturally defined as:
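A minimal sketch of the two scores, under assumptions that are not taken from the slides: the genotype-trait distortion is assumed to be the sum over genotypes s of (n_{D,s}/n_D − n_{U,s}/n_U)^2, and the general score is assumed to be I = Σ_j n_j^2 (Ȳ_j − Ȳ)^2 over the partition elements j. The function names gtd and i_score and the genotype counts are hypothetical.

```python
from collections import defaultdict

def gtd(case_counts, control_counts):
    """Assumed GTD form: sum over genotypes of the squared difference
    between the case and control genotype proportions."""
    n_d = sum(case_counts.values())
    n_u = sum(control_counts.values())
    cells = set(case_counts) | set(control_counts)
    return sum(
        (case_counts.get(s, 0) / n_d - control_counts.get(s, 0) / n_u) ** 2
        for s in cells
    )

def i_score(cells, y):
    """Assumed general form: sum over partition elements j of
    n_j^2 * (mean of Y in j - overall mean of Y)^2.
    cells[i] is the partition element of observation i."""
    n = len(y)
    y_bar = sum(y) / n
    sums, counts = defaultdict(float), defaultdict(int)
    for c, yi in zip(cells, y):
        sums[c] += yi
        counts[c] += 1
    return sum(counts[c] ** 2 * (sums[c] / counts[c] - y_bar) ** 2 for c in counts)

# Hypothetical single-SNP counts, for illustration only.
cases = {"aa": 30, "ab": 50, "bb": 20}
controls = {"aa": 20, "ab": 50, "bb": 30}

# Expand the counts into one (genotype, status) pair per subject.
cells, y = [], []
for g in cases:
    cells += [g] * (cases[g] + controls[g])
    y += [1] * cases[g] + [0] * controls[g]

print(gtd(cases, controls), i_score(cells, y))
```

Under these assumed forms and a binary Y, the two scores differ only by the constant factor (n_D n_U / n)^2, which is one way to read the statement that the general score extends the case-control one.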
General Setting The main idea applies much more generally than to special genetic problems. A more general version is proposed to deal with the problem of detecting which of many potentially influential variables X_s have an effect on a dependent variable Y, using a sample of n observations on Z = (X, Y), where X = (X_1, X_2, ..., X_S). In the background is the assumption that Y may be slightly or negligibly influenced by each of a few variables X_s, but may be profoundly influenced by the confluence of appropriate values within one or a few small groups of these variables. At this stage the object is not to measure the overall effects of the influential variables, but to discover them efficiently.
Example We introduce the partition-retention approach and related terminology and issues by considering a small artificial example. Suppose that an observed variable Y is normally distributed with mean X_1 X_2 and variance 1, where X_1 and X_2 are two of S = 6 observed and potentially influential variables which can take on the values 0 and 1. Given the data on Y and X = (X_1, ..., X_6) for n = 200 subjects, the statistician, who does not know this model, desires to infer which of the six explanatory variables are causally related to Y. In our computation the X_i are selected independently to be 1 with probabilities 0.7, 0.7, 0.5, 0.5, 0.5, 0.5.
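The artificial example can be simulated in a few lines. This is a sketch only: the seed, the helper names simulate and cell_means, and the choice to summarize the data by cell means are assumptions, not part of the slides. Variables are indexed from 0, so columns 0 and 1 play the roles of X_1 and X_2.

```python
import random
from statistics import mean

def simulate(n=200, probs=(0.7, 0.7, 0.5, 0.5, 0.5, 0.5), seed=0):
    """Draw n subjects: six independent 0/1 variables and
    Y ~ Normal(mean = X_1 * X_2, variance = 1)."""
    rng = random.Random(seed)
    X = [[1 if rng.random() < p else 0 for p in probs] for _ in range(n)]
    y = [rng.gauss(x[0] * x[1], 1.0) for x in X]
    return X, y

def cell_means(X, y, cols):
    """Mean of Y within each cell of the partition defined by cols."""
    groups = {}
    for x, yi in zip(X, y):
        groups.setdefault(tuple(x[c] for c in cols), []).append(yi)
    return {cell: mean(vals) for cell, vals in sorted(groups.items())}

X, y = simulate()
print(cell_means(X, y, (0, 1)))  # partition by the two influential variables
print(cell_means(X, y, (2, 3)))  # partition by two non-influential variables
```

With the influential pair, the cell (1, 1) should stand apart from the other three cells; with a non-influential pair, the cell means should differ only by sampling noise.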
A capable analytical tool should be able to surmount the following difficulties: (a) handle an extremely large number of variables (SNPs and other variables in the hundreds of thousands or millions) in the data; (b) detect the so-called “module effect”, which refers to the phenomenon where removing one variable from the current module renders it useless in prediction; (c) identify interactions (often higher-order effects): the effect of one variable on a response variable depends on the values of other variables in the same module; (d) extract and utilize nonlinear effects (or non-additive effects).
Table 1. Classification error rates for the toy example.
Method       LDA  SVM  RF  LogicFS  LL  LASSO  Elastic net  Proposed
Train error  .14 ± .03  .00 ± .00  .13 ± .02  .23 ± .05  .27 ± .06  .21 ± .01
Test error   .47 ± .02  .50 ± .01  .44 ± .04  .34 ± .04  .45 ± .03  .48 ± .04  .24 ± .03
Diagrams of conventional approach and the variable-module enabled approach.
Basic tool: the Backward Dropping Algorithm (BDA). BDA is a “greedy” algorithm that seeks the variable subset that maximizes the I-score through stepwise elimination of variables from an initial subset (k variables) sampled from the variable space (p variables), with k << p.
Training Set: Consider a training set of n observations, where each observation pairs a response Y with X, a p-dimensional vector of discrete variables. Typically p is very large (thousands).
Sampling randomly from Variable Space: Select an initial subset of k explanatory variables, k << p. Compute the I-score based on the k variables.
Drop Variables: Tentatively drop each variable and recalculate the I-score with one variable less. Then permanently drop the variable that results in the highest I-score when tentatively dropped.
Return Set: Continue the next round of dropping on the reduced subset until only one variable is left. Keep the subset that yields the highest I-score in the whole dropping process. Refer to this subset as the return set.
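The dropping steps can be sketched as follows. The I-score form used here, Σ_j n_j^2 (Ȳ_j − Ȳ)^2 over the cells of the partition induced by the current subset, is an assumption, as are the function names and the toy data (a full factorial design with Y equal to the XOR of the first two variables). The initial subset is passed in explicitly; in practice it would be drawn at random from the p variables, for example with random.sample(range(p), k).

```python
from collections import defaultdict
from itertools import product

def i_score(X, y, subset):
    """Assumed I-score: sum over cells of n_j^2 * (cell mean - overall mean)^2,
    where cells are defined by the values of the variables in subset."""
    n = len(y)
    y_bar = sum(y) / n
    sums, counts = defaultdict(float), defaultdict(int)
    for row, yi in zip(X, y):
        cell = tuple(row[j] for j in subset)
        sums[cell] += yi
        counts[cell] += 1
    return sum(counts[c] ** 2 * (sums[c] / counts[c] - y_bar) ** 2 for c in counts)

def backward_dropping(X, y, initial_subset):
    """Greedy backward dropping: return the best-scoring subset seen."""
    current = tuple(initial_subset)
    best_set, best_score = current, i_score(X, y, current)
    while len(current) > 1:
        # Tentatively drop each variable and score the remaining subset.
        candidates = [tuple(v for v in current if v != d) for d in current]
        scores = [i_score(X, y, c) for c in candidates]
        # Permanently drop the variable whose removal gives the highest score.
        top = max(range(len(candidates)), key=scores.__getitem__)
        current = candidates[top]
        if scores[top] > best_score:
            best_set, best_score = current, scores[top]
    return best_set, best_score

# Toy data (illustrative): all 64 combinations of 6 binary variables,
# with Y determined by variables 0 and 1 only and no marginal effects.
X = [list(bits) for bits in product((0, 1), repeat=6)]
y = [row[0] ^ row[1] for row in X]

return_set, score = backward_dropping(X, y, range(6))
print(return_set, score)
```

On this toy design the score rises each time a noise variable is dropped and collapses once either influential variable is removed, so the return set is the interacting pair even though neither variable has a marginal effect.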
Figure 5. Change of I-Score
Structural diagram of proposed methodology.
Classification based on van’t Veer’s Data (2002). Applying the procedures described in the Discovery stage, we identified 18 influential modules with sizes ranging from 2 to 6. The purpose of the original study was to predict breast cancer relapse using gene expression data. The original data contain the expression levels of 24,187 genes for 97 patients: 46 relapse (distant metastasis within 5 years) and 51 non-relapse (no distant metastasis for at least 5 years). We used 4,918 genes for the classification task, a reduced set obtained by Tibshirani and Efron (2002). Of the 97 cases, 78 were used as the training set (34 relapse and 44 non-relapse) and 19 (12 relapse and 7 non-relapse) as the test set. The best error rate (biased or not) on this particular test set in the literature is around 10% (2 errors). The proposed method yields a zero error rate (no errors) on the test set.
The CV error rates on the van’t Veer data are typically around 30%. The proposed method yields an average error rate of 8% over 10 randomly selected CV test samples, about a 73% reduction in error rate ((30% − 8%) / 30% ≈ 73%) compared with existing methods. We ran the CV experiment by randomly partitioning the 97 patients into a training sample of size 87 and a test sample of size 10, and repeated the experiment ten times.
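The resampling scheme and the relative reduction can be sketched as below. Only the splitting and the arithmetic are shown; the classifier itself is not reproduced, and the function names and seed are assumptions.

```python
import random

def repeated_holdout_splits(n_total=97, n_test=10, n_repeats=10, seed=0):
    """Randomly partition indices 0..n_total-1 into train/test, n_repeats times."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_total))
        rng.shuffle(idx)
        test, train = sorted(idx[:n_test]), sorted(idx[n_test:])
        splits.append((train, test))
    return splits

def relative_reduction(baseline, new):
    """Relative reduction in error rate: (baseline - new) / baseline."""
    return (baseline - new) / baseline

splits = repeated_holdout_splits()
print(len(splits), len(splits[0][0]), len(splits[0][1]))
print(relative_reduction(0.30, 0.08))
```

Each repeat trains on 87 patients and tests on the remaining 10; the average of the ten test error rates is the figure compared against the typical 30%.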
Example Using Breast Cancer Data
Case-control sporadic breast cancer data from the NCI Cancer Genetic Markers of Susceptibility (CGEMS) study: 2287 postmenopausal women (1145 cases and 1142 controls); 18 genes with 304 SNPs selected from the literature:
Gene      Locus      SNPs
CASP8     2q33-q34   12
TP53      17p13.1    6
ATM       11q22-q23  12
PIK3CA    3q26.3     8
PHB       17q21      10
BRCA2     13q12.3    31
RB1CC1    8q11       9
PPM1D     17q23.2    2
PALB2     16p12.1    7
SLC22A18  11p15.5    16
BARD1     2q34-q35   27
BRCA1     17q21      13
KRAS2     12p12.1    28
ESR1      6q25       78
BRIP1     17q22-q24  19
RAD51     15q15.1    4
TSG101    11p15      11
CHEK2     22q12.1    11
P-values of the observed marginal effects, under the null distribution estimated by permutations.
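A permutation p-value of this kind can be sketched as follows. The marginal statistic used here (absolute difference in the mean of a variable between cases and controls) is an assumption chosen for illustration; the slides do not specify the statistic, and the function names are hypothetical.

```python
import random

def mean_diff(x, y):
    """Illustrative marginal statistic: |mean of x in cases - mean in controls|."""
    cases = [xi for xi, yi in zip(x, y) if yi == 1]
    controls = [xi for xi, yi in zip(x, y) if yi == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))

def permutation_p_value(x, y, stat=mean_diff, n_perm=1000, seed=0):
    """Estimate the null distribution by shuffling the case/control labels."""
    rng = random.Random(seed)
    observed = stat(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if stat(x, y_perm) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data: x perfectly aligned with case status.
x = [0] * 20 + [1] * 20
y = [0] * 20 + [1] * 20
print(permutation_p_value(x, y))
```

Shuffling the labels breaks any association between the variable and the outcome while preserving the numbers of cases and controls, so the shuffled statistics approximate the null distribution.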
Remarks One limitation of marginal approaches is that only a fraction of the information in the data is used; the proposed approach intends to draw on more of the relevant information; improving prediction; additional scientific findings are likely if data already collected are suitably reanalyzed; the proposed approach is particularly useful when a large number of dense markers becomes available; information about gene-gene interactions and their disease networks can be derived and constructed.
Collaborators Herman Chernoff, Tian Zheng, Iulian Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Kjell Doksum
Key References
Lo SH, Zheng T (2002) Backward haplotype transmission association (BHTA) algorithm: a fast multiple-marker screening method. Human Heredity 53(4): 197-215.
Lo SH, Zheng T (2004) A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. PNAS 101(28): 10386-10391.
Lo SH, Chernoff H, Cong L, Ding Y, Zheng T (2008) Discovering interactions among BRCA1 and other candidate genes involved in sporadic breast cancer. PNAS 105: 12387-12392.
Chernoff H, Lo SH, Zheng T (2009) Discovering influential variables: a method of partitions. Annals of Applied Statistics 3(4): 1335-1369.
Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21): 2834-2842.