
Published by Cathleen Doyle. Modified about 1 year ago.

1
Selection of Multiple SNPs in Case-Control Association Study Using a Discretized Network Flow Approach
Shantanu Dutt, Yang Dai, Huan Ren, Joel Fontanarosa
University of Illinois at Chicago

2
Outline
- Background: Genome Wide Association Study
- Problem Definition
- Previous Work
- Our Work: MIP Formulations
- Discretized Network Flow (DNF) Opt. Method
- DNF Solutions for k-SNP Selection w/ Clustering/Classification
- Experimental Results
- Conclusions

3
Genetic Association Studies
- Goal: Find markers of variation that reliably distinguish individuals with a disease from a healthy population.
- Single Nucleotide Polymorphisms (SNPs) are the simplest and most common form of variation in the human genome.
- Each chromosome has one of two alleles for each SNP.
- Possible genotypes = {0/0, 0/1, 1/1}
- Variations measured at specific SNP loci have been shown to be associated with numerous traits and diseases.
[Figure: chrom 1 and chrom 2 of Person 1, Person 2, and Person 3, with a SNP marked on each.]

4
Genetic Association Studies (contd)
[Diagram labels: Genomic Variation; Gene, Protein, or Cellular Alteration/Regulation; Altered Phenotype]
Altered Phenotype:
- Individual traits (e.g. height, hair color)
- Causal factors for disease
- Increased risk factors for complex disease
Images: PDB (www.rcsb.org); Robbins and Cotran, 7th Ed., 2005

5
Genetic Association Studies (contd)
- Complex traits cannot be mapped to a single genetic locus.
- Multiple interacting genetic influences combine with environmental factors to produce an outcome.
[Figure: gene networks (A, B, ..., X) and environment leading to disease.]

6
Genetic Association Studies (contd)
- Genome Wide Association Study (GWAS): measure a large number of SNPs (typically 500K-1M) across the genome in a large case-control study (often >1000 patients).
- Results are commonly reported based on individual χ² values, ignoring potentially powerful interaction effects.
- It remains an open computational and statistical challenge to reliably analyze epistasis, or gene-gene interactions, in large-scale GWAS.
- Different genetic variations can lead to a common complex disease.
- Problem Definition: For a given set P of cases and Q of controls, classify the cases into different clusters and simultaneously select k significant marker SNPs for them (those that strongly distinguish these cases from the set Q).
- In this paper, we present a new optimization technique called discretized network flow (DNF) for the above problem.
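The per-SNP χ² reporting mentioned above can be illustrated with a minimal sketch. It assumes a 2x3 table of genotype counts (rows: cases and controls; columns: genotypes 0/0, 0/1, 1/1); the function name `chi_square` and the counts are hypothetical and are not taken from the slides.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / total
            if expected > 0:  # skip empty genotype columns
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical genotype counts for one SNP: columns are 0/0, 0/1, 1/1.
cases = [10, 20, 30]
controls = [30, 20, 10]
single_snp_stat = chi_square([cases, controls])
print(single_snp_stat)
```

A statistic like this scores one SNP at a time, which is why it cannot see the interaction effects that the problem definition targets.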

7
Examples of Epistasis Methods
- Combinatorial
  - MDR = multifactor dimensionality reduction
  - CSP = combinatorial search based prediction
  - CPM = combinatorial partitioning method
- Probabilistic
  - BEAM = Bayesian Epistasis Association Mapping: Bayesian partitioning model resolved by Markov Chain Monte Carlo (MCMC) methods
- megaSNPhunter
  - Hierarchical learning algorithm (regression trees)
  - Primarily considers local interaction effects
References:
- MDR: Ritchie et al., Gen Epid, 2003
- CSP: Brinza et al., WABI’06
- CPM: Nelson et al., Genome Research, 2001
- BEAM: Zhang and Liu, Nature Genetics, 2007
- megaSNPhunter: Wan et al., BMC Bioinformatics, 2009

8
MDR
1. Divide data into training and testing sets.
2. Select a set of N factors.
3. If (affected/unaffected) > T (e.g. T = 1.0): high risk; o/w low risk.
4. Select the model with the best misclassification error.
5-6. Estimate the model prediction error using the testing data set.
Repeat these steps for each cross-validation iteration and for each possible combination of factors.
Adapted from Ritchie et al., Gen Epid, 2003
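Steps 2 to 4 of the MDR outline above can be sketched as follows. This is an illustrative simplification, not the MDR software: it assumes genotypes are tuples of 0/1/2 values, it scores models on the training data only (the train/test split and cross-validation loop are omitted), and the names `mdr_error` and `best_model` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mdr_error(cases, controls, snps, threshold=1.0):
    """Label each multi-locus genotype cell high or low risk; return the misclassification rate."""
    case_cells = Counter(tuple(g[s] for s in snps) for g in cases)
    control_cells = Counter(tuple(g[s] for s in snps) for g in controls)
    errors = 0
    for cell in set(case_cells) | set(control_cells):
        affected, unaffected = case_cells[cell], control_cells[cell]
        # affected/unaffected > T, written without a division so empty cells are safe
        high_risk = affected > threshold * unaffected
        # high-risk cells misclassify their controls; low-risk cells misclassify their cases
        errors += unaffected if high_risk else affected
    return errors / (len(cases) + len(controls))

def best_model(cases, controls, n_snps, n_factors, threshold=1.0):
    """Exhaustively try every combination of n_factors SNPs and keep the lowest-error one."""
    return min(combinations(range(n_snps), n_factors),
               key=lambda snps: mdr_error(cases, controls, snps, threshold))

# Hypothetical toy data: only SNP 1 separates cases from controls.
cases = [(0, 1, 2), (0, 1, 1), (0, 1, 0)]
controls = [(0, 0, 2), (0, 0, 1), (0, 0, 0)]
print(best_model(cases, controls, n_snps=3, n_factors=1))
```

The exhaustive loop over combinations is what makes this family of methods expensive as the number of SNPs grows.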

9
CSP: Combinatorial Methods for Disease Association Search and Susceptibility Prediction
- Risk/resistance factor: multi-SNP combination (MSC)
- Problem: Find all MSCs significantly associated with the disease.
- Cluster C: subset of S with an MSC; S: the original SNP set
- d(C): # of diseased; h(C): # of non-diseased
- Combinatorial search
  - Definition: The disease-closure of a multi-SNP combination C is a multi-SNP combination C’ with the maximum number of SNPs that consists of the same set of disease individuals and a minimum number of non-disease individuals.
  - Searches only closed clusters: closure of cluster C = C’, where d(C’) = d(C) and h(C’) is minimized.
  - Avoids checking trivial MSCs.
  - Small d(C) implies not looking in subclusters.
  - Finds associated MSCs faster, but is still too slow.
- Tagging: compress the SNP set by extracting the most informative SNPs
  - Restore other SNPs from tag SNPs.
  - Multiple regression method for tagging
Brinza, D., Zelikovsky, WABI’06

10
Our Work: MIP Formulation
Notations:
- p_{i,j}(x) (0 ≤ j ≤ 2): = 1 if allele j is present on SNP i for individual x; = 0 otherwise.
- Marker m_{i,j}^{val} (val = 0, 1): m_{i,j}^{1} means presence of allele j in SNP i; m_{i,j}^{0} means absence of allele j in SNP i.
- Per-case benefit function of SNP i and allele j: b_{i,j}(x) [equation image not captured]; nc is the # of controls.
Claim: b_{i,j}(x) is consistent with the specificity provided by selecting marker m_{i,j}^{p_{i,j}(x)}.
- When p_{i,j}(x) = 1: b_{i,j}(x) is higher when a lower fraction of non-patients have p_{i,j} = 1 = p_{i,j}(x), i.e. when a higher fraction of non-patients have p_{i,j} = 0 ≠ p_{i,j}(x).
- When p_{i,j}(x) = 0: b_{i,j}(x) is higher when a higher fraction of non-patients have p_{i,j} = 1 ≠ p_{i,j}(x).

11
MIP Formulation
- Benefit-based case-pair similarity metric: [equation image not captured]. Its "otherwise" case indicates that m_{i,j}^{val} is not a common marker for patients x and y.
- MIP formulation for selecting one marker set for all patients: [equation image not captured]
  - d(m_{i,j}^{val}) = 1 if marker m_{i,j}^{val} is selected; np is the # of patients/cases.
  - At most k markers will be selected.
- The similarity definition ensures that only common markers among patients will be selected.
- Linear MIP; it can be solved with commercial tools such as CPLEX/LINGO, but this is very time consuming.

12
MIP Formulation (contd)
- Issue 1: The genetic causes of a disease can differ between patient sets (e.g., w/ different ethnicity). Hence, selecting only one marker set is not appropriate (it artificially forces one marker set on the entire patient population).
- Solution: Simultaneously cluster patients and select different markers for different clusters.
  - b_x^g: whether x is in cluster g
  - d_g(m_{i,j}^{val}): whether marker m_{i,j}^{val} is selected for cluster g
  - At most G clusters will be generated.
- Cubic MIP!

13
MIP Formulation (contd)
- Issue 2: The sum of benefits is not consistent with the specificity of a set of markers.
- Essentially, the previous formulation will select the five common markers with the highest benefit. However, this is not optimal.
- Individually, markers 1 and 2 provide larger specificity than markers 3 and 4 (they mismatch more controls). However, the mismatch sets of markers 1 and 2 have a larger overlap.
- Selecting markers 3 and 4 as the marker set gives overall higher specificity.
[Figure: the control set with the mismatch sets of markers 1, 2, 3, and 4.]
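The overlap effect described on this slide can be reproduced with a small sketch. The control IDs and mismatch sets below are made up for illustration. A control counts as correctly rejected if it mismatches at least one selected marker, which is the reading of specificity assumed here.

```python
from itertools import combinations

# Hypothetical mismatch sets: marker -> IDs of the controls it mismatches (10 controls in total).
mismatch = {
    1: {1, 2, 3, 4, 5},
    2: {1, 2, 3, 4, 6},
    3: {1, 2, 7, 8},
    4: {3, 4, 9, 10},
}
n_controls = 10

def specificity(markers):
    """Fraction of controls that mismatch at least one marker in the selected set."""
    rejected = set().union(*(mismatch[m] for m in markers))
    return len(rejected) / n_controls

pairs = list(combinations(sorted(mismatch), 2))
# Selection by summed individual benefit: double-counts overlapping controls.
by_sum = max(pairs, key=lambda ms: sum(len(mismatch[m]) for m in ms))
# Selection by the size of the union: the quantity that specificity actually measures.
by_union = max(pairs, key=specificity)

print(by_sum, specificity(by_sum))
print(by_union, specificity(by_union))
```

In this toy instance the sum criterion picks markers 1 and 2, while the union criterion picks markers 3 and 4, mirroring the slide's example.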

14
MIP Formulation (contd)
- Adding accurate specificity terms to the obj. func. for each control z:
  - M_i(z): whether control z matches the marker set selected for cluster i; M_i(z) is the mod 2 addition (Boolean OR) of various 0/1 vars.
  - g_mis: objective function gain for mismatching a control.
- Final objective function: [equation image not captured]
- At least cubic MIP (if G <= 3)
- g_mis is determined so that specificity and sensitivity are given the same weight.
  - Average gain for a patient matching a marker set: 2 k b_avg α (np/G), where np is the number of patients and G is the number of groups.
  - g_mis = 2 k b_avg α (np/G) * np/nc

15
Discretized Network Flow (DNF)
- Standard min-cost network flow: find a min-cost way to send a certain amount of flow from the source node (S) to the sink node (T).
- Solves certain LP problems (continuous solns).
- Some discrete constraints have to be satisfied in order to solve discrete opt. problems like MIP.
- One such constraint: mutually exclusive arc set (MEA): at most one arc of a subset of arcs in this set can have flow on it.
[Figure: flow network from S to T with arcs labeled (capacity, cost): (2,0), (1,2), (1,1), (1,4), (2,0); an MEA set shown with an invalid flow and a valid flow, f=1.]
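A minimal sketch of the MEA condition, assuming a flow is given as a dictionary from arcs to flow amounts and each MEA set is a list of arcs. The helper name `mea_violations` and the tiny network are hypothetical; the sketch covers only the "at most one arc" form of the constraint.

```python
def mea_violations(flow, mea_sets):
    """Return, for every violated MEA set, the arcs in it that carry positive flow."""
    violations = []
    for arcs in mea_sets:
        used = [arc for arc in arcs if flow.get(arc, 0) > 0]
        if len(used) > 1:  # more than one arc of the set is in use
            violations.append(used)
    return violations

# Two parallel routes out of S form one MEA set.
mea_sets = [[("S", "a"), ("S", "b")]]

# A discrete routing uses a single arc of the set.
valid_flow = {("S", "a"): 1, ("a", "T"): 1}
# A continuous (LP-style) solution may split the unit of flow across both arcs.
invalid_flow = {("S", "a"): 0.5, ("S", "b"): 0.5, ("a", "T"): 0.5, ("b", "T"): 0.5}

print(mea_violations(valid_flow, mea_sets))
print(mea_violations(invalid_flow, mea_sets))
```

A plain min-cost flow solver has no notion of this check, which is why the deck goes on to encode it through extra arc costs.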

16
Discretized Network Flow (contd)
Satisfying MEA requirements:
- Add a flow-amount-independent cost C’ to each arc in the set: a constant cost C’ is incurred whenever there is flow on the arc.
- C’_inv ≥ C’_val + C’, where C’_inv is the total C’-related cost for an invalid flow and C’_val is the total C’-related cost for a valid flow.
[Figure: arc cost vs. flow f up to Cap(e), for the standard linear flow cost (f, c) and for the cost with C’ added; MEA sets.]

17
Discretized Network Flow (contd)
Determining C’:
- In the standard network flow graph:
  - Heuristically select a valid flow & determine its cost C_val.
  - Obtain a min-cost flow of cost C_inv^min w/o discretization constraints.
- Set C’ = C_val - C_inv^min + 1.
- Since C’_inv ≥ C’_val + C’, a valid flow is guaranteed to have a smaller cost than any invalid flow.
- Theorem [Ren et al., ICCAD’08]: A min-cost flow with C’-costs on MEA arcs ensures MEA satisfaction.
[Figure: flow costs on a cost axis. Without C’: C_inv, C_val, C_val^min. With C’: C_val^min + C’_val, C_val + C’_val, C_inv + C’_inv.]
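The rule C’ = C_val - C_inv^min + 1 can be checked on a toy instance. The sketch below assumes two parallel arcs from S to T that form one MEA set and a demand of 2 units; the arc data and helper names are hypothetical. It enumerates integer flow splits by brute force instead of running a real min-cost flow solver.

```python
# Hypothetical arcs, both in the same MEA set: name -> (capacity, per-unit cost).
arcs = {"e1": (2, 3), "e2": (1, 1)}
demand = 2

def feasible_splits():
    """All integer splits (f1, f2) of the demand that respect both capacities."""
    cap1, cap2 = arcs["e1"][0], arcs["e2"][0]
    return [(f1, demand - f1) for f1 in range(cap1 + 1) if 0 <= demand - f1 <= cap2]

def linear_cost(split):
    return split[0] * arcs["e1"][1] + split[1] * arcs["e2"][1]

def arcs_used(split):
    return sum(1 for f in split if f > 0)

def is_valid(split):
    return arcs_used(split) <= 1  # MEA: at most one arc of the set carries flow

splits = feasible_splits()
c_inv_min = min(linear_cost(s) for s in splits)               # min cost ignoring the MEA
c_val = min(linear_cost(s) for s in splits if is_valid(s))    # cost of one known valid flow
c_prime = c_val - c_inv_min + 1

def cost_with_c_prime(split):
    """Linear cost plus a fixed charge C' for every MEA arc that carries flow."""
    return linear_cost(split) + c_prime * arcs_used(split)

best = min(splits, key=cost_with_c_prime)
print(c_prime, best, is_valid(best))
```

Without the fixed charge the cheapest split uses both arcs; with it, the single-arc flow becomes the minimum, which is the behavior the theorem describes.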

18
Discretized Network Flow (contd)
- Discrete network flow has been applied to VLSI CAD problems [Ren et al., ICCAD’08], [Ren et al., IWLS’08], [Dutt et al., ICCAD’06].
- Good run time and scalability: 10x to 60x faster than CPLEX with similar quality.
- Example: determine optimal cell sizes in a circuit under an area constraint. Four sizes are available, so the number of 0/1 variables is about four times the number of cells considered.
[Figure: run time vs. the number of cells, from [Ren et al., IWLS’08].]

19
DNF Model for Single-Cluster Marker Selection
- Flow through a p_{i,j} node in P_x means d(m_{i,j}^{p_{i,j}(x)}) = 1.
- Pairwise connection between p_{i,j} nodes ensures the same marker set is selected for all P_x.
- The flow cost incurred for selecting a common marker between two patients is -s(x, y, m_{i,j}^{p_{i,j}(x)}).
[Figure labels: S and T; patient subnetworks P_1 ... P_m joined by a complete bipartite graph with meta arcs; flows f = np*k, f = 1, f = np. Inside a subnetwork: arcs from S with (cap, cost) = (np*k, 0) and (np, 0); SNP nodes S_1 ... S_N; allele nodes p_{1,1}, p_{1,2}, p_{1,3}, ..., p_{N,1}, ..., p_{N,3}; MEA sets, one noted "only k arcs can have flow"; arcs to T. Between P_x and P_y: (cap, cost) = (1, -s(x, y, p_{i,j}^{c_{i,j}(x)})) if c_{i,j}(x) = c_{i,j}(y); no connection otherwise.]

20
Marker Selection for Multiple Clusters
- Use multiple copies of the single-cluster network model.
- For G clusters, there will be G copies of the 2-level complete bipartite graph; not all G clusters may be formed.
- Example valid flow: puts patients {1,4} in cluster 1 and {3,2} in cluster 2.
- Invalid flows (Type 1 and Type 2):
  - Flow puts P1 in both cluster 1 and 2.
  - Flow thru P1 passes thru P2, which is not in the same cluster, incurring false costs.
- MEA prevents invalid flows.
[Figure: S, choice nodes, and for each of Cluster 1 and Cluster 2 a complete bipartite graph over two copies of P1-P4, with MEA sets, leading to T.]

21
Marker Selection for Multiple Clusters
- Issue: When G is large, the network flow graph becomes very complex.
- We use iterative bi-partitioning instead.
  - Much harder bi-part prob than standard bi-part; the bi-part criterion needs to be selected simultaneously w/ the bi-part!
  - Condition for stopping the bi-partitioning of a cluster: the spec+sens deteriorates.
  [Figure: bi-partitioning tree; clusters that meet the termination condition form the final solution.]
- Another run-time reduction technique: patient pre-clustering
  - Group patients before using DNF.
  - Greedy iterative grouping method: initially, each patient is a subgroup; each time, merge the two subgroups with the most common SNP-allele pairs.
  - Termination condition: patients in one group must have at least 70% of SNP-allele pairs in common.
  - Each group is taken as a “meta patient” in DNF.
  - Groups are opened up after DNF, and metrics are evaluated at the individual level.
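The greedy pre-clustering steps above can be sketched as follows. Several details are assumptions made for illustration: patients are genotype tuples, "SNP-allele pairs in common" is approximated by the fraction of SNP positions on which every member of a group has the same genotype, and the names `common_fraction` and `pre_cluster` are hypothetical.

```python
from itertools import combinations

def common_fraction(group, patients):
    """Fraction of SNP positions on which all patients in the group agree."""
    n_snps = len(patients[group[0]])
    shared = sum(1 for s in range(n_snps)
                 if len({patients[p][s] for p in group}) == 1)
    return shared / n_snps

def pre_cluster(patients, min_common=0.7):
    """Greedily merge the two subgroups that share the most, until the 70% rule would break."""
    groups = [[p] for p in range(len(patients))]  # initially, each patient is a subgroup
    while len(groups) > 1:
        a, b = max(combinations(range(len(groups)), 2),
                   key=lambda ab: common_fraction(groups[ab[0]] + groups[ab[1]], patients))
        merged = groups[a] + groups[b]
        if common_fraction(merged, patients) < min_common:
            break  # even the best merge violates the termination condition
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return groups

# Hypothetical patients with 10 SNPs each (genotypes coded 0, 1, 2).
patients = [
    (0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
    (2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
    (2, 2, 2, 2, 2, 2, 2, 2, 1, 1),
]
meta_patients = pre_cluster(patients)
print(meta_patients)
```

Each resulting group would then stand in for one "meta patient", which shrinks the flow graph before DNF is run.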

22
Chain Structure for Improving Specificity
- One chain structure for each control.
- Two subchains: mismatched (MM) chain and matched (M) chain.
- One injection arc to the M subchain from each cluster: A_1 ... A_g.
- Injection flow on arc A_i means z matches the selected marker set of cluster i (M_i(z) = 1).
- Any injection flow causes the MEA condition to force the chain flow into the M chain, and it never switches back. Hence, it incurs 0 cost.
- Chain flow stays on the MM chain if no injection arc has flow, and incurs a cost of -g_mis.
[Figure: chain structure for control z, from S to T, with arcs labeled (cap, cost). Injection arcs A_1, A_2, ..., A_g with (1,0) come from Cluster 1, Cluster 2, ..., Cluster g; the MM chain has cost = -g_mis, the M chain has cost = 0; MEA sets join the two subchains.]

23
Experimental Results
Data sets we use (grouped on the slide as Test 1 and Test 2):
- Crohn’s disease: 144 cases, 243 controls, and 103 SNPs
- Autoimmune disorder: 384 cases, 652 controls, and 108 SNPs
- Tick-borne encephalitis: 21 cases, 54 controls, and 41 SNPs
- Rheumatoid arthritis: 460 cases, 460 controls, and 2300 SNPs
- Lung cancer: 322 cases, 273 controls, and 141 SNPs
- Rheumatoid arthritis (large): 868 cases, 1194 controls, and 5000 SNPs
Machine configuration: 3G CPU, 1G mem, Windows machine.
Prediction scheme with multiple cluster marker sets:
[Figure: flowchart over Marker set 1 and Marker set 2 with Match and Mismatch branches, ending in "Predict as sick" or "Predict as healthy".]
- TP: correctly predicted as sick
- FP: falsely predicted as sick
- TN: correctly predicted as healthy
- FN: falsely predicted as healthy
- Sensitivity = TP/(FN+TP)
- Specificity = TN/(FP+TN)
- Accuracy = (TN+TP)/(FP+TN+FN+TP)
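The metrics above can be sketched in code together with one reading of the prediction scheme. The sketch assumes that an individual is predicted as sick when the genotype matches every marker of at least one cluster's marker set, and as healthy otherwise. Markers are simplified to required genotype values per SNP index, which differs from the allele presence/absence markers in the formulation. All names and data are hypothetical.

```python
def predict_sick(genotype, marker_sets):
    """True if the genotype matches all markers of at least one cluster's marker set."""
    return any(all(genotype[snp] == value for snp, value in markers.items())
               for markers in marker_sets)

def evaluate(cases, controls, marker_sets):
    tp = sum(predict_sick(g, marker_sets) for g in cases)      # correctly predicted as sick
    fn = len(cases) - tp                                       # falsely predicted as healthy
    fp = sum(predict_sick(g, marker_sets) for g in controls)   # falsely predicted as sick
    tn = len(controls) - fp                                    # correctly predicted as healthy
    return {
        "sensitivity": tp / (fn + tp),
        "specificity": tn / (fp + tn),
        "accuracy": (tn + tp) / (fp + tn + fn + tp),
    }

# Hypothetical marker sets for two clusters: {SNP index: required genotype}.
marker_sets = [{0: 1}, {1: 2}]
cases = [(1, 0), (0, 2), (0, 0)]
controls = [(0, 0), (0, 1), (1, 1), (2, 0)]
results = evaluate(cases, controls, marker_sets)
print(results)
```

Adding more cluster marker sets can only raise sensitivity under this rule, while specificity can only drop, which is why the two are balanced in the objective.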

24
Experimental Results
Five-fold cross validation. Comparisons to MDR:
[Chart: labels and values in transcript order: Sensitivity 87.6, 56.7, 81.9, 48.8, 88.4; Specificity 78.1. Annotations: "38% relatively", "79% relatively".]

# of clusters:
Data set       K=5    K=10
Autoimm.       12     16
Crohn.         12     16
Tick-borne     6      6
Lung cancer    14     16
Rheum          13     14

K=10 results for Rheum. (large, no comparisons available): sens: 85; spec: 80; accuracy: 82; 10 clusters; 21.5 h per training run.

25
Experimental Results
Comparisons to CSP [Brinza commun. 4/09; Brinza et al., WABI’06 ppt: http://www.cs.ucsd.edu/~dbrinza/cv/present/brinza_wabi06.ppt]
- Leave-one-out validation
- For DNF, 20 runs are performed with randomly chosen left-out individuals.
- CSP performs n runs for n individuals (cases+controls).
[Charts, values in transcript order:
- Specificity and Sensitivity: 85, 83.1, 96.6, 71.1 (annotations: "36% relatively", "2.4% relatively")
- Geometric mean of sens. and spec.: 90.6, 76.8 (annotation: "18% relatively")
- Run time (ksecs, per leave-out run): 3k, 24k (annotation: "8 times")]
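The geometric-mean figures can be sanity-checked with a short calculation. It assumes the four extracted values pair up as (85, 96.6) and (83.1, 71.1); this pairing is an inference, chosen because it reproduces the reported means of 90.6 and 76.8 to within rounding. The transcript does not say which value in each pair is sensitivity and which is specificity, and the geometric mean does not depend on that.

```python
import math

def geometric_mean(a, b):
    """Geometric mean of two percentages, e.g. sensitivity and specificity."""
    return math.sqrt(a * b)

# Assumed pairing of the extracted chart values (see the note above).
pairs = [(85.0, 96.6), (83.1, 71.1)]
means = [geometric_mean(a, b) for a, b in pairs]
relative_gain = means[0] / means[1] - 1  # relative improvement of the first pair over the second

print([round(m, 2) for m in means], round(relative_gain, 3))
```

Under this pairing the relative gain comes out near 18%, in line with the chart annotation.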

26
Experimental Results
Leave-one-out validation
[Chart: Accuracy: 90.8, 76.6 (annotation: "19% relatively").]

Average number of clusters:
Data set       Clusters
Autoimm.       18
Crohn.         16
Tick-borne     6
Lung cancer    17
Rheum          14

27
Experimental Results
Comparing to LINGO (<= 20% from optimal setting)
- The same MIP formulation is solved by LINGO, and we compare the MIP objective function value and run time with DNF.
- Comparisons are for 1 iteration of bi-partitioning and quad-partitioning (i.e. G=2,4).

Normalized values (DNF is 1):
Partitioning   Quality (larger is better)   Run time (smaller is better)
Bi-p           0.96                         15
Quad-p         0.95                         23

28
Experimental Results
- Run time vs. number of SNPs
  - The rheumatoid arthritis data set is used.
  - Randomly chosen 100, 200, 400, 800, 1600, and 2300 SNPs
- Run time vs. number of patients
  - The Crohn’s disease data set is used.
  - No patient pre-clustering.
  - Randomly chosen 30, 60, 90, 120, and 144 patients from the data set

29
Conclusions
- We proposed 0/1 non-linear MIP formulations to identify disease markers.
- We consider patient clustering to identify the most appropriate marker sets.
- The discretized network flow (DNF) method is used to efficiently solve the MIP formulations.
- A chain structure is used for improving specificity.
- Significant improvements compared to MDR and CSP
- Also much faster run times
- Can apply DNF to other computationally challenging bioinfo problems since:
  - DNF can efficiently & near-optimally solve polynomial and Boolean MIPs.
  - DNF can also efficiently & near-optimally solve other discrete optimization problems.

30
Appendix: Generating Injection Flow
- A_k and its complementary arc are coupled by a draining arc.
- First, a complementary injection flow is generated on the complementary arc of A_k; it is 1 if any mismatched marker for NP z is selected.
- Flow will be drained from A_k, and cause injection flow to the chain.
- If there is flow on A_k: flow towards A_k is shunted to the sink.
[Figure: two cases for cluster k ("If there is no flow on A_k" and "If there is flow on A_k"). Labels: S; M_{i,j}^{val} nodes that mismatch NP z; draining arc with (cap, cost) = (1, -inf); arcs labeled (1,0), (1,C’), and (2,C’); M chain; MM chain; arcs to T.]
