Association Analysis Techniques for Bioinformatics Problems

Association Analysis Techniques for Bioinformatics Problems
Vipin Kumar University of Minnesota Group members: Gaurav Pandey, Gowtham Atluri, Gang Fang, Rohit Gupta, Michael Steinbach BICoB April 8th, 2009

Data Mining for Biomedical Informatics
Recent technological advances are helping to generate large amounts of genomic data Gene and protein sequences Gene-expression data Biological networks and phylogenetic profiles Single Nucleotides Polymorphisms (SNPs) Data mining offers potential solution for analysis of large-scale data Prediction of the functions of anonymous genes Identification and summarizarion of functional modules Associations between genotypes and phenotype of interest Protein Interaction Network 2

Association Analysis Set-Based Representation of Data Association analysis: Analyzes relationships among items (attributes) in a binary transaction data Example data: market basket data Applications in business and science Marketing and Sales Promotion Identification of functional modules from protein complexes Noise removal from protein interaction data Two types of patterns Itemsets: Collection of items Example: {Milk, Diaper} Association Rules: X  Y, where X and Y are itemsets. Example: Milk  Diaper

Association Analysis Process of finding interesting patterns:
Find frequent itemsets using a support threshold Find association rules for frequent itemsets Sort association rules according to confidence Support filtering is necessary To eliminate spurious patterns To avoid exponential search Support has anti-monotone property: X  Y implies (Y) ≤ (X) Confidence is used because of its interpretation as conditional probability Has well-known limitations More advanced measures: H-confidence Given d items, there are 2d possible candidate itemsets

Different Association patterns
Traditional frequent pattern Hyperclique pattern Error-tolerant frequent pattern Discriminative pattern Case Controls

Association Analysis for Identification of Functional Modules from Protein Complex Data
A protein complex is a group of two or more associated proteins formed by protein-protein interaction that is stable over time Gavin et al (2002)’s data set contains 232 complexes covering 1361 proteins Can be represented as a market-basket matrix Hypercliques derived from this data can be used to discover frequently occurring groups of proteins in several complexes Likely to constitute functional modules Complexes Proteins c1 p1, p2 c2 p1, p3, p4, p5 c3 p2, p3, p4, p6 Xiong et al, “Identification of Functional Modules in Protein Complexes via Hyperclique Pattern Discovery”, Proc. Pac. Symp. Biocomput. , pp , 2005. 6

Function Coherence of Hypercliques
GO enrichment of {PRE2 PRE4 PRE5 PRE6 PRE8 PRE9 PUP3 SCL1}.

Association Analysis-based Pre-processing of Protein Interaction Networks
Protein interaction data faces significant data quality challenges Significant noise and incomplete [Hart et al, 2006] Adverse impact on accuracy of functional inferences [Deng et al, 2003] Approach: Transform graph G=(V,E,W) into G’=(V,E’,W’) Tries to meet three objectives: Addition of potentially biologically valid edges Removal of potentially noisy edges Assignment of weights to the resultant set of edges that indicate their reliability Transformed PPI graph where Pi and Pj are connected if (Pi,Pj) is a hyperclique pattern Input PPI graph Pandey et al, “Association Analysis-based Transformations for Protein Interaction Networks: A Function Prediction Case Study”, ACM SIGKDD 2007, pp (Also selected for a Highlight talk at ISMB 2008). 8

Results on several S. cerevisiae interaction data sets
Accuracy of top-k predictions Combined from 3 interaction data sets (initial edge reliabilities from expression data). DIPCore (Highly reliable subset of DIP database). Results on simulated data sets show that this transformation is able to reduce the effect of noise on function prediction significantly.

An Association Analysis Approach for Biclustering
Bicluster: Group of genes that show coherent expression across a subset of conditions in a gene expression data set. Shown to be more effective than hierarchical clustering for finding functional modules [Prelic et al, 2006] Four general classes of biclusters studied [Madeira et al, 2004] Hystad et. al. 2007 Constant value biclusters Constant row biclusters Constant evolution

Biclustering Algorithms and Associated Issues
Several biclustering algorithms have been proposed Cheng & Church (2000) (CC): First approach for biclustering microarray data ISA [Bergmann et al, 2003] SAMBA [Tanay et al, 2004], Co-clustering [Dhillon et al, 2003] xMotifs [Murali & Kasif, 2003], OPSM [Ben-Dor et al, 2003] Typically these approaches work in one of the two ways: Start with all rows and columns and work in a top-down fashion by eliminating rows and columns based on some greedy strategy (CC). Start with a random seed of rows and columns and iteratively zero in on the final set of genes and conditions (ISA). Face several challenges: Cannot find biclusters exhaustively due to heuristic nature. Typically find large biclusters due to the objective function being achieved early and can miss small interesting biclusters.

Association Analysis for Biclustering
Association patterns are biclusters! Each pattern supported by a subset of transactions Can formulate a framework for directly extracting such patterns from real-valued data (microarray). Advantages: Exhaustive and efficient discovery of biclusters Disadvantages: Need to binarize/discretize the original real-valued data set for association mining [Becquet et al, 2002; Creighton et al, 2003; McIntosh et al, 2007], which causes a loss of information.

Range Support Patterns (RAP)
Need an anti-monotonic measure to compute the significance of patterns discovered directly from real-valued data. We propose Range Support: Sum of RS over all conditions. Constraints imposed: Consistency of expression values Same direction of expression These conditions satisfied over substantial number of conditions Range support is anti-monotonic! Can be used within an Apriori-like framework [Agrawal et al 1993; 1994] to find patterns. Implementation available at Pandey et al, “Association Analysis for Real-valued Data: Definitions and Application to Microarray Data”, TR , Deptt of Comp Sc & Engg, University of Minnesota.

Experimental Evaluation
Patterns derived at various thresholds from Hughes et al (2000)’s microarray compendium. Compared with ISA and CC biclusters. RAP3: RangeSupport=8, α=0.5; RAP5: RangeSupport=12, α=1. Evaluation using bicluster coherence and GO annotations Coherence measured using Mean Squared Error (Cheng & Church, 2000). Functional enrichment computed for each set of biclusters using GOTermFinder’s methodology (Boyle et al, 2004). Fraction of patterns enriched measured at various p-value thresholds. Classes separated into big ( members) and small (1-30 members). Evaluation restricted to 100 least overlapping patterns, to reduce the effect of redundancy (Prelic et al, 2006).

Bicluster coherence using MSE
Cheng and Church, 2000

Examples of biclusters found
CC1 bicluster with the lowest MSE ISA2 bicluster with the lowest MSE RAP3 bicluster with maximum number of genes RAP3 bicluster with highest MSE

GO enrichment results Functional enrichment for large classes ( members) Functional enrichment for small classes (1-30 members) Results from randomized patterns show that the results at the strictest p-value thresholds are the most statistically significant

GO enrichment results for several sets of small functional classes
Fraction of patterns (biclusters) enriched by several groups of small classes at p-value 1x10-5 Fraction of class covered by patterns (biclusters) among several groups of small classes at p-value 1x10-5

Complementarity of RAP and ISA Biclusters
{YAR010C,YBL005W-A,YBR012WA, YJR026W,YJR028W,YML040W,YMR046C,YMR051C} Enriched by RNA-mediated transposition class (GO: ; p-value=4.64×10−12). Not found or part of a large (138) ISA bicluster! {YDR158W,YER052C,YOR130C,YPR145W} Enriched by small classes GO: (6 members), GO: (3 members) and GO: (7 members). Embedded within large ISA biclusters of sizes 211 and 104. None of these classes were found to enrich these large biclusters! RAP patterns are complementary to ISA biclusters in terms of finding much smaller functionally enriched biclusters.

Summary RAP patterns are coherent, as measured by their MSE scores, among different sets of biclusters. RAP patterns are significantly more enriched by small and specific functional classes as compared to ISA and CC biclusters. Gene groups discovered by RAP are enriched for functions not covered by biclusters discovered by standard biclustering algorithms.

Discriminative Pattern Mining for Biomarker Discovery
Motivation: Genes (or SNPs) may NOT be predictive individually, but as a group they could be. Initially proposed by [Dong et al. 1999] and [Bay and Pazzani, 2001] Goal: Searching for group of genes that collectively show differential expression in a binarized case-control gene expression data set. The absolute difference of the relative supports of itemset α in A and B is defined as

Previous Work on Discriminative Pattern Mining
Two-step approaches Find all frequent patterns and then use class labels to select discriminative patterns. [Wang and Karypis 2005], [Cong et al 2005], [Deshpande et al. 2005], [Cheng et al. 2007], etc. Disadvantage: Not suitable for dense and high dimensional data Approaches that make limited use of class labels during pattern mining Loose upper bounds of measures of discriminative power are derived, e.g. DiffSup, Information Gain, etc. (Emerging Patterns [Dong et al. 1999], [Bay, Pazzani 2001] (CSET), [Cheng et al 2008], [Fan et al. 2008] etc.) Disadvantage: More scalable than 2-step approaches, but still not suitable for dense and high dimensional data, especially for discovering low-support patterns.

Our Focus Dense & high dimensional data, such as SNP and gene expression data. Discover interesting patterns with relatively low support that can not be discovered by traditional approaches. Traditional approaches (2-step & CSET) Our focus: Scalable discovery of low support patterns

SupMaxPair: a New Measure for Discriminative Pattern Mining
Make use of the class labels more effectively Low support patterns that can be pruned by most algorithms Uninteresting patterns that can NOT be pruned by existing algorithms Interesting low-support patterns that may NOT be discovered by existing algorithms High support patterns that can be discovered by most algorithms Fang et al., Discriminative Pattern Mining from Dense and High Dimensional Data, TR , CS Department, Univ of Minnesota, Software available at

Gene Expression Dataset
25000 genes in 295 breast cancer patients 78 patients developed metastasis within 5 years of surgery. 217 patients did not develop metastasis within 5 years of surgery 5981 genes are selected using the pre-processing methodologies suggested by [Vijver et al. NEJM 2001] Binarization Each column pertaining to the expression of a single gene is split into two binary columns (+0.2, -0.2) 11962 binary columns

Patterns Discovered by CSET and SupMaxPair
More low-support and interesting patterns are discovered by SupMaxPair beyond CSET Next, we will demonstrate these are statistically and biologically significant

Statistically significant
Permutation Test Part (a): 1000 highest p-value pattern generated from 1000 sets of randomized labels. Part (b): Top-300 p-value patterns discovered by SupMaxPair using the true class label. Statistically significant

Validation using Known Cancer Genes
A list of ~2400 human genes (Higgins et al, 2007) known to be involved in the induction, progression and suppression of various types of cancers 611 genes overlap with the dataset used in the study For each pattern, a precision is computed w.r.t. the 611 cancer genes 28 size-3 patterns have 100% precision (all 3 are cancer genes), and they are only discovered by SupMaxPair but not CSET. E.g., all genes in the pattern {BIRC5, LAD1, TK1} are known cancer genes. Biologically significant

Validation using Known Cancer Genes
For the genes covered by a set of patterns, a global precision and recall measure can be computed w.r.t the 611 cancer genes. CSET SupMaxPair can recall additional biologically significant cancer genes beyond CSET, by dicovering low-support patterns Varying p-value cutoffs SupMaxPair

Pattern-based Classification
CSET Patterns vs. (CSET + SupMaxPair) Patterns Patterns used as features for classification using SVM Data set split for training-testing (500 iterations) SupMaxPair can complement traditional algorithms (e.g. CSET), for better classification.

Summary On dense and high dimensional data:
SupMaxPair can complement traditional approaches by discovering statistically and biologically significant discriminative patterns with relatively low support. Can be useful for discovering gene group biomarkers for diseases. Fang et al., Discriminative Pattern Mining from Dense and High Dimensional Data, TR , CS Department, Univ of Minnesota, Software available at

University of Minnesota
Thank you! Group Members: Gaurav Pandey Gowtham Atluri Gang Fang Rohit Gupta Michael Steinbach Vipin Kumar University of Minnesota

Association Analysis Techniques for Bioinformatics Problems

Similar presentations

Presentation on theme: "Association Analysis Techniques for Bioinformatics Problems"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Association Analysis Techniques for Bioinformatics Problems

Similar presentations

Presentation on theme: "Association Analysis Techniques for Bioinformatics Problems"— Presentation transcript:

Similar presentations

About project

Feedback