Mining Association Rules from Microarray Gene Expression Data.

Mining Association Rules from Microarray Gene Expression Data

2 Association Rules Form: LHS => RHS, where LHS and RHS are disjoint itemsets. Frequent itemset: support T (I) ≥ α, where support T (I) is the number of transactions in T that contains all the items in I and α is the minimum support.

3 Objective Measurements of Rule Interestingness Support of the rule LHS => RHS: Support(LHS U RHS), frequency that LHS and RHS occur together in a transaction. Confidence of the rule LHS => RHS: Support (LHS U RHS)/Support (LHS), frequency that RHS occurs when LHS occurs.

4 Association Rule Discovery (ARD) algorithms Originally used in market basket analysis Unsupervised and domain independent Example: Apriori algorithm (Agrawal, 1993) Advantage: Finds all associations Disadvantage: Very large number of associations

5 Microarray technology Measurement of gene expression levels in cells Simultaneous measurement of thousands of genes Facilitates the study of gene interactions Expression profile=set of values for different conditions –One condition = one slide –condition: tissue, treatment or time point Data matrices

6 Rules Applied to Gene Expression Each gene expression experiment is a single transaction and each gene is an item. –Gene value may be numerical, may need to be binned as being up (expressed), down (repressed), or neither. Items in a gene expression transaction can also include relevant facts describing the cellular environment. Find frequent itemsets: apply Apriori algorithm. Generate association rules from the frequent itemsets.

7 Example - association rules Database T1:A  B  C  D  E  T2:A  B  C  D  E  T3:A  B  C  D  E  T4:A  B  C  D  E  T5:A  B  C  D  E  Rules R1:A   B   C   E  (sup=0.40, conf=0.67) R2:A   B   D  (sup=0.40, conf=1.00) R3:A   B  (sup=0.60, conf=1.00)

8 Rule induction algorithms G1 G2 G3 G4 Class S1     Cancer S2     Non- Cancer... Sn     Non- Cancer R1 : G1   G3   Cancer R2 : G1   G4   Non-Cancer R3: G1   G3   G2   Cancer G1 is Gene 1, S1 is sample 1,  means overexpressed, and  means underexpressed. Rule set Association rule discovery algorithm *

9 Forming Association rules Any frequent itemset of size greater than one can be divided into two itemsets, LHS and RHS. Using objective measures: If the confidence of a candidate rule exceeds a speficied minimum confidence criterion, the rule may be included. A very large set of rules is usually generated.

10 Example 28 treatments were recorded on 28 microarray chips– 28 transactions. Each chip contains expression levels of approximately 6200 genes (items). After Apriori algorithm, 70,000,000 rules were generated.

11 Subjective Measures for Selecting Rules Select genes before finding frequent items – limit the number of items. Limit the size of LHS or RHS E.g., LHS contains only one item. Domain specific measures -- Use knowledge about the domain under study to specify patterns.

12 Rule Filtering and Group (Tuzhilin and Adomavicius 2002) Rule templates – specify restrictions on the combinations of genes and their expression levels that can appear in the body and head of the rule. RulePart HAS Quantifier OF C 1, C 2,..., C N [ONLY] –RulePart: BODY, HEAD, or RULE. –C 1, C 2,..., C N is a comparison set, representing a list of genes against which the discovered rules will be compared: A gene, e.g., G17 A gene with a particular expression level, e.g., G17↑ A group (category) of genes, e.g., [DNA_Repair] A group of genes with an expression level, e.g., [DNA_Repair] A group of genes with a list of allowable or unallowable expression levels, e.g., [DNA_Repair] = {↑, #}

13 Rule Filtering con’t Quantifier: a keyword or an expression specifying how many genes specified by C 1, C 2,..., C N List have to be contained in RulePart. –ALL, ANY, NONE, specifying the number of genes from C 1, C 2, …, C N the RulePart must have –A numeric value; e.g., 2, specifying a rule must have exactly 2 genes from the comparison set –A range of numeric values; e.g., 1,3,5-7 ONLY is used to indicate RulePart can have only the genes in the C 1, C 2, …, C N list.

14 Rule Filtering con’t All rules that contain at least one of the following genes: G 1, G 5, G 7 : –RULE HAS (ANY) of G 1, G 5, G 7, –Matching rules: G 1 ↑ => G 3 ↑. "When genes involved in the DNA repair are upregulated, what other gene categories are also up- or downregulated?" –BODY HAS (ANY) OF [DNA_Repair]  AND HEAD HAS (ANY) OF [All_Genes]={ ,  }

15 Macro Templates (Tuzhilin and Adomavicius 2002) Detect unexpected rules : CONTRADICT (GeneExprSet, G, ExpLevel)= BODY HAS (ALL) of GeneExprSet AND HEAD HAS (ALL) OF G ≠ {ExpLevel} –CONTRADICT({G1 , G2  },G4,  } –Unexpected rule: G1  G2  G3  G4 

16 Rule Grouping Group similar rules together into classes to be analyzed Gene hierarchy: group genes based on their functions. ALL F1F2 G1 G2G3G4G5

17 Rule Grouping con’t Aggregated rules –Groups: F1={G1,G2,G3}, F2={G4,G5} –Rules: R1={G1  G4  }, R2={G1  G5  }, R3={G1  G3  G5  } –Aggregated rule: R=F1  F2, R'=F1  F2  –{R1,R2,R3}  R, {R1,R2}  R', R3  R'

18 Rule Derivation Procedure Feature selection Rule induction Rule selection Rule validation Database of transactions Reduced database Initial rule set Relevant rule set Rules representing knowledge Some features may be irrelevant or a reduction may be required for efficiency reasons A rule induction algorithm is applied to the database (e.g. association rule algorithm or decision tree algorithm) Rules are assessed and ranked using different measures of interestingness The rules need to be validated to be accepted as knowledge. Can be done by more detailed biological experiments in the context of gene expression data

Mining Association Rules from Clinical Data

20 Association Rules in Medical Data Medical record data –Millions of “claims” for medical procedures –Each patient may have several claims –Each claim may have several line items or records, one for each procedure performed. –A diagnosis was reported with each procedure. Data: patient code, procedure code, diagnosis code Goal: discover relationships between procedure performed on a patient and the reported diagnosis.

21 Con’t Data items: the set of all procedure and diagnosis codes (7,365 + 9,383=16,748) Transaction: the set of procedure and diagnosis codes for each patient (1,257,645 patients) –May cause unexpected rule: {cast} => {heart disease} Each itemset consists of one or more procedure/diagnosis codes. The support of an itemset I is the number of patients whose set of items include all the items in I (> 1%).

22 Formulating Rules Applying Aprior algorithm to generate all frequent itemsets. Restricting to those frequent itemsets which contain both procedures and diagnoses For each selected frequent itemset, one rule is formulated with all procedure codes on the left and all diagnosis codes on the right. Computing confidence and eliminating rules with confidence less than 65%.

Mining Association Rules from Microarray Gene Expression Data.

Similar presentations

Presentation on theme: "Mining Association Rules from Microarray Gene Expression Data."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Mining Association Rules from Microarray Gene Expression Data.

Similar presentations

Presentation on theme: "Mining Association Rules from Microarray Gene Expression Data."— Presentation transcript:

Similar presentations

About project

Feedback