Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes Yang Xiang Department of Biomedical Informatics, The Ohio.


1 Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes Yang Xiang Department of Biomedical Informatics, The Ohio State University Homepage: Joint work with Philip R.O. Payne and Kun Huang To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics

2 Motivation: Netflix problem
The Netflix Problem: Given the current user ratings, how can we recommend movies to users?
Rating matrix (axes: Users, Movies; ? marks an unknown rating):
???????4??
?3????????
???2????4?
??5??3????
????4?????
??3???2???
???1??????
????????1?

3 Motivation: Matrix Completion 0s are unsampled entries; all other values are sampled entries (axes: Users, Movies). Can we recover matrices of this kind?

4 Matrix Completion Theory and methods
If the number m of sampled entries obeys m ≥ C n^1.2 r log n for some positive numerical constant C, then with very high probability, most n×n matrices of rank r can be perfectly recovered. [Candès et al. Exact Matrix Completion via Convex Optimization, Foundations of Computational Mathematics, 9(6)]
Matrix completion methods (http://perception.csl.uiuc.edu/matrix-rank/sample_code.html#MC):
– Singular Value Thresholding
– OptSpace
– Accelerated Proximal Gradient
– Subspace Evolution and Transfer
– GROUSE

5 Transactional Database, (0,1)-matrix, Bipartite graph
Transactional database:
Transaction  Items
1            Bread, Diaper, Eggs
2            Beer, Coke, Apples
3            Bread, Milk, Beer, Coke
4            Diaper, Eggs, Apples
5            Bread, Beer, Coke
The same data can be viewed as a (0,1)-matrix or as a bipartite graph between the transactions and the items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples.
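The three views on this slide can be sketched in a few lines of Python. The transactions are copied from the table; the names `TRANSACTIONS`, `ITEMS`, `to_matrix`, and `to_edges` are illustrative choices, not taken from the talk.

```python
# Transactions copied from the slide; column order follows the slide's item row.
TRANSACTIONS = {
    1: {"Bread", "Diaper", "Eggs"},
    2: {"Beer", "Coke", "Apples"},
    3: {"Bread", "Milk", "Beer", "Coke"},
    4: {"Diaper", "Eggs", "Apples"},
    5: {"Bread", "Beer", "Coke"},
}
ITEMS = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke", "Apples"]


def to_matrix(transactions, items):
    """One row per transaction (sorted by id), one column per item."""
    return [
        [1 if item in transactions[tid] else 0 for item in items]
        for tid in sorted(transactions)
    ]


def to_edges(transactions):
    """Bipartite view: one (transaction, item) edge per 1-entry."""
    return {(tid, item) for tid, bought in transactions.items() for item in bought}


matrix = to_matrix(TRANSACTIONS, ITEMS)
edges = to_edges(TRANSACTIONS)
for row in matrix:
    print(row)
```

Every 1-entry of the matrix is one edge of the bipartite graph, so the two views carry the same information.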

6 Question: Can a (0,1)-matrix be completed? Consider each transaction as a customer. What is each customer's attitude towards un-purchased items (i.e., 0 entries)? The sampling model used for matrix completion, in which a non-zero entry is sampled and a zero entry is unsampled, does not make good sense here.

7 Our proposal: (0,1)-matrix transformation An entry is evaluated by its supporting patterns (independent evidence). P is a supporting pattern for entry (i,j) if and only if P covers (i,j) and M(x,y)=1 for every entry (x,y) ∈ P\{(i,j)}. Since the value of (i,j) is not considered, the supporting patterns of an entry are independent of the entry's value.
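A minimal sketch of this definition, assuming that a pattern P is a rectangle given by a set of rows and a set of columns (the later slides describe patterns as bicliques, which suggests this reading). The function name `is_supporting_pattern` and the 0-based indexing are illustrative assumptions.

```python
def is_supporting_pattern(M, rows, cols, i, j):
    """True if rows x cols covers (i, j) and every other cell in it is 1.

    The value M[i][j] itself is never read, so the answer does not
    depend on whether the entry is currently 0 or 1.
    """
    if i not in rows or j not in cols:
        return False
    return all(M[x][y] == 1 for x in rows for y in cols if (x, y) != (i, j))


# (0,1)-matrix of the five example transactions
# (columns: Bread, Milk, Diaper, Beer, Eggs, Coke, Apples).
M = [
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0],
]

# Entry (0, 3): transaction 1 did not buy Beer.
# Transactions 1, 3, 5 with items Bread and Beer form a supporting pattern.
supported = is_supporting_pattern(M, {0, 2, 4}, {0, 3}, 0, 3)
# Transaction 2 did not buy Bread, so this rectangle is not one.
unsupported = is_supporting_pattern(M, {0, 1}, {0, 3}, 0, 3)
print(supported, unsupported)
```

Flipping `M[0][3]` to 1 leaves both answers unchanged, which is the independence property stated on the slide.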

8 Support Pattern Measurement used in this work Biomedical informatics question: How can M be efficiently transformed into F, defined above, such that F gives unbiased predictions of the unknown gene-phenotype relationships?

9 Find support patterns and calculate F(i,j) for one entry Find the supporting patterns for the magenta entry (4,d). [Figure: matrices with column labels a to j, then b, c, e, f, g] Find the maximum edge biclique: F(4,d)=6.

10 Maximal biclique and maximum edge biclique A biclique is maximal if it cannot be extended. A maximum edge biclique is a maximal biclique with the maximum number of edges. Listing all maximal bicliques is an NP-hard problem. Finding one maximum edge biclique is NP-hard too.

11 Solutions for listing all maximal bicliques
Association Rule Mining
– Frequent Itemset: an itemset whose support is no less than a minimum support (minsup) threshold. In the transaction example, with minsup=3, {beer} is a frequent itemset, and so is {beer, coke}.
– (Frequent) Closed Itemset: an itemset is closed if none of its immediate supersets has the same support as the itemset.
– Maximal Frequent Itemset: a frequent itemset is maximal frequent if none of its immediate supersets is frequent.
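The three definitions can be checked by brute force on the five example transactions. This is only an illustrative sketch: it enumerates every itemset, which is feasible for seven items but is not how a miner such as MAFIA works. All function names are assumptions made for this example.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Diaper", "Eggs"},
    {"Beer", "Coke", "Apples"},
    {"Bread", "Milk", "Beer", "Coke"},
    {"Diaper", "Eggs", "Apples"},
    {"Bread", "Beer", "Coke"},
]
ALL_ITEMS = sorted(set().union(*TRANSACTIONS))


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def immediate_supersets(itemset):
    """Itemsets obtained by adding exactly one more item."""
    return [itemset | {item} for item in ALL_ITEMS if item not in itemset]


def frequent_itemsets(minsup):
    found = set()
    for k in range(1, len(ALL_ITEMS) + 1):
        for combo in combinations(ALL_ITEMS, k):
            if support(frozenset(combo)) >= minsup:
                found.add(frozenset(combo))
    return found


def closed_frequent(minsup):
    return {s for s in frequent_itemsets(minsup)
            if all(support(sup) < support(s) for sup in immediate_supersets(s))}


def maximal_frequent(minsup):
    return {s for s in frequent_itemsets(minsup)
            if all(support(sup) < minsup for sup in immediate_supersets(s))}


frequent = frequent_itemsets(3)
closed = closed_frequent(3)
maximal = maximal_frequent(3)
print(sorted(sorted(s) for s in frequent))
```

With minsup=3, {Beer} is frequent but not closed, because its immediate superset {Beer, Coke} has the same support.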

12 Solutions for listing all maximal bicliques A closed itemset together with its supporting transaction set corresponds exactly to a maximal biclique in the corresponding bipartite graph. Frequent closed itemsets are used to approximate the closed itemsets. MAFIA: a tool for mining frequent itemsets, frequent closed itemsets, and maximal frequent itemsets.
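The correspondence can be illustrated on the example data with a brute-force enumeration, used here in place of a real closed itemset miner. For each group of transactions, the sketch takes the items they share and then collects every transaction containing those items; each resulting (transaction set, itemset) pair is a maximal biclique. The function name `maximal_bicliques` is hypothetical.

```python
from itertools import combinations

TRANSACTIONS = {
    1: frozenset({"Bread", "Diaper", "Eggs"}),
    2: frozenset({"Beer", "Coke", "Apples"}),
    3: frozenset({"Bread", "Milk", "Beer", "Coke"}),
    4: frozenset({"Diaper", "Eggs", "Apples"}),
    5: frozenset({"Bread", "Beer", "Coke"}),
}


def maximal_bicliques(transactions):
    """Return all (supporting transactions, closed itemset) pairs."""
    tids = sorted(transactions)
    found = set()
    for k in range(1, len(tids) + 1):
        for group in combinations(tids, k):
            # Items shared by every transaction in the group.
            items = frozenset.intersection(*(transactions[t] for t in group))
            if not items:
                continue
            # Every transaction that contains all of those items.
            cover = frozenset(t for t in tids if items <= transactions[t])
            found.add((cover, items))
    return found


bicliques = maximal_bicliques(TRANSACTIONS)
for cover, items in sorted(bicliques, key=lambda b: -len(b[0]) * len(b[1])):
    print(sorted(cover), sorted(items), len(cover) * len(items))
```

The loop visits every subset of transactions, so it only scales to toy inputs; lowering the support threshold of a frequent closed itemset miner is the practical route the slide describes.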

13 Solution Summary for one entry (i,j)
– Construct the submatrix corresponding to the entry (i,j).
– Use a frequent closed itemset mining tool to build the frequent closed itemsets (set the support threshold as low as the computer can handle).
– Build the supporting transactions for the frequent closed itemsets, which yields all the candidate maximal bicliques.
– Find the maximum edge biclique and get the F(i,j) value.
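The steps above can be sketched for a single entry under several assumptions. The submatrix is taken to be the rows (other than i) that have a 1 in column j and the columns (other than j) that have a 1 in row i, which follows from the supporting-pattern definition but is not spelled out in the transcript. Brute-force enumeration stands in for the closed itemset miner, and the returned edge count is not claimed to equal the talk's F(i,j), whose formula is not reproduced here. The name `largest_supporting_pattern` is hypothetical.

```python
from itertools import combinations


def largest_supporting_pattern(M, i, j):
    """Return (edges, rows, cols) for the maximum edge biclique in the
    submatrix of entry (i, j), extended with row i and column j."""
    # Step 1: submatrix rows must have a 1 in column j, columns a 1 in row i.
    rows = [x for x in range(len(M)) if x != i and M[x][j] == 1]
    cols = [y for y in range(len(M[0])) if y != j and M[i][y] == 1]
    best = (0, (i,), (j,))  # degenerate pattern: the entry alone
    # Steps 2-4: enumerate row sets, keep the columns they all share,
    # and remember the biclique with the most edges.
    for k in range(1, len(rows) + 1):
        for R in combinations(rows, k):
            C = tuple(y for y in cols if all(M[x][y] == 1 for x in R))
            edges = len(R) * len(C)
            if edges > best[0]:
                best = (edges, tuple(sorted((i,) + R)), tuple(sorted((j,) + C)))
    return best


# (0,1)-matrix of the five example transactions
# (columns: Bread, Milk, Diaper, Beer, Eggs, Coke, Apples).
M = [
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0],
]

# Entry (0, 3): does transaction 1 "support" Beer?
result = largest_supporting_pattern(M, 0, 3)
print(result)
```

For entry (0, 3) the pattern found is transactions 1, 3, 5 with items Bread and Beer: customers who share Bread with transaction 1 also bought Beer.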

14 How about all entries? The previous solution is for one entry. What about all entries in an m×n matrix? Simply repeating the previous calculation m×n times is not a wise choice.

15 IndEvi Algorithm in a Nutshell Assume the input is the set of maximal bicliques of the original (0,1)-matrix. Project each maximal biclique horizontally and vertically. Let C be the maximal biclique shown by the shaded area. Can you figure out how to calculate F_C(i,j) for an entry (i,j)? Each entry remembers the largest F_C(i,j) with respect to all Cs. Please refer to the paper for the details of the algorithm.

16 IndEviRe Algorithm: Independent Evidence Reconstruction The IndEvi algorithm ensures that an entry (i,j) remembers the largest F_C(i,j) value and the corresponding reference to C in the set of maximal bicliques. The IndEviRe algorithm reconstructs the supporting pattern from that reference and the value of (i,j).

17 Key theorem: unbiased predicting

18 Application in Prioritizing Human Disease Genes Transactional data: gene-to-phenotype (G2P) dataset from (10/03/2010) Closed itemset generator: MAFIA Platform: Linux, C++, STL Cross-validate Platform (10/04/2010): (GACOM)

19 Measurement: Fold Enrichment Intuitively, fold enrichment measures how well known disease genes are ranked among all genes.

20 Results Among all 34503 (=|E|) known gene-phenotype relations, 4598 (=|E|) have their gene ranked among the top % (=y%) of the 1807 candidate genes for that relation, achieving a (x/y= /0.1107) fold enrichment. [Plot axis: Rank Cutoff]
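The x/y ratio on this slide can be sketched as follows, assuming that x is the percentage of known relations whose gene falls within the cutoff and that the 0.1107 in the denominator is y expressed as a percentage. Both readings are assumptions, since the numerator and the cutoff did not survive in the transcript. The function name `fold_enrichment` is illustrative.

```python
def fold_enrichment(hits, total_relations, top_percent):
    """Ratio x / y, where x is the percentage of known relations ranked
    within the cutoff and y is the cutoff as a percentage of candidates."""
    x_percent = 100.0 * hits / total_relations
    return x_percent / top_percent


# Counts taken from the slide; 0.1107 is read as y (in percent), an assumption.
fe = fold_enrichment(hits=4598, total_relations=34503, top_percent=0.1107)
print(round(fe, 1))
```

Under these assumptions the ratio comes out at roughly 120. A value of 1 would mean the ranking is no better than picking candidates at random.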

21 Case Study: Colon Cancer

22 Case Study: Breast Cancer

23 Case Study: Osteoarthritis Supporting pattern (by IndEviRe) for TNXB: {COL3A1, COL5A1, COL5A2, TNXB}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, JOINT DISLOCATION, MITRAL VALVE PROLAPSE, SOFT SKIN, OSTEOARTHRITIS} Supporting pattern (by IndEviRe) for VWF: {COL3A1, COL5A1, COL5A2, TNXB, VWF}*{AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES,, MITRAL VALVE PROLAPSE, OSTEOARTHRITIS}

24 Conclusion
– The supporting patterns for an entry in a (0,1)-matrix are a good resource for knowledge inference.
– Frequent closed itemset mining provides a practical platform for solving our problems.
– Given the maximal bicliques as input, the IndEvi and IndEviRe algorithms can efficiently calculate the F score and reconstruct the evidence for any entry.
– The result for an entry is independent of its original value (0 or 1).
– Only one call of frequent closed itemset mining on the original matrix is necessary.
– Readers may revise the F function for different applications.
– The algorithm is simple to implement, and the result is easy to analyze.
– Our method has a wide range of applications. The study on human gene-phenotype data shows that our method is efficient and effective.

25 Thanks! Questions?

