Transactional Database Transformation and Its Application in Prioritizing Human Disease Genes
Yang Xiang
Department of Biomedical Informatics, The Ohio State University
Homepage: http://bmi.osu.edu/~yxiang
Joint work with Philip R.O. Payne and Kun Huang
To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics

Motivation: Netflix problem
The Netflix Problem: Given the current user ratings, how to recommend movies to users?
  ???????4??
  ?3????????
  ???2????4?
  ??5??3????
  ????4?????
  ??3???2???
  ???1??????
  ????????1?
Axes: Users, Movies

Motivation: Matrix Completion
0s are unsampled entries; all other values are sampled entries.
  0000000400
  0300000000
  0002000040
  0050030000
  0000400000
  0030002000
  0001000000
  0000000010
Axes: Users, Movies
Can we recover matrices of this kind?

Matrix Completion: Theory and methods
If the number m of sampled entries obeys $m \geq C\,n^{1.2}\,r\log n$ for some positive numerical constant C, then with very high probability, most n×n matrices of rank r can be perfectly recovered. [Candès and Recht, Exact Matrix Completion via Convex Optimization, Foundations of Computational Mathematics, 9(6), 717-772, 2009.]
Matrix completion methods (http://perception.csl.uiuc.edu/matrix-rank/sample_code.html#MC):
– Singular Value Thresholding
– OptSpace
– Accelerated Proximal Gradient
– Subspace Evolution and Transfer
– GROUSE

Transactional Database, (0,1)-matrix, Bipartite graph

Transactional database:
  Transaction  Items
  1            Bread, Diaper, Eggs
  2            Beer, Coke, Apples
  3            Bread, Milk, Beer, Coke
  4            Diaper, Eggs, Apples
  5            Bread, Beer, Coke

(0,1)-matrix:
     Bread  Milk  Diaper  Beer  Eggs  Coke  Apples
  1  1      0     1       0     1     0     0
  2  0      0     0       1     0     1     1
  3  1      1     0       1     0     1     0
  4  0      0     1       0     1     0     1
  5  1      0     0       1     0     1     0

Bipartite graph: transactions 1, 2, 3, 4, 5 on one side; items Bread, Milk, Diaper, Beer, Eggs, Coke, Apples on the other; one edge for every 1 in the matrix.
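
The three views above carry the same information. A minimal Python sketch of the conversion, using the slide's transactions; the helper names `to_matrix` and `to_edges` are illustrative and not part of the original work:

```python
# Transactions copied from the slide; the item order follows the slide's columns.
transactions = {
    1: {"Bread", "Diaper", "Eggs"},
    2: {"Beer", "Coke", "Apples"},
    3: {"Bread", "Milk", "Beer", "Coke"},
    4: {"Diaper", "Eggs", "Apples"},
    5: {"Bread", "Beer", "Coke"},
}
items = ["Bread", "Milk", "Diaper", "Beer", "Eggs", "Coke", "Apples"]


def to_matrix(transactions, items):
    """Return one 0/1 row per transaction, in ascending transaction-id order."""
    return [[1 if item in transactions[tid] else 0 for item in items]
            for tid in sorted(transactions)]


def to_edges(transactions):
    """Return the bipartite edge list as (transaction id, item) pairs."""
    return sorted((tid, item)
                  for tid, basket in transactions.items()
                  for item in basket)


matrix = to_matrix(transactions, items)
edges = to_edges(transactions)
for tid, row in zip(sorted(transactions), matrix):
    print(tid, "".join(map(str, row)))
```

Each printed row should match the corresponding row of the (0,1)-matrix on the slide, and the number of edges equals the number of 1s.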

Question: Can a (0,1)-matrix be completed?
     Bread  Milk  Diaper  Beer  Eggs  Coke  Apples
  1  1      0     1       0     1     0     0
  2  0      0     0       1     0     1     1
  3  1      1     0       1     0     1     0
  4  0      0     1       0     1     0     1
  5  1      0     0       1     0     1     0
Consider each transaction as a customer. What is each customer's attitude towards un-purchased items (i.e., 0 entries)?
The sampling model used for matrix completion does not make good sense here, i.e., treating every non-zero as a sampled entry and every zero as an unsampled entry.

Our proposal: (0,1)-matrix transformation
An entry is evaluated by its supporting patterns (independent evidence).
P is a supporting pattern for entry (i,j) if and only if P covers (i,j) and M(x,y)=1 for every entry (x,y) ∈ P\{(i,j)}.
Since the value of (i,j) is not considered for a supporting pattern, the supporting patterns of an entry are independent of the entry's value.
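
The definition can be checked mechanically. The sketch below assumes a pattern P is given as a set of rows and a set of columns (the form used in the case study's {genes} × {phenotypes} patterns) and uses the transaction matrix from the earlier slides with 0-based indices; `is_supporting_pattern` is a hypothetical helper name.

```python
# (0,1)-matrix of the transaction example: rows are transactions 1-5,
# columns are Bread, Milk, Diaper, Beer, Eggs, Coke, Apples (0-based below).
M = [
    [1, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0],
]


def is_supporting_pattern(M, rows, cols, i, j):
    """True if rows x cols covers (i, j) and every other covered entry is 1."""
    if i not in rows or j not in cols:
        return False
    return all(M[x][y] == 1
               for x in rows for y in cols
               if (x, y) != (i, j))


# Entry (transaction 2, Bread) is 0 in the matrix.
entry = (1, 0)
# Candidate pattern: transactions {2, 3, 5} x items {Bread, Beer, Coke}.
supported = is_supporting_pattern(M, {1, 2, 4}, {0, 3, 5}, *entry)

# Flipping the entry itself does not change the answer.
flipped = [row[:] for row in M]
flipped[1][0] = 1
still_supported = is_supporting_pattern(flipped, {1, 2, 4}, {0, 3, 5}, *entry)

# Transactions {1, 2} x items {Bread, Beer} fails: transaction 1 has no Beer.
not_a_pattern = is_supporting_pattern(M, {0, 1}, {0, 3}, *entry)
```

Because the entry itself is skipped in the check, the same pattern supports the entry whether its value is 0 or 1.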

Support Pattern Measurement used in this work Biomedical Informatics question: How to efficiently transform M into F defined above, such that F can unbiasedly predict the unknown gene-phenotype relationships?

Find support patterns and calculate F(i,j) for one entry

Original matrix (rows 1-8, columns a-j):
     a b c d e f g h i j
  1  1 0 1 0 0 1 1 0 0 1
  2  1 1 0 1 1 0 1 1 0 1
  3  0 0 1 1 0 1 0 0 1 1
  4  0 1 1 0 1 1 1 0 0 0
  5  1 0 1 1 0 0 0 1 1 0
  6  0 1 0 1 1 1 1 0 1 1
  7  1 1 0 0 1 0 1 1 1 0
  8  1 0 0 1 0 1 0 1 0 0

Find support patterns for the entry (4,d), highlighted in magenta on the slide.
Keep the rows other than 4 that have a 1 in column d (2, 3, 5, 6, 8) and the columns other than d that have a 1 in row 4 (b, c, e, f, g):
     b c e f g
  2  1 0 1 0 1
  3  0 1 0 1 0
  5  0 1 0 0 0
  6  1 0 1 1 1
  8  0 0 0 1 0

Find the maximum edge biclique of this submatrix: F(4,d)=6
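
The worked example can be reproduced by brute force. This sketch assumes, as the example suggests, that F(i,j) is the edge count of the maximum edge biclique in the submatrix for (i,j). It enumerates row subsets, which is exponential and only suitable for a matrix this small; the function names are illustrative.

```python
from itertools import combinations

# The 8 x 10 matrix from the slide; rows 1-8 and columns a-j become 0-based.
ROWS = ["1010011001", "1101101101", "0011010011", "0110111000",
        "1011000110", "0101111011", "1100101110", "1001010100"]
M = [[int(ch) for ch in row] for row in ROWS]


def submatrix_for(M, i, j):
    """Rows (other than i) with a 1 in column j; columns (other than j) with a 1 in row i."""
    rows = [r for r in range(len(M)) if r != i and M[r][j] == 1]
    cols = [c for c in range(len(M[0])) if c != j and M[i][c] == 1]
    return rows, cols


def max_edge_biclique(M, rows, cols):
    """Try every non-empty row subset; return (edge count, rows, common columns)."""
    best = (0, (), ())
    for k in range(1, len(rows) + 1):
        for subset in combinations(rows, k):
            common = tuple(c for c in cols if all(M[r][c] == 1 for r in subset))
            size = len(subset) * len(common)
            if size > best[0]:
                best = (size, subset, common)
    return best


i, j = 3, 3  # entry (4, d) in 0-based indices
rows, cols = submatrix_for(M, i, j)
size, best_rows, best_cols = max_edge_biclique(M, rows, cols)
print(size, [r + 1 for r in best_rows], ["abcdefghij"[c] for c in best_cols])
```

On this matrix the search settles on rows {2, 6} and columns {b, e, g}, so the supporting pattern is {2, 4, 6} × {b, d, e, g}.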

Maximal biclique and maximum edge biclique
A biclique is maximal if no vertex can be added to it while keeping it a biclique.
A maximum edge biclique is a maximal biclique with the maximum number of edges.
Listing all maximal bicliques is intractable in the worst case, because a bipartite graph can contain exponentially many of them.
Finding a single maximum edge biclique is NP-hard.

Solutions for listing all maximal bicliques
Association Rule Mining
– Frequent Itemset: an itemset whose support is no less than a minimum support (minsup) threshold. In the transaction example with minsup=3, {Beer} is a frequent itemset, and so is {Beer, Coke}.
– (Frequent) Closed Itemset: an itemset is closed if none of its immediate supersets has the same support as the itemset.
– Maximal Frequent Itemset: an itemset is maximal frequent if none of its immediate supersets is frequent.
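
The three definitions can be tried on the transaction example with minsup=3. This is a brute-force sketch for illustration only (real miners such as MAFIA use far more efficient search); all function names are hypothetical.

```python
from itertools import combinations

transactions = {
    1: {"Bread", "Diaper", "Eggs"},
    2: {"Beer", "Coke", "Apples"},
    3: {"Bread", "Milk", "Beer", "Coke"},
    4: {"Diaper", "Eggs", "Apples"},
    5: {"Bread", "Beer", "Coke"},
}
items = sorted(set().union(*transactions.values()))


def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for basket in transactions.values() if itemset <= basket)


def frequent_itemsets(minsup):
    """Map every non-empty itemset with support >= minsup to its support."""
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo))
            if s >= minsup:
                result[frozenset(combo)] = s
    return result


def immediate_supersets(itemset):
    """All itemsets obtained by adding exactly one item."""
    return [itemset | {item} for item in items if item not in itemset]


def closed_itemsets(freq):
    """Frequent itemsets with no immediate superset of equal support."""
    return {s for s, sup in freq.items()
            if all(support(t) < sup for t in immediate_supersets(s))}


def maximal_itemsets(freq, minsup):
    """Frequent itemsets with no frequent immediate superset."""
    return {s for s in freq
            if all(support(t) < minsup for t in immediate_supersets(s))}


freq = frequent_itemsets(3)
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq, 3)
```

With minsup=3, {Beer} and {Coke} are frequent but not closed, because {Beer, Coke} has the same support of 3.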

Solutions for listing all maximal bicliques
A closed itemset together with its supporting transaction set corresponds exactly to a maximal biclique in the corresponding bipartite graph. For example, in the transaction example the closed itemset {Beer, Coke} with supporting transactions {2, 3, 5} is a maximal biclique.
Frequent closed itemsets are used to approximate the full set of closed itemsets.
MAFIA: mines frequent itemsets, frequent closed itemsets, and maximal frequent itemsets. http://himalaya-tools.sourceforge.net/Mafia/
[Figure: the (0,1)-matrix and bipartite graph of the transaction example, transactions 1-5 versus Bread, Milk, Diaper, Beer, Eggs, Coke, Apples]
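
The correspondence can be checked on the transaction example. The sketch below lists every closed itemset by intersecting groups of transactions (a brute-force stand-in for a miner such as MAFIA, whose interface is not shown here) and verifies that each one, paired with its supporting transactions, is a maximal biclique. All names are illustrative.

```python
from itertools import combinations

transactions = {
    1: {"Bread", "Diaper", "Eggs"},
    2: {"Beer", "Coke", "Apples"},
    3: {"Bread", "Milk", "Beer", "Coke"},
    4: {"Diaper", "Eggs", "Apples"},
    5: {"Bread", "Beer", "Coke"},
}


def tids_of(itemset):
    """Supporting transactions: ids of all baskets containing the itemset."""
    return frozenset(t for t, basket in transactions.items() if itemset <= basket)


def items_of(tids):
    """Items shared by every transaction in tids (tids must be non-empty)."""
    return frozenset(set.intersection(*(transactions[t] for t in tids)))


def closed_itemsets_with_tids():
    """Every non-empty closed itemset mapped to its supporting transactions."""
    found = {}
    ids = sorted(transactions)
    for k in range(1, len(ids) + 1):
        for group in combinations(ids, k):
            common = items_of(group)
            if common:
                found[common] = tids_of(common)
    return found


def is_maximal_biclique(tids, itemset):
    """No transaction and no item can be added without losing an edge."""
    return tids == tids_of(itemset) and itemset == items_of(tids)


bicliques = closed_itemsets_with_tids()
all_maximal = all(is_maximal_biclique(t, s) for s, t in bicliques.items())
```

Intersecting the baskets of any group of transactions always yields a closed itemset, which is why the enumeration needs no separate closedness test.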

Solution Summary for one entry (i,j)
– Construct the submatrix corresponding to the entry (i,j).
– Use a frequent closed itemset mining tool to build the frequent closed itemsets (set the support threshold as low as the computer can handle).
– Build the supporting transactions for the frequent closed itemsets; this yields all the candidate maximal bicliques.
– Find the maximum edge biclique and obtain the F(i,j) value.
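
The four steps can be sketched end to end on the transaction example. This is a simplified stand-in, not the authors' implementation: closed itemsets are found by intersecting transaction groups instead of calling a mining tool, F is assumed to be the edge count of the best candidate biclique, and `f_value` and its `minsup` parameter are hypothetical names.

```python
from itertools import combinations

transactions = {
    1: {"Bread", "Diaper", "Eggs"},
    2: {"Beer", "Coke", "Apples"},
    3: {"Bread", "Milk", "Beer", "Coke"},
    4: {"Diaper", "Eggs", "Apples"},
    5: {"Bread", "Beer", "Coke"},
}


def f_value(transactions, tid, item, minsup=1):
    """Return (F, supporting transactions, itemset) for entry (tid, item)."""
    # Step 1: submatrix = other transactions containing `item`,
    # restricted to the other items of transaction `tid`.
    wanted_items = transactions[tid] - {item}
    sub = {t: basket & wanted_items
           for t, basket in transactions.items()
           if t != tid and item in basket}
    ids = sorted(sub)

    # Steps 2 and 3: closed itemsets with support >= minsup, each paired
    # with its supporting transactions (the candidate maximal bicliques).
    candidates = {}
    for k in range(max(1, minsup), len(ids) + 1):
        for group in combinations(ids, k):
            common = frozenset(set.intersection(*(sub[t] for t in group)))
            if common:
                candidates[common] = frozenset(t for t in ids if common <= sub[t])

    # Step 4: keep the candidate with the most edges.
    if not candidates:
        return 0, frozenset(), frozenset()
    itemset, tids = max(candidates.items(),
                        key=lambda kv: len(kv[0]) * len(kv[1]))
    return len(itemset) * len(tids), tids, itemset


result = f_value(transactions, 2, "Bread")
too_strict = f_value(transactions, 2, "Bread", minsup=3)
```

For customer 2 and Bread, the best candidate is {3, 5} × {Beer, Coke}. Raising `minsup` to 3 discards it, which illustrates why the threshold should be set as low as is feasible.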

How about all entries?
The previous solution handles one entry. What about all entries of an m×n matrix? Simply repeating the previous calculation m×n times is not a wise choice.

IndEvi Algorithm in a Nutshell
Assume the input is the set of maximal bicliques of the original (0,1)-matrix.
Project each maximal biclique horizontally and vertically.
Let C be the maximal biclique shown by the shaded area. Can you figure out how to calculate F_C(i,j) for an entry (i,j)?
Each entry remembers the largest F_C(i,j) with respect to all C.
Please refer to the paper for the algorithm details.
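
One possible reading of the projection idea is sketched below; it is not the paper's algorithm, whose details are in the publication. Assumptions: only the row-wise projection is used, F_C(i,j) for a row i of C is taken as (number of other rows of C with a 1 in column j) × (number of columns of C other than j), and the maximal bicliques are enumerated by brute force rather than by a mining tool. The 8 × 10 matrix is the one from the single-entry example.

```python
from itertools import combinations

ROWS = ["1010011001", "1101101101", "0011010011", "0110111000",
        "1011000110", "0101111011", "1100101110", "1001010100"]
M = [[int(ch) for ch in row] for row in ROWS]


def maximal_bicliques(M):
    """Brute-force enumeration: close every row group into a maximal biclique."""
    n_rows, n_cols = len(M), len(M[0])
    found = set()
    for k in range(1, n_rows + 1):
        for group in combinations(range(n_rows), k):
            cols = frozenset(c for c in range(n_cols)
                             if all(M[r][c] for r in group))
            if not cols:
                continue
            rows = frozenset(r for r in range(n_rows)
                             if all(M[r][c] for c in cols))
            found.add((rows, cols))
    return found


def ind_evi_sketch(M, bicliques):
    """For every entry keep the largest F_C(i, j) and a reference to its C."""
    n_rows, n_cols = len(M), len(M[0])
    F = [[0] * n_cols for _ in range(n_rows)]
    ref = [[None] * n_cols for _ in range(n_rows)]
    for rows, cols in bicliques:
        for j in range(n_cols):
            ones_in_col = sum(1 for r in rows if M[r][j] == 1)
            other_cols = len(cols) - (1 if j in cols else 0)
            for i in rows:
                # Exclude row i itself so the entry's own value is ignored.
                other_rows = ones_in_col - (1 if M[i][j] == 1 else 0)
                value = other_rows * other_cols
                if value > F[i][j]:
                    F[i][j] = value
                    ref[i][j] = (rows, cols)
    return F, ref


F, ref = ind_evi_sketch(M, maximal_bicliques(M))
print(F[3][3])  # entry (4, d) in 0-based indices
```

Each maximal biclique is visited once and updates all entries in its rows, so the per-entry mining step is not repeated m×n times.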

IndEviRe Algorithm: Independent Evidence Reconstruction
The IndEvi algorithm ensures that each entry (i,j) remembers the largest F_C(i,j) value and the corresponding reference to C in the set of maximal bicliques.
The IndEviRe algorithm reconstructs the supporting pattern from that reference and the value of (i,j).
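
A reconstruction step in this spirit could look as follows. It is a sketch under the assumption that the stored reference is a maximal biclique containing row i; `reconstruct_support_pattern` is a hypothetical name and the paper's IndEviRe may differ. The data is the 8 × 10 matrix of the single-entry example, where the maximal biclique {2, 4, 6, 7} × {b, e, g} is the reference for entry (4,d).

```python
ROWS = ["1010011001", "1101101101", "0011010011", "0110111000",
        "1011000110", "0101111011", "1100101110", "1001010100"]
M = [[int(ch) for ch in row] for row in ROWS]


def reconstruct_support_pattern(M, biclique, i, j):
    """Rebuild the supporting pattern of (i, j) from a referenced biclique."""
    rows, cols = biclique
    if i not in rows:
        raise ValueError("the referenced biclique must contain row i")
    # Keep row i plus the biclique rows that also have a 1 in column j.
    pattern_rows = {i} | {r for r in rows if r != i and M[r][j] == 1}
    # Keep column j plus every other column of the biclique.
    pattern_cols = {j} | {c for c in cols if c != j}
    return sorted(pattern_rows), sorted(pattern_cols)


# Reference for entry (4, d): rows {2, 4, 6, 7} x columns {b, e, g}, 0-based.
reference = (frozenset({1, 3, 5, 6}), frozenset({1, 4, 6}))
pattern_rows, pattern_cols = reconstruct_support_pattern(M, reference, 3, 3)
print([r + 1 for r in pattern_rows], ["abcdefghij"[c] for c in pattern_cols])
```

Row 7 is dropped because it has a 0 in column d, leaving {2, 4, 6} × {b, d, e, g}, whose (3-1)×(4-1)=6 other-row, other-column cells match F(4,d)=6.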

Key theorem: unbiased predicting

Application in Prioritizing Human Disease Genes
Transactional data: gene-to-phenotype (G2P) dataset from http://human-phenotype-ontology.org (10/03/2010)
Closed itemset generator: MAFIA http://himalaya-tools.sourceforge.net/Mafia/
Platform: Linux, C++, STL
Cross-validation platform (10/04/2010): www.geneanswers.com (GACOM)

Measurement: Fold Enrichment
Intuitively, fold enrichment measures how well known disease genes are ranked among all genes.

Results
Among all 34503 (=|E|) known gene-phenotype relations, 4598 (x% = 13.3264%) have their gene ranked among the top 0.1107% (= y%) of the 1807 candidate genes, achieving a 120.4-fold enrichment (x/y = 13.3264/0.1107).
[Figure; axis label: Rank Cutoff]
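
The arithmetic behind the reported fold enrichment can be checked directly from the slide's numbers. The rank cutoff of 2 genes is inferred here from 0.1107% of 1807 and is an assumption, not a figure stated on the slide.

```python
known_relations = 34503   # |E|, known gene-phenotype relations
hits = 4598               # relations whose gene falls within the cutoff
candidate_genes = 1807
y_percent = 0.1107        # top fraction of candidate genes, from the slide

# x% is the share of known relations recovered within the cutoff.
x_percent = 100 * hits / known_relations
fold_enrichment = x_percent / y_percent

# Inferred, not stated on the slide: the number of genes inside the cutoff.
rank_cutoff = round(candidate_genes * y_percent / 100)

print(round(x_percent, 4), round(fold_enrichment, 1), rank_cutoff)
```

A fold enrichment of 1 would correspond to ranking known disease genes no better than chance, so a value around 120 indicates a strong concentration at the top of the ranking.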

Case Study: Colon Cancer

Case Study: Breast Cancer

Case Study: Osteoarthritis
Supporting pattern (by IndEviRe) for TNXB: {COL3A1, COL5A1, COL5A2, TNXB} × {AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, JOINT DISLOCATION, MITRAL VALVE PROLAPSE, SOFT SKIN, OSTEOARTHRITIS}
Supporting pattern (by IndEviRe) for VWF: {COL3A1, COL5A1, COL5A2, TNXB, VWF} × {AUTOSOMAL DOMINANT INHERITANCE, ECCHYMOSES, MITRAL VALVE PROLAPSE, OSTEOARTHRITIS}

Conclusion
– The supporting patterns for an entry in a (0,1)-matrix are a good resource for knowledge inference.
– Frequent closed itemset mining provides a practical platform for solving our problems.
– Given the maximal bicliques as input, the IndEvi and IndEviRe algorithms can efficiently calculate the F score and reconstruct the evidence for any entry.
– The result for an entry is independent of its original value (0 or 1).
– Only one call of frequent closed itemset mining on the original matrix is necessary.
– Readers may revise the F function for different applications.
– The algorithm is simple to implement, and the result is easy to analyze.
– Our method has a wide range of applications. The study on human gene-phenotype data shows that our method is efficient and effective.

