Data Mining Techniques For Correlating Phenotypic Expressions With Genomic and Medical Characteristics This work has been supported by DTC, IBM and NSF.


Rohit Gupta, Blayne Field, Michael Steinbach, Vipin Kumar, Rich Mushlin*, Fred Kulack+
Department of Computer Science and Engineering, University of Minnesota (200 Union Street SE, Minneapolis MN USA)
*IBM T. J. Watson Research Center, +IBM Rochester

Acknowledgements
This work has been supported by DTC, IBM and an NSF grant. Computational resources for this work were provided by the Minnesota Supercomputing Institute.

References
- R. Mushlin, A. Kirshenbaum, S. Gallagher, T. Rebbeck, A graph-theoretical approach for pattern discovery in epidemiological research, IBM Systems Journal 46, No. 1 (2007)
- Jason H. Moore, Marylyn D. Ritchie, The Challenges of Whole-Genome Approaches to Common Diseases, JAMA
- L. Bastone, M. Reilly, D. J. Rader, and A. S. Foulkes, MDR and PRP: A Comparison of Methods for High-Order Genotype-Phenotype Associations, Human Heredity 58, No. 2, 2-92 (2004)
- A. S. Foulkes, M. Reilly, L. Zhou, M. Wolfe, and D. J. Rader, Mixed Modeling to Characterize Genotype Phenotype Associations, Statistics in Medicine 24, No. 5 (2005)
- A. Hattersley and M. McCarthy, What makes a good genetic association study?, The Lancet, Volume 366, Issue 9493, Oct. 2005
- J. K. Seppänen and H. Mannila, Dense itemsets, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Seattle, WA, USA, August 2004), KDD '04, ACM Press, New York
- P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Pearson Addison-Wesley, May 2005

INTRODUCTION

Project Motivation
- Obtaining genomic information is increasingly affordable.
- Single Nucleotide Polymorphisms (SNPs) offer the potential to test for disease or susceptibility to disease.
- Electronic medical records (EMRs) are becoming increasingly common.
- Automated analysis of patient information is now possible.
- This revolution in genetic and medical information potentially leads to personalized medicine, i.e., using detailed genomic and medical information about a person for the detection, treatment, or prevention of disease.

Problem Formulation
Given: a patient data set that records
- Phenotypic expression (disease)
- Genetic characteristics
- Medical characteristics
Objective: find patterns combining medical and genetic characteristics that best define the phenotypic expression under study.
Challenges:
- High dimensionality and low sample size
- Combinatorial explosion
- Noise
- Non-linear interactions

METHODS

Data Set
Genetic data (SNPs):
- Simulated SNP data generated from known models has been used for this study. Approximately 2000 case records and 6000 control records have been generated.
- Real SNP data for Parkinson's disease and myeloma.

Association Analysis
- Patients' genetic information (SNPs) is represented as a binary matrix, with disease (yes/no) as the class label.
- Existing techniques include statistical association analysis, Logistic Regression, Multifactor Dimensionality Reduction, CART, Random Forests, etc.
- Data mining-based association analysis is applied to find patterns that capture the connections between SNPs and disease.
- Frequent closed itemsets capture SNP patterns where all SNPs must be present.
- Error-tolerant itemsets (ETIs) capture more general SNP patterns, where not all SNPs need to occur in all patients defining the pattern.
- ETI example: with an error tolerance of 1/4, each transaction needs to contain at least 3/4 (75%) of the items. In the example data, {i1, i2, i3, i4} and {i5, i6, i7, i8} are both ETIs with a support of 4.
- Based on the disease variable, patients are categorized as cases or controls. First, we find patterns (closed itemsets or ETIs) in cases and then check for their presence in control patients. Odds Ratio (OR) and P-value metrics (as described below) are used to evaluate the identified patterns.

Workflow:
1. Find strong patterns in cases.
2. Evaluate the strength of the patterns in controls.
3. Rank all the patterns using OR and P-value to obtain the final results.

Evaluation Measures

Figures of Merit for 2 x 2 table

                   Cases     Controls     Row Margins
With Pattern       a         b            Nwith
Without Pattern    c         d            Nwithout
Column Margins     Ncases    Ncontrols    Ntotal

a, b, c, and d are the number of cases with the pattern, controls with the pattern, cases without the pattern, and controls without the pattern, respectively.

- There are many different figures of merit (FOM), i.e., functions of a, b, c, d, that can be used to characterize the table.
- We use the odds ratio (OR) and the P-value (P).
- OR quantifies how different cases and controls are for a specific pattern: OR = (a*d) / (b*c).
- P quantifies the significance of the difference reflected by OR. P is the probability of a table (as shown above) with the same fixed margins having a higher (or the same) OR.

Figure: Probability distribution, p, as a function of odds ratio, OR, for Ntotal = 1000 and several sets of margins (full range of points is shown). The margins in the legend are in the order Ncases, Ncontrols, Nwith, Nwithout.

RESULTS AND DISCUSSIONS

Itemset                    Odds Ratio    -log10(P-value)
aa1 aa2 aa3 aa4            5.442         5.452
Aa1 aa2 aa4 Aa8            1.661         3.935
aa1 Aa2 aa3 AA5 AA6        3.002         3.770
Aa1 aa2 AA5 AA6 AA7 Aa8    3.845         3.739
aa1 aa2 AA7 Aa8            1.934         3.661
aa1 aa2 aa3 AA5            2.844         3.541
aa1 aa3 AA5 AA6            1.965         3.503
aa2 aa3 AA5 Aa7 Aa8        2.177         3.448
aa2 aa3 AA5 Aa7            1.682         3.421
aa1 aa3                    2.486         3.414

Conclusions
- Various association analysis algorithms have been applied to find connections between genetic characteristics (SNPs) and disease.
- Techniques for finding closed itemsets have proven effective for finding SNP patterns in synthetic data.
- Algorithms for finding ETIs have shown promise, but the evaluation is not complete.
- Computational demands of the algorithms are high.
- Odds Ratio and P-value are found to be the best indicators of real patterns for synthetic SNP data. They are also found to be highly correlated with other similarity measures.
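The error-tolerant matching and the case/control counting described under Association Analysis can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each patient is represented as a set of item labels, it applies only the row-wise condition stated in the text (a patient matches if it holds at least a 1 - tolerance fraction of the pattern's items), and the names contains_pattern, support and contingency_table are hypothetical.

```python
from fractions import Fraction


def contains_pattern(row, items, tolerance=Fraction(0)):
    """Return True if `row` (a set of items) holds at least a
    (1 - tolerance) fraction of `items`.

    tolerance = 0 demands every item, as for an exact itemset;
    tolerance = 1/4 lets one of four items be missing.
    """
    present = sum(1 for item in items if item in row)
    return present >= (1 - Fraction(tolerance)) * len(items)


def support(rows, items, tolerance=Fraction(0)):
    """Number of rows that contain the pattern under the given tolerance."""
    return sum(contains_pattern(row, items, tolerance) for row in rows)


def contingency_table(cases, controls, items, tolerance=Fraction(0)):
    """Return (a, b, c, d): cases with the pattern, controls with it,
    cases without it, controls without it."""
    a = support(cases, items, tolerance)
    b = support(controls, items, tolerance)
    return a, b, len(cases) - a, len(controls) - b


# Toy data, invented for illustration only.
cases = [{"i1", "i2", "i3"}, {"i1", "i2", "i4"},
         {"i1", "i3", "i4"}, {"i2", "i3", "i4"}, {"i5"}]
controls = [{"i1", "i2", "i3", "i4"}, {"i1"}, {"i2", "i5"}]
pattern = ["i1", "i2", "i3", "i4"]

print(support(cases, pattern, Fraction(1, 4)))                      # 4
print(support(cases, pattern))                                      # 0
print(contingency_table(cases, controls, pattern, Fraction(1, 4)))  # (4, 1, 1, 2)
```

In the toy data no case contains all four items, so the exact itemset has support 0, while a tolerance of 1/4 yields a support of 4. Using Fraction keeps the 3/4 threshold exact instead of relying on floating-point comparison.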


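The two figures of merit defined for the 2 x 2 table can be sketched as follows. This is a hedged reading of the definitions, not the authors' code: with all margins fixed, the table is determined by a alone and OR grows with a, so "a table with the same margins having a higher (or the same) OR" is taken here to be the upper tail of a hypergeometric distribution (a one-sided Fisher-style exact test). The function names are hypothetical, and math.comb requires Python 3.8 or later.

```python
from math import comb, inf, log10


def odds_ratio(a, b, c, d):
    """OR = (a*d) / (b*c); inf when b*c is zero but a*d is not."""
    if b * c == 0:
        return inf if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def p_value(a, b, c, d):
    """Probability, under fixed margins, of a table whose OR is at least
    as large as the observed one (upper hypergeometric tail in `a`)."""
    n_with = a + b            # row margin: patients with the pattern
    n_cases = a + c           # column margin: cases
    n_total = a + b + c + d
    n_controls = n_total - n_cases
    upper = min(n_with, n_cases)   # largest value `a` can take
    tail = sum(comb(n_cases, k) * comb(n_controls, n_with - k)
               for k in range(a, upper + 1))
    return tail / comb(n_total, n_with)


def neg_log10_p(a, b, c, d):
    """-log10 of the P-value, the scale used in the results table."""
    return -log10(p_value(a, b, c, d))


# Toy table, invented for illustration: a=4, b=1, c=1, d=2.
print(odds_ratio(4, 1, 1, 2))   # 8.0
print(p_value(4, 1, 1, 2))      # 16/56, about 0.2857
```

Patterns can then be ranked by sorting on neg_log10_p or odds_ratio in descending order. For tables as large as those in the study, the exact binomial coefficients stay correct because Python integers have arbitrary precision, although a dedicated statistics library would likely be faster.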
