Multifactor Dimensionality Reduction Laura Mustavich Introduction to Data Mining Final Project Presentation April 26, 2007.



The Inspiration For a Method

The Nature of Complex Diseases: Most common diseases are complex. They are caused by multiple genes, often interacting with one another. This interaction is termed epistasis.

Epistasis: When an allele at one locus masks the effect of an allele at another locus.

The Failure of Traditional Methods: Traditional gene-hunting methods have been successful for rare Mendelian (single-gene) diseases but unsuccessful for complex diseases, for two reasons. Because many genes interact to cause the disease, the effect of any single gene is too small to detect. In addition, these methods do not take the interaction into account.

MDR: The Algorithm

Multifactor Dimensionality Reduction: A data mining approach for identifying interactions among discrete variables that influence a binary outcome. A nonparametric alternative to traditional statistical methods such as logistic regression. Driven by the need to improve the power to detect gene-gene interactions.

MDR Step 0: Divide the data (genotypes, discrete environmental factors, and affection status) into 10 distinct subsets for cross-validation.
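A minimal sketch of Step 0, assuming a simple shuffled split; the slides do not say how the MDR software forms its subsets, so the function name `ten_fold_split`, the seed, and the use of record indices are illustrative assumptions.

```python
import random

def ten_fold_split(records, k=10, seed=0):
    """Shuffle the records and deal them into k disjoint, near-equal subsets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Subset i takes every k-th record starting at position i.
    return [shuffled[i::k] for i in range(k)]

# Hypothetical example: 79 records identified by their index.
folds = ten_fold_split(range(79))
```

Each subset serves once as the testing set while the other nine form the training set.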

MDR Step 1: From all variables in the training set, select a set of n genetic or environmental factors that are suspected of acting epistatically together.

MDR Step 2: Create a contingency table over the multilocus genotypes of the n selected factors, counting the number of affected and unaffected individuals with each multilocus genotype.
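The counting in Step 2 might look like the following sketch. The data layout (a list of `(genotypes, status)` pairs), the marker names `snp1` and `snp2`, and the toy records are assumptions made for illustration, not the format used by the MDR software.

```python
from collections import Counter

def contingency_table(rows, factors):
    """Count cases and controls for each multilocus genotype.

    rows: list of (genotypes, status) pairs, where genotypes maps a factor
    name to a genotype string and status is 1 (case) or 0 (control).
    """
    cases, controls = Counter(), Counter()
    for genotypes, status in rows:
        # A multilocus genotype is the tuple of genotypes at the chosen factors.
        cell = tuple(genotypes[f] for f in factors)
        if status == 1:
            cases[cell] += 1
        else:
            controls[cell] += 1
    return cases, controls

# Hypothetical toy data: two markers, five individuals.
rows = [
    ({"snp1": "A/A", "snp2": "C/T"}, 1),
    ({"snp1": "A/A", "snp2": "C/T"}, 1),
    ({"snp1": "A/A", "snp2": "C/T"}, 0),
    ({"snp1": "A/G", "snp2": "C/C"}, 0),
    ({"snp1": "A/G", "snp2": "C/C"}, 0),
]
cases, controls = contingency_table(rows, ["snp1", "snp2"])
```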

MDR Step 3: Calculate the ratio of cases to controls for each multilocus genotype.

MDR Step 4: Label each multilocus genotype as "high-risk" or "low-risk", depending on whether its case-control ratio is above a certain threshold. This is the dimensionality reduction step: it reduces the n-dimensional genotype space to one dimension with two levels.
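Steps 3 and 4 can be sketched together as below. The slides only say "a certain threshold", so the default of 1.0 is an assumption, as are the function name `label_cells` and the toy counts. The comparison is written as a product so that a genotype with cases but no controls (an infinite ratio) is handled without dividing by zero.

```python
def label_cells(cases, controls, threshold=1.0):
    """Label each multilocus genotype by its case:control ratio."""
    labels = {}
    for cell in set(cases) | set(controls):
        n_cases = cases.get(cell, 0)
        n_controls = controls.get(cell, 0)
        # Equivalent to n_cases / n_controls > threshold, without the division.
        if n_cases > threshold * n_controls:
            labels[cell] = "high-risk"
        else:
            labels[cell] = "low-risk"
    return labels

# Hypothetical counts for three two-locus genotypes.
cases = {("A/A", "C/T"): 2, ("G/G", "T/T"): 3}
controls = {("A/A", "C/T"): 1, ("A/G", "C/C"): 2}
labels = label_cells(cases, controls)
```

Whatever the number of factors, the output is a single two-level variable, which is the reduction the slide describes.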

MDR Step 5: Use the labels to classify individuals as cases or controls, and calculate the misclassification rate.
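A sketch of Step 5, assuming the same illustrative `(genotypes, status)` layout as before. How the software treats a genotype that was never seen in training is not stated in the slides; predicting "low-risk" for it (the `default` argument) is an assumption.

```python
def misclassification_rate(rows, factors, labels, default="low-risk"):
    """Fraction of individuals whose predicted status differs from the true one."""
    errors = 0
    for genotypes, status in rows:
        cell = tuple(genotypes[f] for f in factors)
        # High-risk genotypes are predicted to be cases (1), the rest controls (0).
        predicted = 1 if labels.get(cell, default) == "high-risk" else 0
        if predicted != status:
            errors += 1
    return errors / len(rows)

# Hypothetical one-locus model and four individuals.
labels = {("A/A",): "high-risk", ("A/G",): "low-risk"}
rows = [
    ({"snp1": "A/A"}, 1),
    ({"snp1": "A/A"}, 0),
    ({"snp1": "A/G"}, 0),
    ({"snp1": "G/G"}, 1),  # unseen genotype, falls back to the default label
]
rate = misclassification_rate(rows, ["snp1"], labels)
```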

Repeat steps 1-5 for all possible combinations of n factors, for all possible values of n, and across all 10 training and testing sets.

The Best Model: The best model minimizes prediction error (the average misclassification rate across the 10 cross-validation subsets) and maximizes cross-validation consistency (the number of times a particular model was the best model across the cross-validation subsets).
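The whole procedure, for one value of n, can be sketched as follows. This is an illustration under stated assumptions rather than the MDR software's implementation: the threshold of 1.0, the unshuffled fold assignment, picking the per-fold winner by training error, and the synthetic data (status is the XOR of coded factors A and B, with C as noise) are all hypothetical.

```python
from collections import Counter
from itertools import combinations, product

def high_risk_cells(rows, factors, threshold=1.0):
    """Steps 2-4: return the set of multilocus genotypes labelled high-risk."""
    cases, controls = Counter(), Counter()
    for genotypes, status in rows:
        cell = tuple(genotypes[f] for f in factors)
        (cases if status else controls)[cell] += 1
    return {c for c in cases if cases[c] > threshold * controls[c]}

def error_rate(rows, factors, high_risk):
    """Step 5: misclassification rate of the high-risk/low-risk rule."""
    wrong = sum((tuple(g[f] for f in factors) in high_risk) != bool(y)
                for g, y in rows)
    return wrong / len(rows)

def mdr(rows, all_factors, n, k=10):
    folds = [rows[i::k] for i in range(k)]
    splits = [([r for j, fold in enumerate(folds) if j != i for r in fold], folds[i])
              for i in range(k)]
    winners = Counter()
    for train, _ in splits:
        # Best n-factor combination on this training set.
        best = min(combinations(all_factors, n),
                   key=lambda fs: error_rate(train, fs, high_risk_cells(train, fs)))
        winners[best] += 1
    # Cross-validation consistency: how often the most frequent winner won.
    model, consistency = winners.most_common(1)[0]
    # Prediction error: average testing error of that model over the k splits.
    prediction_error = sum(error_rate(test, model, high_risk_cells(train, model))
                           for train, test in splits) / k
    return model, consistency, prediction_error

# Synthetic data: 120 individuals, status = A XOR B, C is irrelevant.
rows = [({"A": a, "B": b, "C": c}, a ^ b)
        for _ in range(10) for a, b, c in product((0, 1), (0, 1), (0, 1, 2))]
model, consistency, prediction_error = mdr(rows, ["A", "B", "C"], n=2)
```

On this synthetic data neither A nor B has any effect alone, yet the pair (A, B) is selected in every fold, which is the kind of purely epistatic signal MDR is designed to find.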

Hypothesis test of the best model: Evaluate the magnitude of the cross-validation consistency and prediction error estimates by permutation testing. Randomize the disease labels. Repeat the randomization and the MDR analysis several times to obtain the distributions of cross-validation consistency and prediction error expected by chance. Use these distributions to determine p-values for the actual cross-validation consistency and prediction error.
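A minimal sketch of the permutation test. The function `permutation_p_value` and the stand-in statistic `majority_accuracy` are hypothetical; in a real analysis the statistic would be a full MDR run returning cross-validation consistency or (negated) prediction error. The add-one correction in the p-value is a common convention and is an assumption here, not something stated in the slides.

```python
import random

def permutation_p_value(observed, statistic, status, n_perm=1000, seed=0):
    """Estimate how often shuffled labels give a statistic >= the observed one."""
    rng = random.Random(seed)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between genotype and disease
        if statistic(labels) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical data: one marker with two genotypes, 40 individuals.
genotypes = ["A/A"] * 20 + ["A/G"] * 20
status = [1] * 18 + [0] * 2 + [0] * 18 + [1] * 2

def majority_accuracy(labels):
    """Stand-in statistic: accuracy of predicting the majority class per genotype."""
    correct = 0
    for g in set(genotypes):
        ys = [y for x, y in zip(genotypes, labels) if x == g]
        correct += max(sum(ys), len(ys) - sum(ys))
    return correct / len(labels)

observed = majority_accuracy(status)
p = permutation_p_value(observed, majority_accuracy, status)
```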

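Permutation Testing: An Illustration. [Table of sample quantiles of the permutation distribution; the quantile levels and values are missing.] The probability that we would see results as extreme as, or more extreme than, the observed value [missing] simply by chance is between 5% and 10%.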

Strengths: Facilitates simultaneous detection and characterization of multiple genetic loci associated with a discrete clinical endpoint by reducing the dimensionality of the multilocus data. Nonparametric: no parameters are estimated. Assumes no particular genetic model. False positives arising from multiple testing are minimized, through cross-validation and permutation testing.

Weaknesses: Computationally intensive (especially with more than 10 loci). The curse of dimensionality: predictive ability decreases with high dimensionality and small samples, because many multilocus genotype cells contain no data.

MDR Software

The Authors: "Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions." Hahn, Ritchie, Moore.

Values Calculated by MDR (Measure: Formula; Interpretation)
Balanced Accuracy: (Sensitivity + Specificity)/2; the fitness measure. Accuracy is skewed in favor of the larger class, whereas balanced accuracy gives equal weight to each class.
Accuracy: (TP + TN)/(TP + TN + FP + FN); proportion of instances correctly classified.
Sensitivity: TP/(TP + FN); proportion of actual positives correctly classified.
Specificity: TN/(TN + FP); proportion of actual negatives correctly classified.
Odds Ratio: (TP*TN)/(FP*FN); compares whether the odds of a certain event are the same for two groups.
χ²: chi-squared score for the attribute constructed by MDR from this attribute combination.
Precision: TP/(TP + FP); proportion of predicted positives that are actual positives.
Kappa: 2(TP*TN - FP*FN)/[(TP + FN)(FN + TN) + (TP + FP)(FP + TN)]; a function of total accuracy and random accuracy.
F-Measure: 2*TP/(2*TP + FP + FN); a function of sensitivity and precision.

Sign Test: n = number of cross-validation intervals; c = number of cross-validation intervals with testing accuracy ≥ 0.5. The p-value is the probability of observing c or more cross-validation intervals with testing accuracy ≥ 0.5 if each case were actually classified randomly.
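The formula on this slide did not survive in the transcript. A sketch under the usual sign-test assumption, that under random classification each interval independently reaches accuracy ≥ 0.5 with probability 1/2, gives the binomial tail p = Σ_{k=c}^{n} C(n, k) / 2^n. The function name `sign_test_p` is illustrative.

```python
from math import comb

def sign_test_p(c, n=10):
    """P(at least c of n intervals reach accuracy >= 0.5 under random classification)."""
    return sum(comb(n, k) for k in range(c, n + 1)) / 2 ** n

# Tail probabilities for every possible count with n = 10.
p_values = {c: round(sign_test_p(c), 4) for c in range(11)}
```

With n = 10 this gives 0.0010 for c = 10, 0.0547 for c = 8, and 0.6230 for c = 5, which match the sign-test p-values reported in the results summaries.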

The Problem of Alcoholism A Case Study

Genes Associated With Alcoholism: [Diagram: alcohol is converted to acetaldehyde by the ADH enzymes, and acetaldehyde is converted to acetate by the ALDH2 enzyme.] The ADH (alcohol dehydrogenase) and ALDH2 (acetaldehyde dehydrogenase 2) genes are involved in alcohol metabolism and are associated with alcoholism.

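ADH Genes: [Figure: map of the ADH gene cluster on chromosome 4, spanning 370 kb and drawn 5' to 3'. Genes shown: ADH7, ADH6, ADH4, ADH5, ADH1B, ADH1A, ADH1C. Class labels shown: Class IV, Class I, Class V, Class II, Class III.]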

Taste Receptors and Aversion to Alcohol: [Diagram: PTC (TAS2R38). Tasters: alcohol tastes bitter, so they drink less alcohol. Non-tasters: alcohol tastes sweet, so they drink more alcohol.] A person must be willing to drink in order to be an alcoholic. TAS2R38 affects the amount of alcohol a person is willing to drink; therefore, it is related to alcoholism, although no direct association has been found. We hope to provide a direct link between TAS2R38 and alcoholism by demonstrating that it acts epistatically with other genes associated with alcoholism.

Actual Analysis

Data: A sample of cases and controls (alcoholics and non-alcoholics) from three East Asian populations: the Ami, Atayal, and Taiwanese. Genotyped for 98 markers within several genes: ALDH2, all ADH genes, and two taste receptor genes, TAS2R16 and TAS2R38 (PTC).

Computational Limitations 1: The software package has a problem reading missing data. I was forced to use only complete records, which reduced my (already small) sample to 79 complete records.

Computational Limitations 2: The computation time is too long for higher-order models, especially with a large number of attributes. I was advised to restrict my attributes to markers within ADH1C and the two taste receptor genes, which left me with 36 attributes. I considered models only up to order 4.
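A rough count shows why the search had to be restricted. The sketch below uses the 36 attributes, 79 records, and maximum order 4 from the slides; the assumption of three possible genotypes per marker (a biallelic marker) is mine and is used only to count multilocus genotype cells.

```python
from math import comb

n_attributes, n_records, max_order = 36, 79, 4

for order in range(1, max_order + 1):
    models = comb(n_attributes, order)  # attribute combinations to evaluate
    cells = 3 ** order                  # multilocus genotypes per combination
    print(f"order {order}: {models} models, {cells} cells for {n_records} records")

# Every model is fitted once per cross-validation training set.
total_models = sum(comb(n_attributes, k) for k in range(1, max_order + 1))
total_fits = 10 * total_models
```

At order 4 there are already more genotype cells (81) than records (79), so some cells are necessarily empty, which is the curse-of-dimensionality problem noted under Weaknesses.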

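Summary of Results: All Populations
Instances: 79. Attributes: 36. Ratio: [missing].
Columns: Order; Model; Training Bal. Acc.; Testing Bal. Acc.; Sign Test (p); CV Consistency. [The balanced accuracies and sign test counts are missing; only the p-values and CV consistencies remain.]
Order 1: X.04..ADH1C.dwstrm.Te; sign test p = 1.0000; CV consistency 5/10
Order 2: X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te; sign test p = 0.9453; CV consistency 6/10
Order 3: X.07..TAS2R16.C_11431, X.04..ADH1C.dwstrm.Te, X.04..ADH1C.rs; sign test p = 0.9990; CV consistency 4/10
Order 4: X.07..TAS2R16.C_11431, X.07..PTC.C_ _1, X.07..PTC.C_ _1, X.04..ADH1C.dwstrm.Te; sign test p = 0.9893; CV consistency 6/10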

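Summary of Results: Ami
Instances: 30. Attributes: 36. Ratio: [missing].
Columns: Order; Model; Training Bal. Acc.; Testing Bal. Acc.; Sign Test (p); CV Consistency. [The balanced accuracies and sign test counts are missing; only the p-values and CV consistencies remain.]
Order 1: X.07..TAS2R16.C_; sign test p = 0.6230; CV consistency 5/10
Order 2: X.07..TAS2R16.C_11431, X.04..ADH1C.C_; sign test p = 0.9893; CV consistency 3/10
Order 3: X.07..TAS2R16.C_11431, X.07..PTC.C_ _1, X.04..ADH1C.C_; sign test p = 0.0010; CV consistency 10/10
Order 4: X.07..TAS2R16.C_11431, X.07..TAS2R16.C_, X.07..PTC.C_ _1, X.04..ADH1C.C_; sign test p = 0.0547; CV consistency 9/10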

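Cross-Validation Statistics (training and testing sets)
Measures reported: balanced accuracy, accuracy, sensitivity, specificity, odds ratio, χ², precision, kappa, F-measure. [Most values are missing.]
Sensitivity: 1 (training), 1 (testing).
Odds ratio: ∞ (training), ∞ (testing).
Sign test: 10 (p-value missing). Cross-validation consistency: 10/10.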

Whole Dataset Statistics (training): balanced accuracy, accuracy, sensitivity, specificity, odds ratio (∞), χ², precision, kappa, F-measure. [All other values are missing.]

Graphical Model

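Classification Rules
Each row is a rule; for example, the first row reads: IF A\A AND C\G AND C\C THEN 0. [The PTC genotype is missing in the last three rows.]
X.07..TAS2R16.C_11431   X.07..PTC.C_ _1   X.04..ADH1C.C_   Class
A\A                     C\G               C\C              0
A\A                     C\G               C\T              1
A\A                     C\G               T\T              0
A\A                     G\G               C\C              0
A\A                     G\G               C\T              0
A\A                     G\G               T\T              1
A\G                     C\C               C\T              1
A\G                     C\G               C\C              0
A\G                     C\G               C\T              0
A\G                     C\G               T\T              0
A\G                     G\G               C\C              1
A\G                     G\G               C\T              1
A\G                     G\G               T\T              0
G\G                     C\G               C\T              1
G\G                     [missing]         C\C              0
G\G                     [missing]         C\T              1
G\G                     [missing]         T\T              1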

Locus Dendrogram

Future Work: Simulations to calculate the power of MDR, especially in relation to sample size. Comparison of MDR with logistic regression and other proposed methods for detecting epistasis, with respect to the current data set and simulated data. Research into how different methods of searching the sample space can be incorporated into the MDR implementation to improve computational feasibility.