# Detecting Novel Associations in Large Data Sets

A pragmatic discussion of "Detecting Novel Associations in Large Data Sets" by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti

David Reshef: I am currently an MD/PhD student in the Harvard-MIT Health Sciences and Technology (HST) program. Previously, I studied statistics at the University of Oxford, and computer science at MIT. I am broadly interested in the areas of machine learning, statistical inference, and information theory. My work focuses on developing tools for identifying structure in large datasets using techniques from these fields.

Yakir Reshef: I am a Fulbright scholar in the Department of Applied Math and Computer Science at the Weizmann Institute of Science. I am interested in using mathematics, statistics, and computer science to develop methods for making sense of large datasets in fields like genomics, neuroscience, and public health. In 2009, I graduated from Harvard College with a BA in mathematics. I wrote my undergraduate thesis with Salil Vadhan.

Sean Patrick Murphy

Getting Started
- Blog overview
- MINE code (Java-based with Python and R wrappers)
- MINE homepage
- Science article and supplemental information

So who actually read the paper?
A statistic (singular) is a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items comprising the sample which are known together as a set of data.

Outline: Motivation, Explanation, Application

Motivation: The Problem
- 10,000+ variables
- Hundreds, thousands, or millions of observations
- Your boss wants you to find all possible relationships between all the different variable pairs
- Where do you start?

Motivation: Scatter Plots?

Motivation: 50 variables → 1,225 different scatter plots to examine!
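The figure on this slide is the number of unordered variable pairs, n(n-1)/2. As a quick check, here is a minimal Python sketch; the function name `pair_count` is just an illustrative choice, not something from the paper or the MINE code.

```python
from math import comb


def pair_count(n_variables: int) -> int:
    """Number of unordered variable pairs, i.e. scatter plots to inspect."""
    return comb(n_variables, 2)


# The slide's case, plus the larger sizes mentioned under "The Problem".
for n in (50, 1_000, 10_000):
    print(n, pair_count(n))
```

At 10,000 variables the count is close to fifty million pairs, which is why looking at every scatter plot by eye is not an option.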

Motivation: Other Options?
- Correlation matrix
- Factor analysis / principal component analysis
- Audience recommendations?

Motivation: Possible Problems
- A large number of possible relationships
- Each has a different statistical test
- You need a hypothesis about the relationship that might be present in the data
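To make the last point concrete, the sketch below (plain Python, with a hand-written `pearson_r` helper chosen for illustration) scores a straight line and a parabola over the same x values. Pearson's r assumes a linear relationship, so it should report almost nothing for the parabola even though y is fully determined by x.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [i / 50 - 1 for i in range(101)]   # 101 evenly spaced points on [-1, 1]
line = [2 * x + 1 for x in xs]          # noiseless linear relationship
parabola = [x * x for x in xs]          # noiseless but non-monotonic

print("line:    ", round(pearson_r(xs, line), 6))
print("parabola:", round(pearson_r(xs, parabola), 6))
```

The parabola's near-zero score is the kind of miss that motivates a measure that does not require guessing the form of the relationship in advance.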

Motivation: Desired Properties
- Generality: the correlation coefficient should be sensitive to a wide range of possible dependencies, including superpositions of functions.
- Equitability: the score of the coefficient should be influenced by noise, but not by the form of the dependency between variables.

Explanation: Enter the Maximal Information Coefficient (MIC)

Explanation: Algorithm Intuition

Explanation: We have a dataset D. [Figure: scatter plot of D, with axes x and y]

Explanation: Definition of mutual information (for discrete random variables)

$$I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$

Mutual information is a measure of dependence in the following sense: I(X; Y) = 0 if and only if X and Y are independent random variables. This is easy to see in one direction: if X and Y are independent, then p(x, y) = p(x) p(y), and therefore:

$$\log \frac{p(x, y)}{p(x)\, p(y)} = \log 1 = 0$$
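The definition can be evaluated directly from a joint probability table. The following is a minimal sketch, assuming the joint distribution is given as a nested list and using base-2 logarithms (bits); the function name and the two toy tables are illustrative only.

```python
import math


def mutual_information(joint):
    """I(X; Y) in bits, where joint[i][j] = p(x_i, y_j) and entries sum to 1."""
    px = [sum(row) for row in joint]          # marginal p(x)
    py = [sum(col) for col in zip(*joint)]    # marginal p(y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                       # 0 * log(0) is treated as 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi


independent = [[0.25, 0.25], [0.25, 0.25]]    # p(x, y) = p(x) p(y) everywhere
dependent = [[0.5, 0.0], [0.0, 0.5]]          # X determines Y exactly

print(mutual_information(independent))
print(mutual_information(dependent))
```

The independent table gives zero, matching the argument above, while the fully dependent 2x2 table gives one bit.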

Explanation: [Figure: three grids labeled MI = 0.5, MI = 0.6, and MI = 0.7; caption: Maximum mutual information]

Explanation: Characteristic Matrix
We have to normalize by min{log x, log y}, where x and y are the numbers of columns and rows in the grid, to enable comparison across grids.
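As a rough illustration of that normalization, the sketch below takes the cell counts of one already-chosen grid and returns its mutual information divided by log min{x, y}. It assumes the grid has been fixed beforehand and does not search for the best grid; the function name and the example counts are made up for illustration.

```python
import math


def normalized_mi_from_counts(counts):
    """Mutual information of a grid of cell counts, divided by log2(min(rows, cols))."""
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for r, row in enumerate(counts):
        for c, cell in enumerate(row):
            if cell:  # empty cells contribute nothing
                mi += cell / n * math.log2(cell * n / (row_totals[r] * col_totals[c]))
    return mi / math.log2(min(len(counts), len(counts[0])))


# A 2x3 grid (2 rows, 3 columns) holding 50 points.
grid = [[20, 5, 0],
        [0, 5, 20]]
print(normalized_mi_from_counts(grid))
```

Because mutual information cannot exceed the log of the smaller grid dimension, the normalized value lands between 0 and 1, which is what lets a 2x3 grid be compared with, say, a 4x4 grid.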

Explanation: [Figure: 2x3 grids labeled MI = 0.65, MI = 0.56, and MI = 0.71]

Explanation: This highest value is the Maximal Information Coefficient (MIC)
- Every entry of the characteristic matrix is between 0 and 1, inclusive
- MIC(X, Y) = MIC(Y, X): the statistic is symmetric
- MIC is invariant under order-preserving transformations of the axes
- This surface is just a 3D representation of the characteristic matrix.

Explanation: How Big is the Characteristic Matrix?
- Technically, infinite in size
- This is unwieldy
- So we bound the grid size: xy < B(n) = n^0.6, where n is the number of data points
- This is an empirically set value

Unlike R-squared, the method of Reshef et al. depends on a tuning parameter that controls the level of discretization, in a “How long is the coast of Britain” sort of way. The dependence on scale is inevitable for such a general method. Just consider: if you sample 1000 points from the unit bivariate normal distribution, (x,y) ~ N(0,I), you’ll be able to fit them perfectly by a 999-degree polynomial fit to the data. So the scale of the fit matters.
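To see what the bound does in practice, here is a hedged Python sketch that enumerates every grid shape with xy < n^0.6 and takes the highest normalized mutual information over them. It is not the authors' algorithm: as a simplifying assumption it places grid lines by equal-frequency (rank) binning instead of searching for the best partition, so it can only underestimate the true MIC. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def grid_shapes(n, exponent=0.6):
    """All (x, y) grid sizes with at least 2 bins per axis and x * y < n ** exponent."""
    limit = n ** exponent
    sizes = range(2, int(limit) + 1)
    return [(x, y) for x in sizes for y in sizes if x * y < limit]


def rank_bins(values, n_bins):
    """Equal-frequency bin index (0 .. n_bins - 1) for each value, assigned by rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


def normalized_grid_mi(xs, ys, nx, ny):
    """Mutual information of the binned data, divided by log2(min(nx, ny))."""
    n = len(xs)
    bx, by = rank_bins(xs, nx), rank_bins(ys, ny)
    joint, cx, cy = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = sum(c / n * math.log2(c * n / (cx[a] * cy[b])) for (a, b), c in joint.items())
    return mi / math.log2(min(nx, ny))


def approx_mic(xs, ys):
    """Largest normalized grid MI over all allowed grid shapes (a lower-bound sketch)."""
    return max(normalized_grid_mi(xs, ys, nx, ny) for nx, ny in grid_shapes(len(xs)))


xs = [i / 100 - 0.9975 for i in range(200)]      # 200 points, slightly off-center
parabola = [x * x for x in xs]
rng = random.Random(0)
noise = [rng.random() for _ in xs]               # independent of xs

for name, ys in (("line", xs), ("parabola", parabola), ("noise", noise)):
    print(name, round(approx_mic(xs, ys), 3))
```

With 200 points the bound allows grids of up to 24 cells. Even this crude version should score the line and the parabola at or near 1 and the independent noise far lower, and because it depends only on ranks it also shares the invariance to order-preserving transformations.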

Explanation: How Do We Compute the Maximum Information for a Particular xy Grid?
- Heuristic-based dynamic programming
- Pseudo-code is in the supplemental materials
- Only an approximate solution, but it seems to work
- The authors acknowledge that a better algorithm should be found
- At the moment this is mostly irrelevant, as the authors have released a Java implementation of the algorithm

Application: Useful Properties of the MIC Statistic
With probability approaching 1 as sample size grows:
- MIC assigns scores that tend to 1 for all never-constant noiseless functional relationships
- MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including superpositions of noiseless functional relationships)
- MIC assigns scores that tend to 0 to statistically independent variables

Application: So what does the MIC mean?
- Uncorrected p-value tables are available to download for various sample sizes of data
- The null hypothesis is that the variables are statistically independent

In statistical significance testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. One often "rejects the null hypothesis" when the p-value is less than the significance level α (Greek alpha), which is often 0.05 or 0.1. When the null hypothesis is rejected, the result is said to be statistically significant.

Statistically significant simply means that a given result is unlikely to have occurred by chance assuming the null hypothesis is actually correct (i.e., no difference among groups, no effect of treatment, no relation among variables).

In statistics, the Bonferroni correction is a method used to counteract the problem of multiple comparisons. It was developed and introduced by Italian mathematician Carlo Emilio Bonferroni. The correction is based on the idea that if an experimenter is testing n dependent or independent hypotheses on a set of data, then one way of maintaining the familywise error rate is to test each individual hypothesis at a statistical significance level of 1/n times what it would be if only one hypothesis were tested. So, if it is desired that the significance level for the whole family of tests should be (at most) α, then the Bonferroni correction would be to test each of the individual tests at a significance level of α/n.

The Bonferroni correction is derived by observing Boole's inequality. If n tests are performed, each of them significant with probability β (where β is unknown), then the probability that at least one of them comes out significant is (by Boole's inequality) ≤ nβ. Our intention is for this probability to equal α, the significance level for the entire series of tests. By solving for β, we get β = α/n. This result does not require that the tests be independent.
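Applied to the earlier example of 50 variables, the correction is a one-line calculation. The sketch below uses hypothetical helper names and made-up p-values purely to show the arithmetic; it does not read the downloadable MIC p-value tables.

```python
from math import comb


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the familywise error rate at most alpha."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each uncorrected p-value that survives the Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]


n_pairs = comb(50, 2)                          # 1,225 variable pairs
print(bonferroni_threshold(0.05, n_pairs))     # per-pair level, about 4.1e-05

# Three illustrative (invented) p-values: the threshold is 0.05 / 3.
print(significant_after_bonferroni([0.001, 0.04, 0.2]))
```

In the three-test example only the first value clears the corrected threshold, even though the second would have passed an uncorrected 0.05 cut-off.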

Application: MINE = Maximal Information-based Nonparametric Exploration
- Hopefully this part is self-explanatory now
- Nonparametric vs. parametric could be a session unto itself. Here, we do not rely on assumptions that the data in question are drawn from a specific probability distribution (such as the normal distribution).
- MINE statistics leverage the extra information captured by the characteristic matrix to offer more insight into the relationships between variables.

Application:
- Maximum Asymmetry Score (MAS ≤ MIC): measures deviations from monotonicity
- Maximum Edge Value (MEV ≤ MIC): measures closeness to being a function (vertical line test)
- Minimum Cell Number (MCN): measures the complexity of an association in terms of the number of cells required

Application: MAS: monotonicity; MEV: vertical line test; MCN: complexity

Application: Usage
http://www.exploredata.net/Usage-instructions
- This takes too long … change it first
- R: `MINE("MLB2008.csv","one.pair",var1.id=2,var2.id=12)`
- Java: `java -jar MINE.jar MLB2008.csv -onePair 2 12`
- Seeks relationships between salary and home runs, 338 pairs

Application: Notes
- Does not work on textual data (must be numeric)
- Long execution times
- Outputs MIC and the other MINE statistics mentioned, not the characteristic matrix
- Output is a .csv file, with one row per variable pair

Application: Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License
You are free to copy, distribute and transmit the work, with the following conditions:
- Attribution: You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
- Noncommercial: You may not use this work for commercial purposes.
- No Derivative Works: You may not alter, transform, or build upon this work.

Application: Now What? Data Triage Pipeline
Complex data set → MIC → Ranked list of variable relationships to examine in more depth with the tool(s) of your choice
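The pipeline amounts to scoring every variable pair and sorting. A minimal sketch follows, assuming the data is held as a dictionary of numeric columns and that some pairwise score function is available. Absolute Pearson correlation is used here only as a stand-in so the example runs on its own; in the intended pipeline the score would be MIC from the MINE tool. The column names and values are toy data.

```python
import math
from itertools import combinations


def abs_pearson(xs, ys):
    """Placeholder score: |Pearson r|. Swap in a MIC implementation here."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return abs(sxy) / math.sqrt(sxx * syy)


def rank_pairs(columns, score=abs_pearson):
    """Score every unordered pair of columns and return them strongest first."""
    scored = [(score(columns[a], columns[b]), a, b)
              for a, b in combinations(sorted(columns), 2)]
    return sorted(scored, reverse=True)


# Toy numeric columns, invented for illustration.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [3, 1, 4, 1, 5],
}
for value, first, second in rank_pairs(columns):
    print(f"{first}-{second}: {value:.3f}")
```

The top of the ranked list is where closer inspection with other tools would begin. Note that the placeholder score divides by zero on a constant column, so such columns would need to be filtered out first.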

Lingering Questions
- Can this be extended to higher-dimensional relationships?
- Just how approximate is the current MIC algorithm?
- Who wants to develop an open source implementation?
- What other MINE statistics are waiting for discovery?
- Execution time: the algorithm is embarrassingly parallel and easily HADOOPified
- Many tests reported by the paper only introduced vertical noise into the data?
- There is also some question as to its power vs. Pearson and dcor (distance correlation) (http://www-stat.stanford.edu/~tibs/reshef/comment.pdf)

Power: Comment by N. Simon and R. Tibshirani
[Figure: power plotted against noise level]

The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is actually false (i.e. the probability of not committing a Type II error, or making a false negative decision). The power is in general a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, the chances of a Type II error occurring decrease. The probability of a Type II error occurring is referred to as the false negative rate (β). Therefore power is equal to 1 − β, which is also known as the sensitivity.
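Power as a function of noise level can be estimated by simulation. The sketch below is an assumption-laden illustration, not the code behind the Simon and Tibshirani comment: it uses |Pearson r| as the test statistic, a linear alternative y = x + noise, and a rejection threshold taken from simulated independent data. The function names, sample size, and trial count are arbitrary choices.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def estimate_power(noise_sd, n=30, trials=200, alpha=0.05, seed=0):
    """Fraction of dependent samples whose |r| beats the null (1 - alpha) quantile.

    noise_sd must be positive, otherwise the null samples are constant.
    """
    rng = random.Random(seed)

    def sample_statistic(dependent):
        xs = [rng.random() for _ in range(n)]
        ys = [(x if dependent else 0.0) + rng.gauss(0, noise_sd) for x in xs]
        return abs(pearson_r(xs, ys))

    # Null distribution: y is pure noise, independent of x.
    null_scores = sorted(sample_statistic(dependent=False) for _ in range(trials))
    threshold = null_scores[int((1 - alpha) * trials)]

    # Alternative: y = x + noise. Power is the rejection rate under the alternative.
    rejections = sum(sample_statistic(dependent=True) > threshold for _ in range(trials))
    return rejections / trials


for noise_sd in (0.1, 0.5, 1.0, 5.0):
    print(noise_sd, estimate_power(noise_sd))
```

Power should fall as the noise level rises, which is the shape of curve such comparisons plot. Comparing methods would mean replacing the test statistic with MIC or distance correlation and repeating the same procedure.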

Backup Slides