(Big) data analysis: Lecture 5, going into high dimensional datasets

Slides:



Advertisements
Similar presentations
Pattern Recognition and Machine Learning
Advertisements

Linear Regression.
Probabilistic analog of clustering: mixture models
Principal Component Analysis Based on L1-Norm Maximization Nojun Kwak IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
Dimension reduction (1)
Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.
Visual Recognition Tutorial
Principal Component Analysis CMPUT 466/551 Nilanjan Ray.
EE 290A: Generalized Principal Component Analysis Lecture 6: Iterative Methods for Mixture-Model Segmentation Sastry & Yang © Spring, 2011EE 290A, University.
Mutual Information Mathematical Biology Seminar
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Dimension reduction : PCA and Clustering Agnieszka S. Juncker Slides: Christopher Workman and Agnieszka S. Juncker Center for Biological Sequence Analysis.
Lecture 5: Learning models using EM
Dimensional reduction, PCA
Prénom Nom Document Analysis: Data Analysis and Clustering Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.
Lecture 4 Unsupervised Learning Clustering & Dimensionality Reduction
Independent Component Analysis (ICA) and Factor Analysis (FA)
Expectation Maximization Algorithm
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Dimension reduction : PCA and Clustering Christopher Workman Center for Biological Sequence Analysis DTU.
Unsupervised Learning
Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.
Microarray analysis 2 Golan Yona. 2) Analysis of co-expression Search for similarly expressed genes experiment1 experiment2 experiment3 ……….. Gene i:
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
Review Rong Jin. Comparison of Different Classification Models  The goal of all classifiers Predicating class label y for an input x Estimate p(y|x)
Principal Component Analysis. Philosophy of PCA Introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data.
Separate multivariate observations
Clustering Unsupervised learning Generating “classes”
Principal Component Analysis (PCA) for Clustering Gene Expression Data K. Y. Yeung and W. L. Ruzzo.
Summarized by Soo-Jin Kim
Probability of Error Feature vectors typically have dimensions greater than 50. Classification accuracy depends upon the dimensionality and the amount.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Example Clustered Transformations MAP Adaptation Resources: ECE 7000:
Machine Learning CUNY Graduate Center Lecture 3: Linear Regression.
CSC321: Neural Networks Lecture 12: Clustering Geoffrey Hinton.
Computational Intelligence: Methods and Applications Lecture 30 Neurofuzzy system FSM and covering algorithms. Włodzisław Duch Dept. of Informatics, UMK.
Data Reduction. 1.Overview 2.The Curse of Dimensionality 3.Data Sampling 4.Binning and Reduction of Cardinality.
es/by-sa/2.0/. Principal Component Analysis & Clustering Prof:Rui Alves Dept Ciencies Mediques.
ECE 8443 – Pattern Recognition LECTURE 10: HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS AND INDEPENDENT COMPONENT ANALYSIS Objectives: Generalization of.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: ML and Simple Regression Bias of the ML Estimate Variance of the ML Estimate.
ECE 8443 – Pattern Recognition LECTURE 08: DIMENSIONALITY, PRINCIPAL COMPONENTS ANALYSIS Objectives: Data Considerations Computational Complexity Overfitting.
Reduces time complexity: Less computation Reduces space complexity: Less parameters Simpler models are more robust on small datasets More interpretable;
Lecture 2: Statistical learning primer for biologists
Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Sampling and estimation Petter Mostad
Lecture 3: MLE, Bayes Learning, and Maximum Entropy
MathematicalMarketing Slide 5.1 OLS Chapter 5: Ordinary Least Square Regression We will be discussing  The Linear Regression Model  Estimation of the.
1 Objective To provide background material in support of topics in Digital Image Processing that are based on matrices and/or vectors. Review Matrices.
Dimension reduction (1) Overview PCA Factor Analysis Projection persuit ICA.
CSC2535: Computation in Neural Networks Lecture 7: Independent Components Analysis Geoffrey Hinton.
Part 3: Estimation of Parameters. Estimation of Parameters Most of the time, we have random samples but not the densities given. If the parametric form.
4.0 - Data Mining Sébastien Lemieux Elitra Canada Ltd.
PREDICT 422: Practical Machine Learning
LECTURE 09: BAYESIAN ESTIMATION (Cont.)
Exploring Microarray data
Classification of unlabeled data:
Latent Variables, Mixture Models and EM
Bayesian Models in Machine Learning
Probabilistic Models with Latent Variables
Principal Component Analysis
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Dimension reduction : PCA and Clustering
Feature space tansformation methods
Generally Discriminant Analysis
Text Categorization Berlin Chen 2003 Reference:
Test #1 Thursday September 20th
Marios Mattheakis and Pavlos Protopapas
Presentation transcript:

(Big) data analysis: Lecture 5, going into high dimensional datasets Amos Tanay Slides made available as lecture notes – not intended for presentation

Objects and features Practical problem involve a large number of objects, and a large number of features defining the objects Given a Matrix X={x_ij} i=1..N, j=1..M, we can study the object correlation matrix, or the feature correlation matrix: Co = {cor(e_i*, e_j*)} Cf = {cor_*i, cor_*j)} Correlations can be assesed using different approaches, for eaxmple, using the linear covariance matrix: XXT for the feature covariance matrix XTX – for the object covariance matrix If we normalize X such that mean(x_i*)=0, std(x_i*)=1, the covariance=correlation

Correlation matrices as an example for multiple testing We perform n*(n-1)/2 tests for correlation of objects We perform m*(m-1)/2 tests for correlation of objects On random data, the correlation p-value should be distributed uniformly in [0,1] Which mean that there are 0.05* n*(n-1)/2 pairs with p<0.05…or n*(n-1)/2000 with p<0.001 It is not good thing.

Controlling False Discovery Rate (FDR) One can control the multiple testing effect by eliminating tests (filtering bad objects, bad features, or focusing on a-priori interesting tests) More generally, we can compute the false discovery rate of a p-value threshold T as: q=Pr(pvalue on random < T)/Pr(pvalue < T) For example q = 0.1 would suggest that only 10% of the test we report (Pvalue<T) are expected to be random. It is common to search for argmaxT(q(T)<FDR), and using this threshold for reporting results with controlled FDR. Note that Pr(pvalue on random < T) is uniformly distributed if the test is well defined, but may require resampling in more complex situations The FDR concept is strongly based on the notion of tests independence. The covariance matrix have very strong dependencies , so FDR cannot be used on it directly. The FDR concept is orignally based on Benjamini and Hochberg (TAU)

Indirect correlation If cor(X,Y)>>0 and cor(X,Z)>>0, what can you say about cor(Y,Z)? (nothing in general – find examples) However in may cases cor(Y,Z) will be correlated in a way that is explicable by X Binormal(X,Y|cov1), Binormal(X,Z|cov2) -> Pr(Y,Z)? For binomial or multinomial data: conditional independence: Y indep Z | X (set Bayes net theory) How to approach Cor(Y,Z|X) = ? Cor(Y-X, Z-X) – bad idea! Cor(Y-X, Z)? Cor(Y,Z|X=X0), Cor(Y,Z| X in bin)

Hidden factors In many cases major parts of the correlation structure can be dependent on a hidden variable Examples: experimental batch, day in week, operator, doctor Cor(Xi, h) > 0 for each i….then: Cor(X,Y)>0 for all pairs… In that case, none of the observed variable can eliminate the dependencies: Cor(Y,Z|X)>0… (find an example)

Normalization: linear As said, common normalization approach is to center and scale the rows or columns of the data matrix (mean = 0, stdev = 1). If we want to factor out an effector variable X we can subtract the regression Y-f(X) but then be aware of the indirect correlation introduced Questions – what will happen if we normalize Y by: Ynorm = Y-sample(N(mean = f(X), stdev=(1-cor(X,Y)2))

Normalization: binning If X is a variable with strong correlation to many other variables (Y,Z,W..) we will in many case wish to stratify these variable given X without assuming any parameterization. This can be done by binning all objects to ranges of X values and observing the distributions of Y,Z,W within each bin. If bins are suffiiently large we can Subtract from each variable its mean or its linear regression with X within the bin Y’(i) = Y(i) – mean(Y(i)|X(i) in bin b(i)), or Y’(i) = Y(i) – f(X(i)|X(i) in bin b(i)) (where f is the linear fit in the bin)(

Quantile normalization Normalization can be done by shifting (Adding a constant) and scaling (multiplying by a constant). This is adequate for normallly distributed data. Less so for other type of data. A more aggressive scheme is to force the entire distribution of values to be fixed. This can be done using quantile normalization. Define the percentile(X(i), X) as the normalized rank of X(i) within all X values. Define the quantile(i, X) to be the value of the i percentile in X Quantile normaliztion of Y given X is defined as: Y’(i) = quantile(X, percentile(Y(i)) After this normalization Y and X have the same distribution, not just mean or variance. One can use quantile normalization within binning stratification as in the previous slide (instead of subtracting the mean). The approach can be very problematic with respect to outliers

Forcing low dimension: PCA XTX – the empirical covariance matrix (if columns in X are normalized to standard normal (mean = 0, var=1), scaled by n What is the normalized linear combination (in simple words: weighted average) of features that maximize variance? w – weights (such that Swi = 1) what is argmaxw(Si (xi**w)2))? or: w = argmax (wTXTXw) From linear algebra we know w is the eigenvector of the largest eigenvalue of the covariance matrix: (XXT)w1 = l1w1 We can project the matrix on the orthogonal complement of the first eigenvector: X2 = X – Xw1w1T Now we can find the largest eigenvector of X2, defining w2, and so on for w3,.. We are in fact transforming the space to a new basis of eigenvectors. The projection of each data vector on the PCs (the w’s) is called the PC scores

PCA is easy to compute, used for dimensionality reduction To perform PCA, we will usually use Singular Value Decomposition (SVD): X = USWT where U – n by n orthogonal, W – m by m orthogonal, S – diagonal Some nice properties of SVD are: a) XTX = WSUTUSWT =WS2WT b) SVD is computed efficiently and is usually numerically stable (see NRC 2.6) The PCA projection is computed by XW but this is just US We can use the top k PCs to get a data matrix of rank k (so effective dimension of k) that is maximally similar to the original matrix.

PCA can be used on “objects” or “features” Examples – two possible “orientations” for PCA of a gene expression matrix (tissues X genes) 1) PCA of single cell gene expression can done with tissues in the columns. Each gene will be projected on a reduced “virtual tissues” space. You can imagine such PCs to instruct Re tumor/normal behavior etc. 2) If genes are in columns, each tissue will be projected on dimension that combine genes in to some “consolidated expression signature”. You can imagine such signatures to instruct on gene expression modules (groups of genes that are working together)

Do not to use PCA as your major/first data analysis approach PCA is very popular – some reasons include: It is easy to compute, available in matlab/R I generate projections on 2D that are nice to look at It does not require parameters of any kind It have nice theoretical properties, especially when thinking about machine learning applications (e.g. feature selection for classificaiton) Nevertheless, In the context of data analysis, I am tempted to say that PCA is to be avoided in 99% of the situations. The PCs are linear combination of properties – this is seldom a good assumption The PCs are “hiding” the data and generate plots that cannot be directly attributed to the data (e.g. problematic features, outliers, various biases) The PCs are very difficult to interpret unless there are 2 strong behaviors in the data – the visual gain is lost when going beyond 2D It is recommended to approach high dimensional data with many measurements using first an exploratory approach (as we outline below), and to perform dimensionality reduction and high-dimension modeling in a way that is tailored to the problem

The other off-the-shelf high-dim analysis solution: hierarchical clustering Given a set of vectors Xi We are building a tree on the objects i=1..n, define by the relation Pa(i) (parent of i). Compute distances d(Xi, Xj). Distances can use pearson, spearman, eucledian distance, or essentially anything. Add i=1..n to the list of objects L. If using weights set w(i) = 1 for i=1..n Find the minimal distance pair i1,i2 Add a new object n+1, set Pa(i1) = n+1, Pa(i2) = n+1. Update d(n+1,j) for each j using one policy: 1. Max(d(j,i1), d(j,i2)) or Min(.,.) (maximum,minimum linkage) 2. WeightedMean = 1/(wi1+wi1)[wi1d(j,i1) + wi2d(j,i2)] 3. Set Xn = Xi1 + Xi2, d(n+1,j) = d(Xn+1,Xj) (possibly weighting X using w) Remove i1,i2 from the list L. If using weights, set wn+1 = wi1 + wi2 Continue iteratively (Adding objects n+2,… decreasing the size of L by 1 each time until convergence).

Hclust- caveats Hcluster is popular for visualizing the data in heat maps (ordering the objects or features) – but the tree is _not_ defining an order (we can inverse the order over nodes) Effective to detect big divisions in the data Of course - effective when hierarchical structure is expected. But when the data is not clearly partitioned/divided/hierarchical, hclust will be difficult to understand and work with May make sense for visualizing the correlation matrix more than the data matrix Can look at the original data which is a good thing (e.g. PCA issues), but smoothing can introduce visual artifacts that mislead novices Limited as a basis for further modeling

A more organized intro to clustering Clustering is a fuzzy task! No canonical goals or quality metric We partition a group of objects into clusters: Property 1: maximize the similarity of objects within clusters Property 2: maximize the separation between clusters Property 3: use the right number of clusters The tradeoffs between 1/2 are not well defined! (think of examples) In exploratory data analysis our goals may be a bit different: Property A: Capture all of the data by a relatively small number of well define models that are easy to understand/visualize Property B: the model defining each cluster is well determined Property C: Random data will not contain cluster significantly different from the background distribution of all data

Clustering: k-means Given a set of vectors Xi We optimize a clustering solution C1+C2+..+Ck = 1..n, Ci disjoint. Uniquely defining cl(i) = the cluster of element i K is fixed and we try to minimize the disperssion of our clusters, defined by: Centerj = 1/|Cj|Si in CjXi score= Si ||Xi – centercl(i)||2 (Other schemes possible) This is done by brute-force optimization: Incremental algorithm: (starting from any solution C) Examine all possible cluster substitutions cl(i) = j->l If one substitution improve score, apply it and continue with the new solution iteratively Iterative algorithm: (starting from any solution C) Compute centers. Update C such that cl(i) = the nearset center to Xi.

Initialization K means, as most algorithms we will learn in the coming lectures, is super-sensitive to initialization. A large number of random starting poitns can be attempted – selecting the highest scoring solution An effective heuristic for initialization is selecting seeds successively Given i seeds we rank all objects given their minimal distance to the existing seeds. We sample the i+1 seed from the objects with minimal distance within the top 20 percentiles (so preferring new seeds that are far away from existing seeds). Many other algorithmic variants are possible – replacing centers with medians, changing the distance or scoring scheme, adding new seeds for clusters that become empty, trying to merge or split clusters (change k) during the run.

Using K-mean for data exploration: key points 1. We select a number of clusters that is larger than what we think we need 2. This is ok as long as clusters are not becoming too small. The guildeine is that we approximate a complex distribution using pieces we can understand. 3. An appropriate size of a cluster depends on the statistics we collect – each feature is defining a 1D distribution within each cluster – this distribution should be appropriately determined (As we covered in the first lectures) 4. Observing several similar clusters (with slight overfitting in each) can be handled in post-process – do not try to optimize the number of clusters! 5. Frequently most of the data is “boring” (similar to the example of a 2D scatter-plot that is focused at 0). The goal function of most clustering algorithm will not allow appropriate modelling of strongly imbalanced clusters – it will be important to filter such cases in preprocess (but take note of their existence and frequency!) 6. Combinatorial effects can be problematic for cluster in general, k-means in particular. For example, 4 independt groups of features that parititon the objects into two distinct group each will generate 16 cluster in theory – with model for the connetion between clusters. This should be diagnosed and handled using more sophisticated models.

Back to probabilistic modeling – multivariate normal distribution Generalizing 2D correlation with a set of pairwise covariance – the covariance matrix |S| - det(S) Multi-normality is a very strong restriction on data – although the model have many – d2 - parameters and it prone to sever overfitting Restricting the covariance structure can be approached in many ways.

Probabilistic analog of clustering: mixture models We model uncertainty in the identification of objects to mixture components Each mixture component can be based on some appropriate model Independent discrete (binomial,multinormial) Independent normals Mutlivariate normals More complex dependency model (e.g. Tree, Bayesian network, Markov Random Field…) This is our first model with a hiden variable T->X or P(T,X) = P(T)*P(X|T) The probability of the data is derived by marginzation over the hidden variables

Expectation Maximization (EM) (Working with e.g. a mixture of independent or correlated normal distributions) Initialize the mixture component parameters and mixture coefficients: q1, r1 Compute the posterior probabilities: wij Pr(t =j|Xi) Set new mixture coefficients: r2i=(1/m) Sj wij Set new mixture components using weighted data wij, Xi Iterate until convergence (no significant change in qlast, rlast) Update mixture component using weight data – examples: Independent normal: E(X[k]) E(X2[k]) is determined by weighted statistics Full multivariate: all statistics (E(X[k]X[u]) is determined by weighted stats

Expectation-Maximization The most important property of the EM algorithm is that it is guaranteed to improve the likelihood of the data. This DOES NOT MEAN it will find a good solution. Bad initialization will converge to bad outcome. Here t is ranging over all possible assignment of the hidden mixture variable t per sample – so km possibilities EM maximization Relative entropy>=0 Dempster

Expectation-Maximization: Mixture model Q can be brought back to the form of a sum over independent terms, k of them depends only on the parameters of one mixture component, and one depends on the mixture coeeficient. Maximization of each of these component can then be shown to be done as in the non-mixture case (e.g. using the empirical covariance matrix for the multi-normal distribution).

KL-divergence Entropy (Shannon) Kullback-leibler divergence Not a metric!! KL