Data Reduction

1. Overview 2. The Curse of Dimensionality 3. Data Sampling 4. Binning and Reduction of Cardinality

Overview Data reduction techniques can be applied to obtain a reduced representation of the data set. The goal is to provide the mining process with a mechanism that produces the same (or almost the same) outcome when it is applied to the reduced data instead of the original data.

Overview Data reduction techniques are usually categorized into three main families:
– Dimensionality Reduction: reduces the number of attributes or random variables in the data set.
– Sample Numerosity Reduction: replaces the original data with an alternative, smaller data representation.
– Cardinality Reduction: applies transformations to obtain a reduced representation of the original data.

Data Reduction 1. Overview 2. The Curse of Dimensionality 3. Data Sampling 4. Binning and Reduction of Cardinality

The Curse of Dimensionality Dimensionality becomes a serious obstacle for the efficiency of most DM algorithms. It has been estimated that as the number of dimensions increases, the sample size needs to increase exponentially in order to obtain an effective estimate of multivariate densities.

The Curse of Dimensionality Several dimension reducers have been developed over the years.
Linear methods:
– Factor Analysis
– MultiDimensional Scaling (MDS)
– Principal Components Analysis (PCA)
Nonlinear methods:
– Locally Linear Embedding (LLE)
– ISOMAP

The Curse of Dimensionality Feature Selection methods are aimed at eliminating irrelevant and redundant features, reducing the number of variables in the model. The goal is to find an optimal subset B that solves the following optimization problem:
B = arg max { J(Z) : Z ⊆ A, |Z| = d }
where J(B) is the criterion function, A is the original set of features, Z is a candidate subset of features, and d is the minimum number of features to retain.
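As an illustrative sketch (not part of the original slides), the following greedy forward-selection procedure approximates the optimization above; the criterion J is a stand-in here (cross-validated k-NN accuracy on a toy data set), and d = 4 is an arbitrary choice.

```python
# Illustrative sketch: greedy forward selection approximating
# B = arg max { J(Z) : Z ⊆ A, |Z| = d }; J is a stand-in criterion (CV accuracy of k-NN).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)        # toy data set (assumption)
d = 4                                    # desired number of features (assumption)

def J(features):
    """Criterion function: mean 5-fold CV accuracy using only `features`."""
    return cross_val_score(KNeighborsClassifier(), X[:, features], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
while len(selected) < d:
    best = max(remaining, key=lambda f: J(selected + [f]))   # feature that most improves J
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected, "J(B) =", round(J(selected), 3))
```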

The Curse of Dimensionality: Principal Components Analysis PCA finds a set of linear transformations of the original variables that describe most of the variance using a smaller number of variables. It is usual to keep only the first few principal components, which may contain 95% or more of the variance of the original data set. PCA is useful when there are too many independent variables and they show high correlation.

The Curse of Dimensionality: Principal Components Analysis Procedure:
– Normalize the input data.
– Compute k orthonormal vectors (the principal components).
– Sort the PCs according to their strength, given by their eigenvalues.
– Reduce the data by removing the weaker components.
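A minimal NumPy sketch of this procedure, assuming a numeric data matrix X with one example per row; the 95% variance threshold mirrors the figure quoted above and the toy data are an assumption.

```python
import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Keep the fewest principal components that explain `var_threshold` of the variance."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)        # 1. normalize the input data
    cov = np.cov(Xc, rowvar=False)                   # covariance matrix of the normalized data
    eigvals, eigvecs = np.linalg.eigh(cov)           # 2. orthonormal vectors (principal components)
    order = np.argsort(eigvals)[::-1]                # 3. sort PCs by strength (eigenvalue)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(explained, var_threshold)) + 1
    return Xc @ eigvecs[:, :k]                       # 4. reduced data: weaker components removed

X = np.random.rand(200, 10)                          # toy data matrix (assumption)
print(pca_reduce(X).shape)                           # (200, k) with k <= 10
```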

The Curse of Dimensionality: Factor Analysis Similar to PCA in that it derives a smaller set of variables. It differs in that it does not seek transformations of the given attributes; instead, its goal is to discover hidden factors underlying the current variables.

The Curse of Dimensionality: Factor Analysis The factor analysis problem can be stated as follows: given the attributes A, along with their mean μ, we endeavor to find the set of factors F and the associated loadings L such that the following equation holds (up to an error term): A − μ = L F.
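A hedged example of fitting such a factor model with scikit-learn's FactorAnalysis; the toy matrix A and the choice of two factors are assumptions for illustration only.

```python
# Hedged example: fitting A - mu ≈ L F with scikit-learn's FactorAnalysis.
import numpy as np
from sklearn.decomposition import FactorAnalysis

A = np.random.rand(150, 6)                   # toy attribute matrix, one example per row (assumption)
fa = FactorAnalysis(n_components=2).fit(A)   # 2 hidden factors (illustrative choice)
F = fa.transform(A)                          # factor scores for each example
L = fa.components_.T                         # loadings: attributes x factors
print(F.shape, L.shape)                      # (150, 2) (6, 2)
```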

The Curse of Dimensionality: MultiDimensional Scaling A method for placing a set of points in a low-dimensional space such that a classical distance measure (such as the Euclidean distance) between them is as close as possible to the distance between the corresponding pairs of points in the original data. We compute the pairwise distances in the original space of the data and use this distance matrix as input; MDS then projects the points into a lower-dimensional space so as to preserve these distances.
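A sketch of classical (metric) MDS under the assumption of Euclidean distances; the toy data and the target dimension k = 2 are illustrative choices, not part of the original slides.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def classical_mds(X, k=2):
    """Embed the rows of X into k dimensions, approximately preserving pairwise Euclidean distances."""
    D = squareform(pdist(X))                     # distance matrix computed in the original space
    n = D.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * C @ (D ** 2) @ C                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:k]          # keep the k largest eigenvalues
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))

X = np.random.rand(100, 8)                       # toy high-dimensional data (assumption)
print(classical_mds(X, k=2).shape)               # (100, 2)
```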

The Curse of Dimensionality: Locally Linear Embedding LLE recovers global nonlinear structure from locally linear fits. Its main idea is that each local patch of the manifold can be approximated linearly and that, given enough data, each point can be written as a linear, weighted sum of its neighbors.
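A brief illustration using scikit-learn's LocallyLinearEmbedding on the classic Swiss-roll data set; the neighborhood size and target dimension are arbitrary choices for the example.

```python
# Illustration with scikit-learn's LocallyLinearEmbedding on the Swiss-roll data set.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000)           # 3-D points lying on a 2-D manifold
lle = LocallyLinearEmbedding(n_neighbors=12,     # neighborhood size (illustrative choice)
                             n_components=2)
X_2d = lle.fit_transform(X)                      # each point rewritten from its neighbors, then unfolded
print(X_2d.shape)                                # (1000, 2)
```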

Data Reduction 1. Overview 2. The Curse of Dimensionality 3. Data Sampling 4. Binning and Reduction of Cardinality

Data Sampling Data sampling is used:
– To reduce the number of instances submitted to the DM algorithm.
– To support the selection of only those cases in which the response is relatively homogeneous.
– To assist with the balance of the data and the occurrence of rare events.
– To divide a data set into three data sets (typically training, validation, and test) for the subsequent analysis of DM algorithms.

Data Sampling Forms of data sampling (T → data set, N → number of examples):
– Simple random sample without replacement (SRSWOR) of size s: created by drawing s of the N tuples from T (s < N), where the probability of drawing any tuple in T is 1/N.
– Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, except that each time a tuple is drawn from T, it is recorded and then replaced, so the same tuple may be drawn again.

Data Sampling Forms of data sampling (continued):
– Balanced sample: the sample is designed according to a target variable and is forced to have a certain composition following a predefined criterion.
– Cluster sample: if the tuples in T are grouped into G mutually disjoint clusters, then an SRS of s clusters can be obtained, where s < G.
– Stratified sample: if T is divided into mutually disjoint parts called strata, a stratified sample of T is generated by obtaining an SRS within each stratum.
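A hedged sketch of SRSWOR, SRSWR, and stratified sampling with NumPy/pandas; the toy data set T, the sample size s, and the stratification column "cls" are assumptions for illustration.

```python
# Hedged sketch of SRSWOR, SRSWR and stratified sampling; T, s and the strata column are toy assumptions.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
T = pd.DataFrame({"x": rng.normal(size=1000),
                  "cls": rng.choice(["a", "b"], size=1000, p=[0.9, 0.1])})
N, s = len(T), 100

srswor = T.iloc[rng.choice(N, size=s, replace=False)]   # without replacement: each tuple at most once
srswr = T.iloc[rng.choice(N, size=s, replace=True)]     # with replacement: a tuple may be drawn again
stratified = T.groupby("cls").sample(frac=s / N, random_state=0)   # SRS within each stratum ("cls")
print(len(srswor), len(srswr), len(stratified))
```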

Data Sampling: Data Condensation It emerges from the fact that naive sampling methods, such as random sampling or stratified sampling, are not suitable for real-world problems with noisy data, since the performance of the algorithms may change unpredictably and significantly. Condensation methods attempt to obtain a minimal subset that correctly classifies all the original examples.

Data Sampling: Data Squashing A data squashing method seeks to compress, or “squash”, the data in such a way that a statistical analysis carried out on the compressed data obtains the same outcome as the one obtained with the original data set. Schemes:
– Original Data Squashing (DS).
– Likelihood-based Data Squashing (LDS).

Data Sampling: Data Clustering Clustering algorithms partition the data examples into groups, or clusters, so that data samples within a cluster are “similar” to one another and dissimilar to data examples that belong to other clusters. The cluster representations of the data (for example, their centroids) can then be used instead of the actual data.
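An illustrative sketch of cluster-based reduction with k-means, where the centroids (plus per-cluster counts) stand in for the original examples; the toy data and k = 20 are assumptions.

```python
# Illustrative sketch: k-means centroids (plus counts) used as a reduced representation of the data.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(5000, 4)                                    # toy data set (assumption)
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)   # k = 20 is an arbitrary choice
reduced = km.cluster_centers_                                  # 20 representatives replace 5000 examples
weights = np.bincount(km.labels_)                              # how many originals each centroid stands for
print(reduced.shape, weights.sum())                            # (20, 4) 5000
```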

Data Reduction 1. Overview 2. The Curse of Dimensionality 3. Data Sampling 4. Binning and Reduction of Cardinality

Binning and Reduction of Cardinality Binning is the process of converting a continuous variable into a set of ranges (0–5,000; 5,001–10,000; 10,001–15,000, etc.). Cardinality reduction of nominal and ordinal variables is the process of combining two or more categories into one new category. Binning is the easiest and most direct way of discretization.
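A short pandas sketch of both operations; the column names, bin edges, and category groupings are illustrative assumptions, not from the original slides.

```python
# Illustrative pandas sketch; column names, bin edges and category groupings are assumptions.
import pandas as pd

df = pd.DataFrame({"income": [2300, 7200, 12100, 4800, 9900],
                   "country": ["ES", "FR", "PT", "US", "CA"]})

# Binning: convert the continuous variable into a set of ranges.
df["income_bin"] = pd.cut(df["income"], bins=[0, 5000, 10000, 15000],
                          labels=["0-5,000", "5,001-10,000", "10,001-15,000"])

# Cardinality reduction: combine two or more categories into one new category.
df["region"] = df["country"].map({"ES": "Europe", "FR": "Europe", "PT": "Europe",
                                  "US": "N. America", "CA": "N. America"})
print(df)
```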