Data Reduction Strategies


Data Reduction Strategies
- Why data reduction?
  - A database or data warehouse may store terabytes of data.
  - Complex data analysis or mining may take a very long time to run on the complete data set.
- Data reduction
  - Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
- Data reduction strategies
  - Aggregation
  - Sampling
  - Dimensionality reduction
  - Feature subset selection
  - Feature creation
  - Discretization (already covered separately) and binarization
  - Attribute transformation

Data Reduction: Aggregation
- Combining two or more attributes (or objects) into a single attribute (or object)
- Purpose
  - Data reduction: reduce the number of attributes or objects
  - Change of scale: cities aggregated into regions, states, countries, etc.
  - More "stable" data: aggregated data tends to have less variability
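The three purposes of aggregation listed above can be illustrated with a small sketch. The data here is synthetic and the region and city names are made up for illustration; the point is only that grouping readings by city or region shrinks the number of objects and that the aggregated averages vary less than the raw readings.

```python
import random
import statistics
from collections import defaultdict

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic readings (an assumption, not real data): (region, city, value)
cities = {"north": ["A", "B"], "south": ["C", "D", "E"]}
records = [
    (region, city, random.gauss(100, 20))
    for region, names in cities.items()
    for city in names
    for _ in range(30)  # 30 readings per city
]

def aggregate_mean(rows, key_index):
    """Group rows by the field at key_index and return {key: mean value}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: statistics.mean(values) for key, values in groups.items()}

# Change of scale: readings -> cities -> regions
city_means = aggregate_mean(records, 1)    # 150 objects become 5
region_means = aggregate_mean(records, 0)  # 150 objects become 2

# "More stable" data: the city averages spread out less than the raw readings
raw_spread = statistics.pstdev([row[2] for row in records])
city_spread = statistics.pstdev(list(city_means.values()))

print(len(records), len(city_means), len(region_means))
print(f"raw std: {raw_spread:.1f}, std of city means: {city_spread:.1f}")
```

The mean is only one possible aggregate; sums, counts, or medians fit the same grouping pattern.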

Data Reduction: Aggregation
[Figure: Variation of Precipitation in Australia. Panels: Standard Deviation of Average Monthly Precipitation; Standard Deviation of Average Yearly Precipitation.]

Data Reduction: Sampling
- Sampling is the main technique employed for data selection.
  - It is often used for both the preliminary investigation of the data and the final data analysis.
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming.
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time-consuming.

Data Reduction: Types of Sampling
- Simple random sampling
  - There is an equal probability of selecting any particular item.
- Sampling without replacement
  - As each item is selected, it is removed from the population.
- Sampling with replacement
  - Objects are not removed from the population as they are selected for the sample.
  - The same object can be picked more than once.
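A minimal sketch of the two variants of simple random sampling, using Python's standard `random` module. The population of ten integers and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(42)          # seeded generator so runs are repeatable
population = list(range(1, 11))  # toy population: the integers 1..10

# Without replacement: each selected item leaves the pool, so no repeats.
without_replacement = rng.sample(population, k=5)

# With replacement: items stay in the pool, so the same one can recur.
with_replacement = rng.choices(population, k=5)

# With replacement, the sample may even be larger than the population,
# in which case repeats are unavoidable.
oversample = rng.choices(population, k=20)

print(without_replacement)
print(with_replacement)
print(len(set(oversample)), "distinct items in 20 draws")
```

Both calls give every item the same chance on each draw, which is what makes them simple random samples.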

Sampling Method
- Sampling allows a mining algorithm to run with complexity that is potentially sublinear in the size of the data.
- Choose a representative subset of the data.
  - Simple random sampling may perform very poorly in the presence of skew.
- Develop adaptive sampling methods.
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database.
  - Used in conjunction with skewed data.
- Sampling may not reduce database I/O, because data is read a page at a time.
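Stratified sampling as described above can be sketched as follows. The function name `stratified_sample`, the class labels, and the 95/5 skew are illustrative assumptions; the sketch draws the same fraction from each class so that the sample approximates the class percentages of the full data set.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(rows, label_of, fraction, rng):
    """Draw about `fraction` of each stratum, without replacement."""
    strata = defaultdict(list)
    for row in rows:
        strata[label_of(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row so that a rare class never disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(7)

# Skewed toy data: 95% "common" rows and 5% "rare" rows.
data = [("common", i) for i in range(950)] + [("rare", i) for i in range(50)]

sample = stratified_sample(data, lambda row: row[0], 0.1, rng)
counts = Counter(label for label, _ in sample)
print(counts)
```

A simple random sample of the same size could by chance contain very few "rare" rows; sampling within each stratum keeps the 95/5 split.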

Sampling
[Figure: Raw Data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement).]

Sampling
[Figure: Raw Data and the resulting Cluster/Stratified Sample.]

Data Reduction: Feature Subset Selection
- Another way to reduce the dimensionality of data
- Redundant features
  - Duplicate much or all of the information contained in one or more other attributes
  - Example: the purchase price of a product and the amount of sales tax paid
- Irrelevant features
  - Contain no information that is useful for the data mining task at hand
  - Example: a student's ID is often irrelevant to the task of predicting the student's GPA
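One simple way to spot redundant features, sketched here as an assumption rather than as the method the slides prescribe, is to drop any feature that is almost perfectly correlated with one already kept. The tax rate, the sample values, the `quantity` column, and the 0.95 threshold are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def drop_redundant(columns, threshold=0.95):
    """Keep a feature only if it is not strongly correlated with a kept one."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

TAX_RATE = 0.08  # hypothetical flat sales tax rate
prices = [10.0, 25.0, 40.0, 80.0, 120.0]
columns = {
    "price": prices,
    "sales_tax": [p * TAX_RATE for p in prices],  # fully determined by price
    "quantity": [3, 1, 4, 1, 5],
}

kept = drop_redundant(columns)
print(kept)
```

Correlation only detects linear redundancy between pairs of features. Finding irrelevant features, such as a student ID, additionally requires looking at the target of the mining task, which this sketch does not do.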