Exploratory Data Mining and Data Preparation

Data Mining, Fall 2003

The Data Mining Process
- Business understanding
- Data evaluation
- Data preparation
- Modeling
- Evaluation
- Deployment

Exploratory Data Mining
A preliminary process:
- Data summaries: attribute means, attribute variation, attribute relationships
- Visualization

Summary Statistics and Visualization
Select an attribute:
- Summary statistics reveal possible problems: many missing values (16%), no examples of one value
- Visualization: the attribute appears to be a good predictor of the class

Exploratory DM Process
For each attribute:
- Look at the data summaries; identify potential problems and decide if an action needs to be taken (may require collecting more data)
- Visualize the distribution; identify potential problems (e.g., one dominant attribute value, even distribution, etc.)
- Evaluate the usefulness of the attributes

Weka Filters
Weka has many filters that are helpful in preprocessing the data:
- Attribute filters: add, remove, or transform attributes
- Instance filters: add, remove, or transform instances
Process: choose a filter from the drop-down menu, edit its parameters (if any), and apply it.

Data Preprocessing
- Data cleaning: missing values, noisy or inconsistent data
- Data integration/transformation
- Data reduction: dimensionality reduction, data compression, numerosity reduction
- Discretization

Data Cleaning
- Missing values: Weka reports the % of missing values; the ReplaceMissingValues filter can fill them in
- Noisy data: due to uncertainty or errors; Weka reports unique values; useful filters include RemoveMisclassified and MergeTwoValues
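Weka's ReplaceMissingValues filter substitutes the attribute mean (for numeric attributes) or mode (for nominal ones). A minimal Python sketch of the same idea, assuming a made-up list-of-dicts dataset in which missing values are stored as None:

```python
from statistics import mean, mode

def replace_missing(rows, attribute, numeric=True):
    """Fill missing values (None) of one attribute with the attribute's
    mean (numeric) or mode (nominal), like Weka's ReplaceMissingValues."""
    observed = [r[attribute] for r in rows if r[attribute] is not None]
    fill = mean(observed) if numeric else mode(observed)
    for r in rows:
        if r[attribute] is None:
            r[attribute] = fill
    return rows

# Made-up data: one instance is missing its "age" value
data = [{"age": 25}, {"age": None}, {"age": 35}]
replace_missing(data, "age")   # the missing age becomes 30
```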

Data Transformation
Why transform data?
- Combining attributes: for example, the ratio of two attributes might be more useful than keeping them separate
- Normalizing data: having attributes on the same approximate scale helps many data mining algorithms (hence better models)
- Simplifying data: for example, working with discrete data is often more intuitive and helps the algorithms (hence better models)

Weka Filters
The data transformation filters in Weka include:
- Add
- AddExpression
- MakeIndicator
- NumericTransform
- Normalize
- Standardize
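Roughly speaking, Normalize rescales each numeric attribute to [0, 1] and Standardize rescales it to zero mean and unit standard deviation. A small Python sketch of the two transformations (the attribute values below are made up):

```python
def normalize(values):
    """Min-max scaling of one attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale one attribute to zero mean and unit standard deviation."""
    m = sum(values) / len(values)
    sd = (sum((v - m) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - m) / sd for v in values]

temps = [64, 72, 85]              # made-up attribute values
print(normalize(temps))           # [0.0, 0.38..., 1.0]
print(standardize(temps))         # mean 0, standard deviation 1
```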

Discretization
Discretization reduces the number of values for a continuous attribute.
Why?
- Some methods can only use nominal data, e.g., the ID3 and Apriori algorithms in Weka
- Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)

Unsupervised Discretization
Unsupervised: does not account for classes
- Equal-interval binning
- Equal-frequency binning
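A Python sketch of the two unsupervised binning schemes (not the Weka implementation), applied to the temperature values reused in the supervised example on the next slide:

```python
def equal_interval_bins(values, k):
    """Equal-interval (equal-width) binning: split the value range into
    k intervals of equal width and assign each value to its interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Equal-frequency binning: sort the values and put roughly the same
    number of them into each bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    per_bin = len(values) / k
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
print(equal_interval_bins(temps, 3))    # three width-7 intervals
print(equal_frequency_bins(temps, 3))   # four values per bin
```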

Supervised Discretization
- Takes the classification into account
- Uses "entropy" to measure information gain
- Goal: discretize into "pure" intervals
- Usually there is no way to get completely pure intervals
Example (temperature attribute): values 64 65 68 69 70 71 72 75 80 81 83 85 with classes Yes No Yes Yes Yes No No Yes No Yes Yes No Yes Yes; candidate intervals A-F contain, e.g., 1 yes, 8 yes & 5 no, 9 yes & 4 no, 1 no.
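A sketch of the entropy score that supervised (entropy-based) discretization uses to pick split points. This is not Weka's implementation, and the class labels are trimmed to twelve values to match the twelve distinct temperatures shown above:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Class entropy in bits: -sum of p * log2(p) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain_of_split(values, labels, cut):
    """Information gain of splitting the attribute at 'cut': parent entropy
    minus the size-weighted entropy of the two resulting intervals."""
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    n = len(labels)
    rest = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - rest

temps  = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no",
          "no", "yes", "no", "yes", "yes", "no"]
cuts = [(a + b) / 2 for a, b in zip(temps, temps[1:])]   # candidate split points
best = max(cuts, key=lambda c: info_gain_of_split(temps, labels, c))
print(best, info_gain_of_split(temps, labels, best))
```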

Error-Based Discretization
- Count the number of misclassifications: the majority class in an interval determines the prediction, and instances that differ from it are counted as errors
- Must restrict the number of intervals
- Complexity: brute force takes exponential time; dynamic programming takes linear time
- Downside: cannot generate adjacent intervals with the same label
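For contrast, a tiny sketch of the error-based criterion: given a candidate set of intervals, the majority class in each interval is the prediction and everything else counts as an error (the intervals below are hypothetical):

```python
from collections import Counter

def interval_errors(labels_by_interval):
    """Error-based score: the majority class of each interval is the
    prediction, so every instance with a different label is an error."""
    errors = 0
    for labels in labels_by_interval:
        majority_count = Counter(labels).most_common(1)[0][1]
        errors += len(labels) - majority_count
    return errors

# Hypothetical 3-interval discretization of a yes/no attribute
print(interval_errors([["yes"],
                       ["yes", "no", "yes", "yes"],
                       ["no", "no", "yes"]]))   # 2 errors
```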

Weka Filter

Attribute Selection
- Before inducing a model we almost always do input engineering
- The most useful part of this is attribute selection (also called feature selection): select relevant attributes, remove redundant and/or irrelevant attributes
- Why?

Reasons for Attribute Selection
- Simpler model: more transparent, easier to interpret
- Faster model induction (what about overall time?)
- Structural knowledge: knowing which attributes are important may be inherently important to the application
- What about the accuracy?

Attribute Selection Methods
- What is evaluated: individual attributes, or subsets of attributes
- Evaluation method: independent of the learning algorithm (filters), or using the learning algorithm itself (wrappers)

Filters
Results in either:
- A ranked list of attributes: typical when each attribute is evaluated individually; must select how many to keep
- A selected subset of attributes: forward selection, best-first search, or random search such as a genetic algorithm

Filter Evaluation Examples
- Information gain
- Gain ratio
- Relief
- Correlation: high correlation with the class attribute, low correlation with other attributes
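A small sketch of filter-style evaluation: each nominal attribute is scored independently by information gain with respect to the class, and a ranked list is returned. The toy dataset is made up:

```python
from math import log2
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(column, labels):
    """Information gain of one nominal attribute with respect to the class."""
    n = len(labels)
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    rest = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - rest

def rank_attributes(rows, attributes, labels):
    """Filter approach: score every attribute independently, return a ranked list."""
    scores = {a: info_gain([r[a] for r in rows], labels) for a in attributes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up toy data
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "yes"}]
labels = ["yes", "no", "yes", "no"]
print(rank_attributes(rows, ["outlook", "windy"], labels))   # outlook ranks first
```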

Wrappers
- "Wrap around" the learning algorithm, so subsets of attributes must always be evaluated
- Return the best subset of attributes
- Applied separately for each learning algorithm
- Use the same search methods as before
The search loop: select a subset of attributes, induce the learning algorithm on this subset, evaluate the resulting model (e.g., its accuracy), and stop if it is good enough, otherwise repeat.
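A sketch of a wrapper as greedy forward selection. The evaluate function is a placeholder the caller supplies; in practice it would induce the chosen learning algorithm on the given attribute subset and return, say, a cross-validated accuracy:

```python
def wrapper_forward_selection(attributes, evaluate):
    """Greedy forward selection wrapped around a learner: 'evaluate' induces
    the learning algorithm on a given attribute subset and returns an
    accuracy estimate (e.g., from cross-validation)."""
    selected, best_score = [], evaluate([])
    improved = True
    while improved:
        improved = False
        for a in set(attributes) - set(selected):
            score = evaluate(selected + [a])
            if score > best_score:
                best_score, best_attr, improved = score, a, True
        if improved:
            selected.append(best_attr)
    return selected, best_score

# Hypothetical evaluate: pretend a1 and a3 together carry all the signal
scores = {(): 0.5, ("a1",): 0.7, ("a3",): 0.72, ("a1", "a3"): 0.9}
evaluate = lambda subset: scores.get(tuple(sorted(subset)), 0.6)
print(wrapper_forward_selection(["a1", "a2", "a3"], evaluate))   # (['a3', 'a1'], 0.9)
```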

How does it help?
- Naïve Bayes
- Instance-based learning
- Decision tree induction

Scalability
- Data mining mostly uses well-developed techniques (AI, statistics, optimization)
- Key difference: very large databases
- How do we deal with scalability problems?
- Scalability: the capability of handling an increased load in a way that does not affect performance adversely

Massive Datasets
- Very large data sets (millions+ of instances, hundreds+ of attributes)
- Scalability in space and time:
  - The data set cannot be kept in memory (e.g., process one instance at a time)
  - Learning time is very long
- How does the time depend on the input? Number of attributes, number of instances

Two Approaches
- Increased computational power: only works if the algorithms can be sped up, and the computing power must be available
- Adapt the algorithms: automatically scale down the problem so that it is always of approximately the same difficulty

Computational Complexity
We want to design algorithms with good computational complexity.
[Chart: running time versus number of instances (or number of attributes) for logarithmic, linear, polynomial, and exponential complexity]

Example: Big-Oh Notation
Define n = number of instances and m = number of attributes.
Going once through all the instances has complexity O(n).
Examples:
- Polynomial complexity: O(mn^2)
- Linear complexity: O(m + n)
- Exponential complexity: O(2^n)

Classification
- If no polynomial-time algorithm is known to solve a problem, it is (informally) called NP-complete
- Finding the optimal decision tree is an example of an NP-complete problem
- However, ID3 and C4.5 are polynomial-time algorithms: heuristic algorithms that construct solutions to a difficult problem
- They are "efficient" from a computational complexity standpoint but still have a scalability problem

Decision Tree Algorithms
- Traditional decision tree algorithms assume the training set can be kept in memory; swapping in and out of main and cache memory is expensive
- Solution: partition the data into subsets, build a classifier on each subset, and combine the classifiers
- Not as accurate as a single classifier built on all the data
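A sketch of the partition-and-combine strategy using scikit-learn decision trees and majority voting; this illustrates the idea, not any particular scalable tree algorithm, and integer class labels are assumed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def partitioned_ensemble(X, y, n_parts=4):
    """Split the training data into subsets, build one decision tree per
    subset, and combine the trees by majority vote (integer class labels
    are assumed)."""
    parts = zip(np.array_split(X, n_parts), np.array_split(y, n_parts))
    trees = [DecisionTreeClassifier().fit(Xp, yp) for Xp, yp in parts]

    def predict(X_new):
        votes = np.stack([t.predict(X_new) for t in trees])   # n_trees x n_samples
        return np.array([np.bincount(votes[:, j]).argmax()
                         for j in range(votes.shape[1])])
    return predict
```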

Other Classification Examples
- Instance-based learning: goes through the instances one at a time and compares each with the new instance; polynomial complexity, O(mn); stores a very large model (the training set); response time may be slow, however
- Naïve Bayes: polynomial complexity

Data Reduction
Another way is to reduce the size of the data before applying a learning algorithm (preprocessing).
Some strategies:
- Dimensionality reduction
- Data compression
- Numerosity reduction

Dimensionality Reduction
- Remove irrelevant, weakly relevant, and redundant attributes
- Attribute selection: many methods available, e.g., forward selection, backward elimination, genetic algorithm search
- Often a much smaller problem
- Often little degradation in predictive performance, or even better performance

Data Compression
- Also aims for dimensionality reduction
- Transforms the data into a smaller space
Principal Component Analysis:
- Normalize the data
- Compute c orthonormal vectors, the principal components, that provide a basis for the normalized data
- Sort them according to decreasing significance
- Eliminate the weaker components

PCA: Example
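As a concrete stand-in for the example, here is a minimal numpy sketch that follows the steps listed on the previous slide (normalize, compute orthonormal components, sort by significance, drop the weak ones) on made-up 2-D data:

```python
import numpy as np

def pca(X, c):
    """Normalize the data, compute orthonormal principal components,
    sort them by decreasing significance, keep the top c, and project."""
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)    # normalize the data
    cov = np.cov(Xn, rowvar=False)               # covariance of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)       # orthonormal components
    order = np.argsort(eigvals)[::-1]            # decreasing significance
    return Xn @ eigvecs[:, order[:c]]            # eliminate the weaker components

# Made-up, strongly correlated 2-D data reduced to one dimension
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, 1))
```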

Numerosity Reduction
Replace the data with an alternative, smaller representation.
Example: a histogram. The 52 values
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30
can be replaced by their counts in the bins 1-10, 11-20, and 21-30.
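The bin counts for the slide's data can be computed directly; a short sketch:

```python
from collections import Counter

data = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
        15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
        20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Replace the 52 raw values by three bin counts
bins = Counter((v - 1) // 10 for v in data)        # 0: 1-10, 1: 11-20, 2: 21-30
print({"1-10": bins[0], "11-20": bins[1], "21-30": bins[2]})   # 13, 25, 14
```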

Other Numerosity Reduction
- Clustering: data objects (instances) that are in the same cluster can be treated as the same instance; must use a scalable clustering algorithm
- Sampling: randomly select a subset of the instances to be used

Sampling Techniques
Different kinds of samples:
- Sample without replacement
- Sample with replacement
- Cluster sample
- Stratified sample
The complexity of sampling is actually sublinear: it is O(s), where s is the number of samples and s << n.
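A sketch of the basic sampling variants in plain Python; stratified sampling is simplified to drawing the same fraction from each class:

```python
import random
from collections import defaultdict

def sample_without_replacement(rows, s):
    return random.sample(rows, s)

def sample_with_replacement(rows, s):
    return [random.choice(rows) for _ in range(s)]

def stratified_sample(rows, labels, fraction):
    """Keep the class proportions: draw the same fraction from every class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    sample = []
    for group in by_class.values():
        sample += random.sample(group, max(1, int(fraction * len(group))))
    return sample

random.seed(42)   # same seed -> same sample, as with Weka's Resample filter
print(sample_without_replacement(list(range(100)), 5))
```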

Weka Filters
- PrincipalComponents is under the Attribute Selection tab
- We have already talked about filters to discretize the data
- The Resample filter randomly samples a given percentage of the data; if you specify the same seed, you get the same sample again