Data Mining: extracting knowledge from a large amount of data

Presentation transcript:

Data Mining: extracting knowledge from a large amount of data. The information pyramid: raw data at the bottom is refined into useful information at the top (sometimes just 1 bit: Y/N); the greater the data volume, the smaller the distilled information. Functionalities: feature selection, association rule mining, classification & prediction, cluster analysis, and outlier analysis. (Of these, the most important may be CLASSIFICATION!)

Classification: predicting the class of an unclassified data sample based on some history (training data).

Training data:

    Feature1  Feature2  Feature3  Class
    a1        b1        c1
    a2        b2        c2        A
    a3        b3        c3        B

Unclassified sample: (a, b, c) -> Classifier -> predicted class of that sample.

Eager classifier: builds a classifier model in advance, e.g. a decision tree or a trained neural network.
Lazy classifier: uses the training data each time a sample must be classified, e.g. k-nearest neighbor. (A minimal sketch of the two styles follows.)
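The eager/lazy distinction is easy to see in code. Below is a minimal, hypothetical sketch (not from the presentation): the class-centroid "eager" model and the 1-NN "lazy" scan are illustrative stand-ins, and all names and data are made up.

    # Minimal sketch: eager vs. lazy classification (illustrative only).
    import math

    train = [((1.0, 2.0, 1.5), "A"),
             ((1.2, 1.9, 1.4), "A"),
             ((5.0, 6.1, 5.5), "B")]

    # Eager: do the work up front, producing a compact model.
    # Here the "model" is just the centroid of each class.
    def build_model(data):
        sums, counts = {}, {}
        for x, label in data:
            counts[label] = counts.get(label, 0) + 1
            s = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                s[i] += v
        return {c: tuple(v / counts[c] for v in s) for c, s in sums.items()}

    def eager_classify(model, x):
        return min(model, key=lambda c: math.dist(model[c], x))

    # Lazy: no model; scan the training data at classification time (1-NN).
    def lazy_classify(data, x):
        return min(data, key=lambda p: math.dist(p[0], x))[1]

    model = build_model(train)                       # training happens once
    print(eager_classify(model, (1.1, 2.0, 1.5)))    # A
    print(lazy_classify(train, (1.1, 2.0, 1.5)))     # A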

k-Nearest Neighbor (kNN) Classification and Closed-k-Nearest Neighbor (CkNN) Classification:
1) Select a suitable value for k.
2) Determine a suitable distance (or similarity) notion.
3) Find the [closed] k-nearest-neighbor set of the unclassified sample.
4) Find the plurality class in that neighbor set.
5) Assign the plurality class as the predicted class of the sample.

Example: T is the unclassified sample; use Euclidean distance with k = 3. Move out from T in expanding circles (that's 1! that's 2! ... that's more than 3!) until at least 3 neighbors are enclosed. If the circle through the 3rd-nearest neighbor passes through several equidistant points, kNN arbitrarily selects one point on that boundary as the 3rd nearest neighbor, whereas CkNN includes all points on the boundary.

CkNN yields higher classification accuracy than traditional kNN. At what additional cost? Actually, at negative cost: it is both faster and more accurate!! (See the sketch below.)
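The following is a minimal sketch of Closed-kNN, assuming a brute-force distance scan rather than the P-tree computation the presentation actually describes; the function name and data are hypothetical.

    # Closed-kNN sketch (brute force; the real system computes this via P-trees).
    import math
    from collections import Counter

    def cknn_classify(train, x, k=3):
        """train: list of (point, label) pairs; x: the unclassified point.
        Closed-kNN keeps *every* point tied with the k-th nearest neighbor
        instead of arbitrarily breaking the tie as plain kNN does."""
        ranked = sorted(train, key=lambda p: math.dist(p[0], x))
        kth_dist = math.dist(ranked[k - 1][0], x)
        # Closed neighbor set: all points no farther than the k-th distance.
        closed = [label for pt, label in ranked if math.dist(pt, x) <= kth_dist]
        # Plurality vote over the (possibly larger-than-k) closed set.
        return Counter(closed).most_common(1)[0][0]

    train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "B"),
             ((0, -1), "B"), ((3, 3), "B")]
    print(cknn_classify(train, (0.1, 0.1), k=3))   # A

Note that the closed set may contain more than k points; voting over the whole set removes the arbitrariness of tie-breaking.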

Performance: experiments were run on two sets of aerial remotely sensed images of the Best Management Plot (BMP) of the Oakes Irrigation Test Area (OITA), ND. The data contain 6 bands: red, green, and blue reflectance values, soil moisture, nitrate, and yield (the class label). Band values range from 0 to 255 (8 bits), and the yield values are grouped into 8 classes (levels).
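The vertical (P-tree) methods in the charts below operate on bit slices of these 8-bit bands rather than on horizontal records. As a hedged illustration only (real P-trees additionally compress each bit plane into a quadrant tree), here is the vertical bit-plane decomposition of one band:

    # Illustrative only: vertical decomposition of an 8-bit band into bit planes.
    def bit_planes(band, width=8):
        """band: pixel values (0..255) of one attribute, e.g. the Red band.
        Returns `width` lists; planes[i][j] is bit i (MSB first) of pixel j."""
        return [[(v >> (width - 1 - i)) & 1 for v in band]
                for i in range(width)]

    band = [200, 15, 255, 0]        # hypothetical Red-band pixel values
    for i, plane in enumerate(bit_planes(band)):
        print(f"bit {7 - i}: {plane}")
    # Vertical methods answer neighbor queries by ANDing and counting these
    # bit columns instead of scanning the rows one record at a time.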

Performance – Accuracy, 1997 dataset. [Chart: accuracy (%), 40 to 80, vs. training set size in pixels, 256 to 262,144.] The three horizontal methods (kNN-Manhattan, kNN-Euclidean, kNN-Max) sit in the middle; of the three vertical methods, P-tree Closed-kNN-max and Closed-kNN using HOBbit distance are the two most accurate, while plain kNN using HOBbit distance is the least accurate.

Performance – Accuracy, 1998 dataset. [Chart: accuracy (%), 20 to 65, vs. training set size in pixels, 256 to 262,144.] Same pattern: the three horizontal methods (kNN-Manhattan, kNN-Euclidean, kNN-Max) in the middle; the two vertical closed methods (P-tree Closed-kNN-max, Closed-kNN using HOBbit distance) most accurate; kNN using HOBbit distance least accurate.

Performance – Speed, 1997 dataset. Hint: NEVER use a log scale to show a WIN!!! [Chart, both axes logarithmic: per-sample classification time in seconds, 0.0001 to 1, vs. training set size in pixels, 256 to 262,144.] The three horizontal methods (kNN-Manhattan, kNN-Euclidean, kNN-Max) sit in the middle; of the three vertical methods, P-tree Closed-kNN-max and Closed-kNN using HOBbit distance (the same two that were most accurate) are the fastest, and plain kNN using HOBbit distance is the slowest.

Performance – Speed, 1998 dataset. A win-win situation!! (almost never happens): P-tree CkNN and CkNN-H are both more accurate and much faster. kNN-H is not recommended: it is slower and less accurate, because it doesn't use closed neighbor sets and it requires an extra step to get rid of ties (why do it?). Horizontal kNNs are not recommended because they are less accurate and slower! [Chart, both axes logarithmic: per-sample classification time in seconds, 0.0001 to 1, vs. training set size in pixels, 256 to 262,144; same method ordering as for 1997.]
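The HOBbit distance referenced throughout is bit-based, which is what makes it cheap to evaluate over P-tree bit slices. A hedged sketch, assuming one common formulation from the P-tree literature (the number of right-shifts needed before two values agree, taking the maximum over bands for multi-band points):

    # Hedged sketch of the HOBbit distance used by the vertical methods above.
    def hobbit(a: int, b: int) -> int:
        # 0 if equal; otherwise 1 + position of the highest-order differing bit,
        # i.e., how far both values must be right-shifted until they match.
        return (a ^ b).bit_length()

    def hobbit_dist(x, y):
        # Distance between multi-band points: max over the per-band distances.
        return max(hobbit(a, b) for a, b in zip(x, y))

    print(hobbit(12, 12))                      # 0: equal values
    print(hobbit(12, 13))                      # 1: differ in the lowest bit only
    print(hobbit(128, 0))                      # 8: differ in the top bit of 8
    print(hobbit_dist((200, 15), (201, 14)))   # 1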

Association for Computing Machinery (ACM) KDD-Cup-02: NDSU Team