An Adaptive Nearest Neighbor Classification Algorithm for Data Streams

Yan-Nei Law & Carlo Zaniolo, University of California, Los Angeles. PKDD, Porto, 2005.

Outline: Related Work, ANNCAD, Properties of ANNCAD, Conclusion

Classifying Data Streams
Problem statement: we seek an algorithm for classifying data streams with numerical attributes; it will also work for totally ordered domains.
Desiderata:
Fast update speed for newly arriving records.
Only a single pass over the data is allowed, so incremental algorithms are needed.
Coping with concept changes.
Classical mining algorithms were not designed for data streams and need to be replaced or modified.
(Slide notes: Fast update speed? Incremental algorithm? Numeric or ordered?)

Classifying Data Streams: Related Work
Hoeffding trees (VFDT and CVFDT): build a decision tree incrementally.
They require a large number of examples to obtain a classifier with fair performance.
Performance is unsatisfactory when the training set is small.
Ensembles: combine base models by a voting technique.
Suitable for coping with concept drift.
They fail to provide a simple model and understanding of the problem.
(Slide note: What besides decision trees?)

State of the Art: Nearest Neighbor Classifiers
Pros and cons:
+: Strong intuitive appeal and simple to implement.
-: Fail to provide simple models/rules.
-: Expensive computations.
ANN: approximate nearest neighbor with error guarantee 1+ε.
Idea: pre-process the data by devising a data structure (e.g. a ring-cover tree) to speed up the search.
Designed for stored data only.
The time to update the pre-processing step depends on the size of the data set, which may be infinite.
(Slide note: time for updates is not clear.)

Our Algorithm: ANNCAD (Adaptive NN Classification Algorithm for Data Streams)
Model building: pre-assign classes to obtain an approximate result and provide simple models/rules. Decompose the feature space to make classification decisions. Akin to wavelets.
Classification: find the nearest neighbor adaptively by progressively expanding the search area around a test point (shown as a star in the figure).
(Slide notes: Wavelets? Why adaptive?)

Quantize Feature Space and Compute Multi-resolution Coefficients
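Quantize the feature space and record the information in data arrays.
Figure: a set of 100 two-class training points (class I: blue, class II: red).
Figure: multi-resolution representation of a two-class data set; each coarser coefficient is formed as ( ... )/4 from the finer cells.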
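The quantization and averaging step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a 2-d feature space scaled to the unit square, a grid size g that is a power of 2, and coarser coefficients obtained by averaging 2 x 2 groups of cells (suggested by the "/4" on the slide). The names quantize, coarsen, and build_levels are hypothetical.

```python
from collections import Counter

def quantize(points, labels, g):
    """Count training points per class in each cell of a g x g grid over [0, 1]^2."""
    grid = [[Counter() for _ in range(g)] for _ in range(g)]
    for (x, y), c in zip(points, labels):
        i = min(int(x * g), g - 1)  # clamp a coordinate of exactly 1.0 into the last cell
        j = min(int(y * g), g - 1)
        grid[i][j][c] += 1
    return grid

def coarsen(grid):
    """Average each 2 x 2 group of cells into one coefficient per class."""
    h = len(grid) // 2
    out = [[Counter() for _ in range(h)] for _ in range(h)]
    for i in range(h):
        for j in range(h):
            total = Counter()
            for di in (0, 1):
                for dj in (0, 1):
                    total.update(grid[2 * i + di][2 * j + dj])
            out[i][j] = Counter({c: n / 4 for c, n in total.items()})
    return out

def build_levels(points, labels, g):
    """Return the grids from finest (g x g) to coarsest (1 x 1); g is assumed a power of 2."""
    levels = [quantize(points, labels, g)]
    while len(levels[-1]) > 1:
        levels.append(coarsen(levels[-1]))
    return levels
```

For example, a block whose four finer cells hold 8, 10, 9, and 0 points of one class gets the coefficient 27/4 = 6.75 for that class at the next level.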

Hierarchical structure of ANNCAD Classifier
Building a classifier: label each block with its majority class, but only if |C1st| - |C2nd| > 80%; otherwise tag the block M (mixed).
B = 6.75; R = 0.6 → Blue
B = 2; R = 4.25 → M(ix)
B = 3; R = 3.25 → M(ix)
Figure: hierarchical structure of the ANNCAD classifier.
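The labeling rule can be sketched as below. The slide does not say what the 80% is measured against, so this sketch assumes the margin between the two largest coefficients is taken relative to the largest one; that choice reproduces the three examples on the slide but is an assumption, as is the function name label_block.

```python
def label_block(coeffs, threshold=0.8):
    """Return a class label, "M" for a mixed block, or None for an empty block.

    coeffs maps a class name to its coefficient (a count or an averaged count).
    """
    ranked = sorted(coeffs.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] <= 0:
        return None  # empty block: nothing to decide at this level
    top_class, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0.0
    # Assumption: the margin is measured relative to the majority coefficient.
    if (top - second) / top > threshold:
        return top_class
    return "M"
```

With the slide's numbers, label_block({"B": 6.75, "R": 0.6}) gives "B", while the other two blocks are tagged "M".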

Decision Algorithm on the ANNCAD Hierarchy
Compute the distance between the test point and the center of every nonempty neighboring block.
Classified block → label with that block's class (class I or class II in the figure).
Unclassified block → go to the next level.
Block with tag "M" → go back to the previous level.
Figure: the combined classifier over multiple levels.
(Slide note: delay last picture.)
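One possible reading of these rules is sketched below. It assumes the search starts at the finest level and moves to coarser levels, that each level is a square grid of precomputed labels (a class name, "M", or None for an unclassified block), and that "neighboring" means the 3 x 3 group of cells around the test point. None of these details is stated on the slide, and the function names are hypothetical.

```python
import math

def cell_of(x, y, g):
    """Index of the cell of a g x g grid over the unit square that contains (x, y)."""
    return min(int(x * g), g - 1), min(int(y * g), g - 1)

def nearest_neighbor_label(x, y, grid):
    """Label of the classified neighboring block whose center is closest to (x, y)."""
    g = len(grid)
    i, j = cell_of(x, y, g)
    best, best_dist = None, math.inf
    for ni in range(max(i - 1, 0), min(i + 2, g)):
        for nj in range(max(j - 1, 0), min(j + 2, g)):
            label = grid[ni][nj]
            if label is None or label == "M":
                continue  # skip empty and mixed blocks
            d = math.hypot(x - (ni + 0.5) / g, y - (nj + 0.5) / g)
            if d < best_dist:
                best, best_dist = label, d
    return best

def classify(x, y, levels):
    """levels[k] is a grid of labels, finest first; returns a class name or None."""
    for k, grid in enumerate(levels):
        i, j = cell_of(x, y, len(grid))
        label = grid[i][j]
        if label is None:
            continue  # unclassified block: go to the next (coarser) level
        if label != "M":
            return label  # classified block: done
        # Mixed block: go back to the previous (finer) level and use its neighbors.
        return nearest_neighbor_label(x, y, levels[k - 1] if k > 0 else grid)
    return None
```

Returning None when no level yields a label is a choice made for this sketch; a real classifier would need a fallback such as a default class.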

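Incremental Update
Figure: a new training point arrives (cell values shown: 8, 10, 9, 2, 1). In the example the coefficients 6.75, 2, and 3 are unchanged, the remaining coefficient at that level goes from 0.25 to 0.5, and the coarsest coefficient goes from 3 to 3.0625.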
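A hedged sketch of the update: assuming 2-d data and coefficients that average four finer cells, a new point adds 1 to its finest cell, 1/4 to the parent block, 1/16 to the grandparent, and so on, touching one cell per level. This matches the slide's numbers, where 0.25 becomes 0.5 and 3 becomes 3.0625. The names empty_levels and update are hypothetical.

```python
def empty_levels(g):
    """Grids of per-class coefficient dicts, finest (g x g) to coarsest (1 x 1)."""
    levels = []
    while g >= 1:
        levels.append([[{} for _ in range(g)] for _ in range(g)])
        g //= 2
    return levels

def update(levels, x, y, cls, weight=1.0):
    """Add one training point at (x, y) in the unit square; returns the cells touched."""
    touched = 0
    for grid in levels:
        g = len(grid)
        i, j = min(int(x * g), g - 1), min(int(y * g), g - 1)
        grid[i][j][cls] = grid[i][j].get(cls, 0.0) + weight
        weight /= 4  # each coarser coefficient averages four finer cells (2-d case)
        touched += 1
    return touched
```

For a grid of size g the loop runs log2(g) + 1 times, independent of how many points have been seen so far.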

Concept Drift: Adaptation by Exponential Forgetting
Data array with forgetting factor λ, 0 ≤ λ ≤ 1: new array ← λ · old array, then add the new point.
No effect if there are no concept changes.
Adapts quickly (exponentially) if the concept changes.
No extra memory needed (no sliding window required).
(Slide note: sliding window required?)
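The effect of exponential forgetting on a single cell can be illustrated as follows. This is a sketch under assumptions: the forgetting factor is called lam, every class coefficient in the cell is scaled by lam before the new point is added, and the function name forget_and_add is hypothetical.

```python
def forget_and_add(counts, cls, lam=0.98):
    """Decay every class coefficient in one cell by lam, then add the new point."""
    for c in counts:
        counts[c] *= lam  # old evidence fades exponentially
    counts[cls] = counts.get(cls, 0.0) + 1.0
    return counts

# A cell that has only ever seen class "B" starts receiving class "R" points.
cell = {"B": 50.0}
steps = 0
while cell.get("R", 0.0) <= cell["B"]:
    forget_and_add(cell, "R", lam=0.98)
    steps += 1
```

With lam = 0.98 the majority flips after a few dozen points of the new concept, and no past records have to be stored. If the concept does not change, the majority class of the cell stays the same.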

Grid Position and Resolution
Problem: the neighborhood decision depends strongly on the grid position.
Solution: build several classifiers by shifting the grid position by 1/n, then combine the results by voting.
Theorem: let x be a test point, let there be n^d classifiers, and let b(x) be the set of blocks containing x. Then for every z ∉ b(x) there exists y ∈ b(x) such that dist(x, y) < (1 + 1/(n-1)) · dist(x, z).
In practice, only 2-3 classifiers are needed to achieve a good result.
Example: 4 different grids for building 4 classifiers.
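The grid-position problem and the voting remedy can be illustrated with a deliberately simplified classifier that only looks at the single cell containing the query. The sketch assumes the shift is applied equally along every axis, in units of one cell width; the slide's theorem uses n^d classifiers, which this sketch does not reproduce. All function names are hypothetical.

```python
import math
from collections import Counter

def cell(point, g, shift):
    """Index of the cell containing point when the grid is shifted by `shift` cell widths."""
    return tuple(int(math.floor(v * g + shift)) for v in point)

def train(points, labels, g, shift):
    """Per-cell class counts for one grid position."""
    cells = {}
    for p, c in zip(points, labels):
        cells.setdefault(cell(p, g, shift), Counter())[c] += 1
    return cells

def predict(cells, point, g, shift):
    """Majority class of the cell containing point, or None if that cell is empty."""
    counts = cells.get(cell(point, g, shift))
    return counts.most_common(1)[0][0] if counts else None

def vote(points, labels, query, g=4, n=2):
    """Combine n classifiers whose grids are shifted by 0, 1/n, ..., (n-1)/n."""
    votes = Counter()
    for k in range(n):
        s = k / n
        label = predict(train(points, labels, g, s), query, g, s)
        if label is not None:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A training point at (0.24, 0.24) and a query at (0.26, 0.26) are very close, yet with g = 4 they fall into different cells of the unshifted grid; the grid shifted by half a cell puts them in the same cell, so the vote recovers the label.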

Properties of ANNCAD
Compact support: the locality property allows fast updates.
Dealing with noise: a threshold can be set for the classification decision.
Multi-resolution: controls the fineness of the result, or optimizes the use of system resources.
Low complexity (g^d = total number of cells):
Building the classifier: O(min(N, g^d)).
Testing: O(log2(g) + 2^d).
Updating: log2(g) + 1.

Experiments: Synthetic Data
3-d unit cube. Class distribution: class 0 inside a sphere of radius 0.5, class 1 outside.
3000 training examples, 1000 test examples.
Exact ANN: expand the search area by doubling the radius until some training point is reached, then classify the test point with the majority class.
Figures: (a) different initial resolutions; (b) different numbers of ensembles.
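A sketch of how such a data set and the radius-doubling baseline might be set up. The slide does not give the sphere's center, the sampling distribution, or the initial radius, so this assumes a sphere centered at (0.5, 0.5, 0.5), uniform sampling, and a starting radius r0 = 0.05; make_data and expanding_search are hypothetical names.

```python
import math
import random
from collections import Counter

CENTER = (0.5, 0.5, 0.5)  # assumed sphere center; not stated on the slide

def make_data(n, seed=0):
    """Sample n points uniformly from the unit cube; class 0 inside the sphere, 1 outside."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
    labels = [0 if math.dist(p, CENTER) < 0.5 else 1 for p in points]
    return points, labels

def expanding_search(train_points, train_labels, query, r0=0.05):
    """Double the search radius until a training point is found; return the majority class."""
    if not train_points:
        raise ValueError("no training points")
    r = r0
    while True:
        hits = [c for p, c in zip(train_points, train_labels)
                if math.dist(p, query) <= r]
        if hits:
            return Counter(hits).most_common(1)[0][0]
        r *= 2
```

Each query scans the whole training set at every radius, which is the cost a grid-based structure is meant to avoid.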

Experiments (Cont'd): Real Data 1, Letter Recognition
Objective: identify a pixel display as one of the 26 letters.
16 numerical attributes describe each pixel display.
15,000 training examples, 5,000 test examples.
Add 5% noise by randomly assigning classes.
Grid size: 16 units. #Classifiers: 2.
Figure axis: number of rescans.
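The noise step might look like the sketch below. It assumes "5% noise" means that each label is, with probability 0.05, replaced by a class drawn uniformly at random (which can coincide with the original class); the slide does not specify this, and add_label_noise is a hypothetical name.

```python
import random

def add_label_noise(labels, classes, rate=0.05, seed=0):
    """Replace each label with a uniformly random class with probability `rate`."""
    rng = random.Random(seed)
    return [rng.choice(classes) if rng.random() < rate else c for c in labels]

letters = [chr(ord("A") + k) for k in range(26)]
clean = ["A"] * 1000
noisy = add_label_noise(clean, letters, rate=0.05, seed=0)
```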

ANNCAD vs. VFDT (Very Fast Decision Tree)
Real Data 2: Forest Cover Type
Objective: predict the forest cover type.
10 numerical attributes.
12,000 training examples, 9,000 test examples.
Grid size: 32 units. #Classifiers: 2.

Concept Shift: ANNCAD vs CVFDT
Real Data 3: Adult
Objective: determine whether a person has a salary > 50K.
Concept shift simulation: group the records by race.
Forgetting factor λ = 0.98. Grid size: 64. #Classifiers: 2.
CVFDT: Concept-adapting VFDT.
(Slide note: CVFDT not understood.)

Conclusion and Future Work
ANNCAD: an incremental classification algorithm that finds nearest neighbors adaptively.
Suitable for mining data streams: fast update speed.
Exponential forgetting handles concept shift/drift.
Future work: detect concept shift/drift from changes in the class labels of blocks.

THANK YOU!
