
1 An Adaptive Nearest Neighbor Classification Algorithm for Data Streams. Yan-Nei Law and Carlo Zaniolo, University of California, Los Angeles. PKDD, Porto, 2005.

2 Outline
- Related Work
- ANNCAD
- Properties of ANNCAD
- Conclusion

3 Classifying Data Streams
Problem statement: we seek an algorithm for classifying data streams with numerical attributes; it will work for totally ordered domains too.
Desiderata:
- Fast update speed for newly arriving records.
- Only a single pass over the data.
- Incremental algorithms are needed.
- Coping with concept changes.
Classical mining algorithms were not designed for data streams and need to be replaced or modified.

4 Classifying Data Streams: Related Work
Hoeffding trees:
- VFDT and CVFDT build decision trees incrementally.
- Require a large number of examples to obtain a classifier with fair performance.
- Unsatisfactory performance when the training set is small.
Ensembles:
- Combine base models by a voting technique.
- Suitable for coping with concept drift.
- Fail to provide a simple model and understanding of the problem.

5 State of the Art: Nearest Neighbor Classifiers
Pros and cons:
- +: Strong intuitive appeal and simple to implement.
- -: Fail to provide simple models/rules.
- -: Expensive computation.
ANN: approximate nearest neighbor with error guarantee 1+ε:
- Idea: pre-process the data into a data structure (e.g. a ring-cover tree) to speed up searches.
- Designed for stored data only.
- Time to update the pre-processing structure depends on the size of the data set, which may be unbounded.

6 Our Algorithm: ANNCAD (Adaptive NN Classification Algorithm for Data Streams)
Model building:
- Pre-assign classes to obtain an approximate result and provide simple models/rules.
- Decompose the feature space to make classification decisions.
- Akin to wavelets.
Classification:
- Find the NN for classification adaptively.
- Progressively expand the search of the area near a test point (the star in the figure).

7 Quantize Feature Space and Compute Multi-resolution Coefficients
Quantize the feature space and record the information into data arrays. Each coarser-level coefficient is the average of the 2x2 block of finer-level cells beneath it, e.g. (8+9+10+0)/4 = 6.75.
Figures: a set of 100 two-class training points, and the multi-resolution representation of the two-class data set.
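The quantization and averaging steps above can be sketched as follows; this is a minimal illustration assuming points lie in the unit square and the grid size g is a power of two (the function names are hypothetical, not from the paper):

```python
import numpy as np

def quantize(points, g):
    """Count 2-D points from the unit square [0,1) x [0,1) into a g x g grid."""
    grid = np.zeros((g, g))
    for x, y in points:
        grid[int(y * g), int(x * g)] += 1
    return grid

def coarsen(grid):
    """Next-coarser level: average each 2x2 block of cells, wavelet-style."""
    g = grid.shape[0]
    return grid.reshape(g // 2, 2, g // 2, 2).mean(axis=(1, 3))
```

Applied to the slide's four cell counts 8, 9, 10 and 0, `coarsen` yields the coefficient 6.75.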

8 Building a Classifier
Hierarchical structure of the ANNCAD classifier. Label each block with its majority class, but only if the majority's lead over the runner-up exceeds 80% of the block total, i.e. label only if |C_1st| - |C_2nd| > 80%; otherwise tag the block M(ix).
Examples from the figure: B=6.75, R=0.6 → Blue; B=2, R=4.25 → M(ix); B=3, R=3.25 → M(ix).
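A minimal sketch of this labeling rule, assuming the threshold compares the lead of the top class over the runner-up against the block's total count (which matches all three of the slide's examples); `label_block` is a hypothetical name:

```python
def label_block(class_counts, threshold=0.8):
    """Return the majority class if its lead over the runner-up exceeds
    `threshold` of the block total; "Mix" otherwise; None for empty blocks."""
    total = sum(class_counts.values())
    if total == 0:
        return None  # empty block stays unlabeled
    ranked = sorted(class_counts.values(), reverse=True)
    second = ranked[1] if len(ranked) > 1 else 0
    if (ranked[0] - second) / total > threshold:
        return max(class_counts, key=class_counts.get)
    return "Mix"
```

With the slide's counts: (6.75 - 0.6) / 7.35 ≈ 0.84 > 0.8, so that block is labeled Blue, while the other two blocks fall below the threshold and stay Mix.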

9 Decision Algorithm on the ANNCAD Hierarchy
The combined classifier works over multiple levels:
- Unclassified block: go to the next level.
- Block with tag "M": go back to the previous level.
- Then compute the distance between the test point and the center of every nonempty neighboring block, and take the nearest block's class.
Figure: one classified block is labeled class I, the other class II.
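The final distance step can be sketched like this (a simplified illustration; the dict layout and the function name are assumptions, not the paper's data structures):

```python
import math

def nearest_block_class(point, blocks, block_size):
    """Pick the class of the nonempty neighboring block whose center is
    closest to the test point. `blocks` maps integer block coords -> label."""
    def center(coords):
        return [(c + 0.5) * block_size for c in coords]
    best = min(blocks, key=lambda b: math.dist(point, center(b)))
    return blocks[best]
```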

10 Incremental Update
Worked example from the figure: a new training point falls into the last cell of the finest 4x4 grid, so only that cell and the coefficients above it change.
Finest level before: rows (0 8 8 0), (10 9 0 0), (2 10 0 1), (0 0 0 0); level above: (6.75, 2 / 3, 0.25); top coefficient: 3.
After the new point, the last row becomes (0 0 0 1), the affected quadrant average rises from 0.25 to (0+1+0+1)/4 = 0.5, and the top coefficient becomes (6.75+2+3+0.5)/4 = 3.0625.
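The update above can be sketched as follows, assuming each coarser coefficient is the average of the 2x2 block beneath it (as in the figure); `insert_point` and the list-of-lists layout are illustrative choices, not the paper's implementation:

```python
def insert_point(levels, row, col):
    """Increment one finest-level cell and recompute only the coefficients
    directly above it, one per level, so the cost is O(log2(g)) per point.
    levels[0] is the finest g x g grid; coarser levels follow."""
    levels[0][row][col] += 1
    for k in range(1, len(levels)):
        row, col = row // 2, col // 2
        fine = levels[k - 1]
        r, c = 2 * row, 2 * col
        levels[k][row][col] = (fine[r][c] + fine[r][c + 1]
                               + fine[r + 1][c] + fine[r + 1][c + 1]) / 4
```

Running it on the slide's arrays reproduces the figure: the quadrant coefficient moves from 0.25 to 0.5 and the top coefficient from 3 to 3.0625.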

11 Concept Drift: Adaptation by Exponential Forgetting
Scale the data array by a forgetting factor 0 ≤ λ ≤ 1 at each step: x_new ← λ · x_old.
- No effect if there are no concept changes.
- Adapts quickly (exponentially) if the concept changes.
- No extra memory needed (unlike sliding-window approaches).
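A minimal sketch of the forgetting step, applied to every stored coefficient before each new arrival (`forget` and the list-of-lists layout are assumptions for illustration):

```python
def forget(data_array, lam):
    """Scale every coefficient by the forgetting factor lam in [0, 1].
    With lam = 1 nothing is forgotten; smaller lam decays old evidence
    exponentially, with no sliding-window buffer to maintain."""
    return [[lam * v for v in row] for row in data_array]
```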

12 Grid Position and Resolution
Problem: the neighborhood decision strongly depends on the grid position.
Solution: build several classifiers by shifting the grid position by 1/n of a cell, then combine the results by voting.
Theorem: let x be a test point, with n^d classifiers and b(x) the set of blocks containing x; then for every z ∉ b(x) there exists y ∈ b(x) with dist(x,y) < (1 + 1/(n-1)) · dist(x,z).
In practice, only 2 to 3 classifiers are needed to achieve a good result.
Example: 4 different grids for building 4 classifiers.
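The voting scheme can be sketched as follows; the shift k/(n·g), i.e. 1/n of a cell width per classifier, is an assumption about the slide's shifting scheme, and both function names are hypothetical:

```python
from collections import Counter

def shifted_cell(point, g, shift):
    """Quantize a point on a g-per-axis grid whose origin is offset by `shift`."""
    return tuple(int((c + shift) * g) for c in point)

def ensemble_predict(point, classifiers, g, n):
    """Each classifier sees the point under its own shifted grid;
    the majority vote over the ensemble wins."""
    votes = [clf(shifted_cell(point, g, k / (n * g)))
             for k, clf in enumerate(classifiers)]
    return Counter(votes).most_common(1)[0][0]
```

A point near a cell boundary lands in different cells under the shifted grids, so the vote smooths out the dependence on grid position.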

13 Properties of ANNCAD
- Compact support: the locality property allows fast updates.
- Dealing with noise: a threshold can be set for the classification decision.
- Multi-resolution: controls the fineness of the result, or optimizes system resources.
- Low complexity (g^d = total number of cells):
  - Building the classifier: O(min(N, g^d)).
  - Testing: O(log2(g) + 2^d).
  - Updating: O(log2(g) + 1).

14 Experiments
Synthetic data:
- 3-d unit cube.
- Class distribution: class 0 inside a sphere with radius 0.5, class 1 outside.
- 3,000 training examples; 1,000 test examples.
Exact ANN baseline:
- Expand the search area by doubling the radius until reaching some training point.
- Classify the test point with the majority class.
Figures: (a) different initial resolutions; (b) different numbers of ensembles.

15 Experiments (Cont'd)
Real data 1: Letter Recognition
- Objective: identify which of the 26 letters a pixel display represents.
- 16 numerical attributes describe the pixel display.
- 15,000 training examples; 5,000 test examples.
- 5% noise added by randomly reassigning class labels.
- Grid size: 16 units; number of classifiers: 2.

16 ANNCAD vs. VFDT
Real data 2: Forest Cover Type
- Objective: predict the forest cover type.
- 10 numerical attributes.
- 12,000 training examples; 9,000 test examples.
- Grid size: 32 units; number of classifiers: 2.

17 Concept Shift: ANNCAD vs. CVFDT
Real data 3: Adult
- Objective: determine whether a person's salary exceeds 50K.
- Concept-shift simulation: group the data by race.
- λ = 0.98; grid size: 64; number of classifiers: 2.

18 Conclusion and Future Work
ANNCAD:
- An incremental classification algorithm that finds adaptive NNs.
- Suitable for mining data streams: fast update speed.
- Exponential forgetting for concept shift/drift.
Future work: detect concept shift/drift through changes in the class labels of blocks.

19 THANK YOU!
