
1 Time and Space Optimization of Document Content Classifiers. Dawei Yin, Henry S. Baird, and Chang An. Computer Science and Engineering Department, Lehigh University.

2 Background. Document image analysis; pixel-accurate document image content extraction. The k-Nearest Neighbors (kNN) classifier is suitable for this problem.

3 K Nearest Neighbors Classifier. For each test sample, find the k (e.g., 5) nearest training samples and choose the most frequent class among them. Problems (both space and time): training sets are too large to fit in main memory (space); brute force must calculate the distances to all training samples (time). How can we speed this up?
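A minimal sketch of the brute-force classifier the slide describes (NumPy-based; the function and array names are illustrative, not from the talk):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=5):
    # Brute force: compute the distance from the query to every
    # training sample -- exactly the cost the talk wants to avoid.
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]        # most frequent class among k
```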

4 Related work. K-d trees (Bentley et al., 1975). Voronoi methods (e.g., Preparata & Shamos, 1985). ANN (Arya & Mount, 2001). Locality-sensitive hashing (Indyk et al., 2005). Hashed k-d trees (Baird, Casey & Moll, 2006).

5 Hashed k-d trees. Split the feature space into a large number of bins; training and test samples fall into bins. Hashing into bins makes lookup fast: distances are calculated only between samples within the same bin. This may not find the exact k nearest neighbors, but the loss of accuracy is often small.
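A sketch of the binning idea, assuming samples are NumPy feature vectors and a simple uniform-quantization hash with cell width `cell` (the actual hashed k-d tree chooses splitting dimensions and thresholds from the data; these names are illustrative):

```python
import numpy as np
from collections import defaultdict, Counter

class HashedBinKNN:
    def __init__(self, k=5, cell=0.25):
        self.k, self.cell = k, cell
        self.bins = defaultdict(list)    # bin key -> [(sample, label), ...]

    def _key(self, x):
        # Quantize each coordinate: samples in the same cell share a bin.
        return tuple(int(np.floor(v / self.cell)) for v in x)

    def add(self, x, label):
        self.bins[self._key(x)].append((x, label))

    def classify(self, q):
        # Distances are computed only within the query's own bin, so the
        # exact k nearest neighbors may be missed.
        candidates = self.bins.get(self._key(q), [])
        if not candidates:
            return None                  # unclassifiable: empty bin
        nearest = sorted(candidates, key=lambda s: np.linalg.norm(s[0] - q))
        votes = Counter(label for _, label in nearest[:self.k])
        return votes.most_common(1)[0][0]
```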

6 Pre-decimation. Throw away, at random, most of the training samples before loading them into bins. This saves both space and time: a 9-times speedup. The loss of accuracy, again, is small: less than 1%.
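The mechanism is one line of logic; a minimal sketch, assuming a fixed keep fraction (e.g., 1/100, as in the results later) and hypothetical names:

```python
import random

def pre_decimate(training_samples, keep_fraction=0.01, seed=0):
    # Discard most training samples uniformly at random, before binning.
    rng = random.Random(seed)
    return [s for s in training_samples if rng.random() < keep_fraction]
```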

7 But…

8 Pre-decimation problems (figures; slides 8-10 of the deck).

11 Bin-decimation. The key idea of bin-decimation is to enforce an upper bound M, approximately, on the number of training samples stored in each bin. We propose an adaptive statistical technique to do this online (while reading the training data exactly once) and in linear time.

12 Bin-decimation (figure).

13 Bin-decimation. If we read the training data twice, we can easily enforce the bound M exactly on every bin, but this is slow. We can read the training data only once ("on-line") and still enforce the bound approximately, if this assumption holds: for every bin, the samples falling into that bin tend to be distributed uniformly within the sequence of training samples, in the order in which they are read.

14 Online Bin-decimation. Let N be the total number of training samples. At time t, let N_t be the number of samples read so far, N_t(b) the number of those that have fallen into bin b, and S_e(b) = N_t(b) * N / N_t the estimated number of samples that belong in bin b. When the sample read at time t falls into bin b, the probability of keeping it is p = min(1, M / S_e(b)); with this probability, we pseudorandomly keep the sample.
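A sketch of the online rule using the quantities defined above (the `bin_key` hash and the stream format are assumptions carried over from the earlier sketches; the keep probability follows the slide's definitions):

```python
import random
from collections import defaultdict

def bin_decimate(sample_stream, bin_key, N, M=100, seed=0):
    rng = random.Random(seed)
    bins = defaultdict(list)     # b -> samples kept for bin b
    seen = defaultdict(int)      # N_t(b): samples seen so far in bin b
    for t, (x, label) in enumerate(sample_stream, start=1):   # N_t = t
        b = bin_key(x)
        seen[b] += 1
        s_e = seen[b] * N / t    # S_e(b) = N_t(b) * N / N_t
        if rng.random() < min(1.0, M / s_e):
            bins[b].append((x, label))    # pseudorandomly keep this sample
    return bins
```

If the uniformity assumption of slide 13 holds, S_e(b) tracks each bin's final population throughout the stream, so a dense bin keeps roughly M samples in expectation while sparse bins are left nearly intact.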

15 Experiments. The training set contains 1,658,060 samples; the test set contains 340,054 samples. Test and training images have been collected from books, magazines, newspapers, technical articles, students' notes, etc. Each pixel is a sample.

16 Pre-decimation results. Runtime and accuracy (on separate scales), as functions of the pre-decimation factor. Up to a factor of 1/100, accuracy falls only 6%, while runtime falls by a factor of 100.

17 Pre-decimation results. The number of unclassifiable samples, and accuracy, as functions of the pre-decimation factor. For factors beyond 1/100, the number of unclassifiable samples increases dramatically.

18 Bin-decimation results. Accuracy and runtime of bin-decimation as functions of M, the runtime parameter controlling maximum bin size. Accuracy remains nearly unaffected until M falls below 5, whereas runtime drops significantly even for M greater than 100.

19 Comparison. Bin-decimation vs. pre-decimation with parameters chosen so that they consume roughly the same runtime (18 CPU seconds): bin-decimation achieves higher accuracy (roughly 6% better). With parameters chosen so that they achieve roughly the same accuracy (roughly 77% correct): bin-decimation consumes less time (less than 1/10th).

23 A larger-scale experiment. The training set contains 33 images, a total of 86.7M samples; the test set contains 83 images, a total of 221.6M samples. Experiment environment: the high-performance computing (Beowulf) cluster at Lehigh University. The HPC cluster contains 40 nodes, and each node is equipped with an 8-core Intel Xeon 1.8 GHz processor and 16 GB of memory.

24 Results. A 23-times speedup with less than 0.1% loss of accuracy (M=100). A 60-times speedup with less than 5% loss of accuracy (M=30). Bin-decimation can even improve accuracy slightly (+0.06%) while still speeding up by a factor of 2.3 (M=500).

25 Future work. More systematic trials: variance resulting from randomization. Protect against: imbalanced training sets; concentration (too many samples in too few bins).

26 Thank you!


