1 Document Classification Comparison Evangel Sarwar, Josh Woolever, Rebecca Zimmerman

2 Overview
► What we did
► How we did it
► Results
► Why does this matter?
► Conclusions
► Questions?

3 What did we do?
► Compared document classification accuracy of three pieces of software on data from 20 newsgroups
  • Rainbow (Naïve Bayes)
  • C4.5 (decision tree)
  • Neural network (back-propagation)
► Initially planned on taking a single document and locating other documents similar to it
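The Naïve Bayes approach that Rainbow implements can be sketched in a few lines: per-class priors plus Laplace-smoothed word likelihoods, with classification by the highest log-probability class. This is a minimal illustration on toy data, not Rainbow's actual code; the two-class corpus below is invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, [words]). Returns per-class priors and
    Laplace-smoothed log word likelihoods."""
    class_docs = defaultdict(int)
    class_words = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        class_docs[label] += 1
        class_words[label].update(words)
        vocab.update(words)
    total = len(docs)
    model = {}
    for label in class_docs:
        prior = math.log(class_docs[label] / total)
        n = sum(class_words[label].values())
        # Add-one (Laplace) smoothing over the shared vocabulary
        lik = {w: math.log((class_words[label][w] + 1) / (n + len(vocab)))
               for w in vocab}
        default = math.log(1 / (n + len(vocab)))
        model[label] = (prior, lik, default)
    return model

def classify_nb(model, words):
    def score(label):
        prior, lik, default = model[label]
        return prior + sum(lik.get(w, default) for w in words)
    return max(model, key=score)

# Toy corpus standing in for two of the twenty newsgroups
docs = [("atheism", ["god", "belief", "argument"]),
        ("atheism", ["belief", "reason", "argument"]),
        ("graphics", ["image", "pixel", "render"]),
        ("graphics", ["render", "image", "polygon"])]
model = train_nb(docs)
print(classify_nb(model, ["pixel", "render"]))  # → graphics
```

The "naïve" part is the independence assumption: word log-likelihoods are simply summed, ignoring word order and co-occurrence.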

4 How did we do it?
► Used Rainbow as a benchmark
  • Used it to create a model of the data
  • Trained and tested it with a common set of data
► Used Perl scripts to separate the data into training/testing sets and to create input files for C4.5 and the neural network software
  • Rainbow's ability to output word counts for the top N words was used to create the input files
  • Initially wanted to use word probabilities, but Rainbow can only produce these for classes, not for single documents
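The feature-extraction step the Perl scripts performed — pick the top-N words, then turn each document into a word-count vector in that fixed order — can be sketched as follows. The tiny corpus and N are illustrative, not the project's actual data or vocabulary size.

```python
from collections import Counter

def top_n_vocab(docs, n):
    """Pick the N most frequent words across all training documents."""
    counts = Counter(w for words in docs for w in words)
    return [w for w, _ in counts.most_common(n)]

def to_vector(words, vocab):
    """Word-count feature vector in vocab order, the input row format
    for a learner such as C4.5 or a neural network."""
    c = Counter(words)
    return [c[w] for w in vocab]

train = [["free", "free", "free", "offer", "offer", "spam"],
         ["free", "offer", "notes"]]
vocab = top_n_vocab(train, 2)
print(vocab, to_vector(["free", "offer", "offer"], vocab))  # → ['free', 'offer'] [1, 2]
```

Note that the vocabulary must be fixed from the training set only, so training and test documents map to vectors of the same length and meaning.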

5 How did we do it? (cont.)
► Modified the image neural network from a previous assignment so that it would look at documents instead of images
  • Needed 20 output nodes, one for each newsgroup
  • Took in 1000 words (initially, at least)
  • Started with the default number of hidden nodes (4) and went all the way up to approximately 2000 (2x the number of inputs)
► http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
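The back-propagation the modified network used can be sketched with a one-hidden-layer net in plain Python. XOR stands in for the real 1000-word-count inputs and 20 outputs; the layer sizes, learning rate, and epoch count here are illustrative, not the project's actual settings.

```python
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNet:
    """One hidden layer, sigmoid units, plain per-example backprop."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.5):
        self.lr = lr
        r = lambda: random.uniform(-0.5, 0.5)
        self.w1 = [[r() for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [r() for _ in range(n_hidden)]
        self.w2 = [[r() for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [r() for _ in range(n_out)]

    def forward(self, x):
        self.h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        self.o = [sigmoid(sum(w * hi for w, hi in zip(row, self.h)) + b)
                  for row, b in zip(self.w2, self.b2)]
        return self.o

    def train_one(self, x, target):
        o = self.forward(x)
        # Output deltas: squared-error derivative through the sigmoid
        d_out = [(t - oi) * oi * (1 - oi) for t, oi in zip(target, o)]
        # Hidden deltas: error backpropagated through the output weights
        d_hid = [hi * (1 - hi) * sum(d * self.w2[k][j] for k, d in enumerate(d_out))
                 for j, hi in enumerate(self.h)]
        for k in range(len(self.w2)):
            for j in range(len(self.h)):
                self.w2[k][j] += self.lr * d_out[k] * self.h[j]
            self.b2[k] += self.lr * d_out[k]
        for j in range(len(self.w1)):
            for i in range(len(x)):
                self.w1[j][i] += self.lr * d_hid[j] * x[i]
            self.b1[j] += self.lr * d_hid[j]

def mse(net, data):
    return sum((t[0] - net.forward(x)[0]) ** 2 for x, t in data) / len(data)

# XOR as a tiny stand-in for the real document-classification task
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
net = TinyNet(2, 4, 1)
before = mse(net, data)
for _ in range(2000):
    for x, t in data:
        net.train_one(x, t)
print(before, "->", mse(net, data))
```

Even at this toy scale the cost of the method is visible: every example touches every weight on every epoch, which is why 1000 inputs and up to ~2000 hidden units made full-data epochs take on the order of an hour.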

6 Results
► The decision-tree software got between 15% and 40% accuracy (depending on whether the tree was pruned and whether test data was used)
  • Training set was about 17% after pruning
  • Test set was about 40% after pruning
► The neural network proved to be much more difficult than we at first thought
  • Very, very slow (on the full training data, took approximately 1 hour per epoch on a 1.2 GHz Linux machine)
  • Accuracy did not increase over many trials
  • Spent a great amount of time experimenting with the various parameters (learning rate, momentum, hidden units)
  • Never got better than about 5% accuracy

7 Results (cont.)
► Rainbow
  • Approximately 80% accuracy
► C4.5 and Rainbow made similar errors: misclassified documents within similar groups
  • alt.atheism, talk.religion.misc, talk.politics.misc
  • comp.*
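The overlap errors above are the kind of pattern a confusion matrix surfaces: tally (true, predicted) label pairs and look for off-diagonal mass between related groups. A minimal sketch; the predictions below are hypothetical, not the project's actual output.

```python
from collections import Counter

def confusion(pairs):
    """Tally (true_label, predicted_label) pairs into confusion counts."""
    return Counter(pairs)

def accuracy(pairs):
    return sum(t == p for t, p in pairs) / len(pairs)

# Hypothetical predictions illustrating confusion between similar groups
pairs = [("alt.atheism", "alt.atheism"),
         ("alt.atheism", "talk.religion.misc"),
         ("talk.religion.misc", "alt.atheism"),
         ("comp.graphics", "comp.graphics")]
cm = confusion(pairs)
print(accuracy(pairs))                            # → 0.5
print(cm[("alt.atheism", "talk.religion.misc")])  # → 1
```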

8 Why is text classification important?
► Spam detection
► General mail filtering into folders
► Automatically placing documents at the proper location in a file system

9 Conclusions
► Naïve Bayes empirically seems to be the best for classifying documents
  • At least for newsgroup data
  • Still made errors similar to C4.5, which used only word counts
► If we had pre-processed the data better (perhaps removing outliers and normalizing the information), we could have gotten better results with the neural network
  • Word counts are not enough to "specify" a document; C4.5 seemed to create a tree that did not generalize well to the test data
► Neural networks are definitely not "plug and chug": every application is specific and needs specific parameters
  • Hard to know how much data to use, or how many features
► Most people don't have 10,000 emails to "train" with
  • Should investigate a minimum threshold for getting accurate results

10 Fin. ► Questions?

