Learning to Classify Documents. Edwin Zhang, Computer Systems Lab, 2009-2010.

Introduction: This project classifies documents by topic. It uses a Bayesian method that calculates conditional probabilities, learns from a set of training documents, and chooses a set of features for each category. The program is written in Java.

Background: A Naïve Bayes classifier (Bayesian method) computes the conditional probability p(T|D) of every topic T for a given document D, then assigns D to the topic with the largest conditional probability. Source: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
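The rule on this slide can be written out as below. This is a sketch under the usual naive Bayes assumption that the terms of D are conditionally independent given the topic. The symbols w_1, ..., w_n (the terms of D) and \hat{T} (the chosen topic) are notation introduced here, not taken from the slides.

```latex
\hat{T}
  = \arg\max_{T} \; p(T \mid D)
  = \arg\max_{T} \; \frac{p(T)\, p(D \mid T)}{p(D)}
  = \arg\max_{T} \; p(T) \prod_{i=1}^{n} p(w_i \mid T)
```

The denominator p(D) is the same for every topic, so it can be dropped when only the largest value matters.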

Background: The program has two steps, learning and prediction.

Learning: The learning step uses training documents and conditional probability. Features are selected based on how often terms appear in certain documents. Image: http://www.dot.state.mn.us/consult/images/j0341469.jpg

Prediction: The prediction step predicts what an unknown document is about, based on what was learned in the learning step. Image: http://www.deafsports.co.nz/WebImages/documents.jpg
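One possible shape for a prediction step is sketched below. It is not the author's implementation: the class name NaiveBayesSketch, its methods, and the use of add-one (Laplace) smoothing with log probabilities are all assumptions made for illustration. The category names "tennis" and "other" in the usage note come from the slides.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of naive Bayes learning and prediction; not the author's code.
class NaiveBayesSketch {
    // termCounts.get(category).get(term) = occurrences of term in that category's training documents
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    // Number of training documents seen per category
    private final Map<String, Integer> docCounts = new HashMap<>();
    // Total number of term occurrences seen per category
    private final Map<String, Integer> totalTerms = new HashMap<>();
    // Every distinct term seen in any training document
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Learning: record the terms of one training document under its category.
    void train(String category, List<String> terms) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = termCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
            totalTerms.merge(category, 1, Integer::sum);
            vocabulary.add(term);
        }
    }

    // Prediction: return the category with the largest log p(T) + sum of log p(w_i | T),
    // or null if nothing has been trained yet.
    String predict(List<String> terms) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double logProb = Math.log((double) docCounts.get(category) / totalDocs);
            Map<String, Integer> counts = termCounts.get(category);
            int total = totalTerms.getOrDefault(category, 0);
            for (String term : terms) {
                int count = counts.getOrDefault(term, 0);
                // Add-one smoothing (an assumption) keeps unseen terms from zeroing the product.
                logProb += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = category;
            }
        }
        return best;
    }
}
```

For example, after calling train("tennis", ...) and train("other", ...) on a few term lists, predict(...) returns whichever of the two categories gives the unknown document's terms the higher smoothed probability. Summing logarithms instead of multiplying probabilities avoids numeric underflow on long documents.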

Development: I created the Category, Document, and Terms classes. The Category class handles the categories, the Document class handles the documents, and the Terms class handles the terms that appear in each document.

Category: Each category contains an array of documents. My categories started out as "tennis" and "other"; I added more categories as my program started working.

Document Class: Each document contains an array of terms. The documents were my training documents.

Terms Class: The Terms class handles all the terms that appear in the training documents. For each term, it keeps an array of counts of how many times the term appears in documents, with one count per category. Each term is also assigned a score: Score = (number of times in category A + 1) / (number of times in category B + 1), where the 1 is added to avoid dividing by 0. The method used to calculate the score varied as my program developed.
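The score ratio described on this slide can be sketched as a small helper. The class name TermScore and the method name score are hypothetical, and since the slides note that the scoring method changed over time, this shows only the add-one ratio stated here.

```java
// Hypothetical helper illustrating the add-one score ratio described on the slide.
class TermScore {
    // Returns (count in category A + 1) / (count in category B + 1).
    // Adding 1 to the denominator avoids division by zero when the term never appears in B.
    static double score(int countInA, int countInB) {
        return (countInA + 1.0) / (countInB + 1.0);
    }
}
```

A score above 1 means the term is more frequent in category A than in category B, and a score below 1 means the reverse. A term seen 9 times in A and never in B scores 10.0.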

Development (continued): The program creates an array of categories, reads in all my training documents, and stores all the terms that appear in an array of Terms. It then sorts the array of terms by score for each category and chooses the top 25 terms from the sorted array for each category.
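The sort-and-select step above might look like the following sketch. The TopTerms and Term classes and the topK method are hypothetical stand-ins for the author's Terms class, and a List is used in place of an array for brevity.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of choosing the highest-scoring terms for one category.
class TopTerms {
    // Minimal stand-in for a term with a precomputed score for one category.
    static class Term {
        final String text;
        final double score;

        Term(String text, double score) {
            this.text = text;
            this.score = score;
        }
    }

    // Returns up to k terms with the highest scores, best first, without modifying the input.
    static List<Term> topK(List<Term> terms, int k) {
        List<Term> sorted = new ArrayList<>(terms);
        sorted.sort(Comparator.comparingDouble((Term t) -> t.score).reversed());
        int limit = Math.min(Math.max(k, 0), sorted.size());
        return new ArrayList<>(sorted.subList(0, limit));
    }
}
```

Calling topK(terms, 25) once per category would give each category's 25 feature terms, or fewer if the training documents contain fewer distinct terms.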

Development (continued): I still need to test my program's learning and write the prediction part. Once my program works for two categories, I will add more categories. Image: http://www.filibeto.org/sun/lib/nonsun/oracle/11.1.0.6.0/B28359_01/text.111/b28303/img/ccapp018.gif

Expected Results: The more training documents there are, the better the results will likely be. In addition, different ways of calculating the score will likely produce different results, so I may experiment with the scoring method.

Discussion: I have finished the learning part of the program and now need to write the prediction part. Once my program runs correctly, I will discuss the results.

Works Cited
http://www.nltk.org/book
My dad
Chai, Kian Ming Adam, Hai Leong Chieu, and Hwee Tou Ng. ACM Portal. Association for Computing Machinery, 2002. Web. 14 Jan. 2010.

Works Cited (continued)
Eyheramendy, Susana, and David Madigan. "A Flexible Bayesian Generalized Linear Model for Dichotomous Response Data with an Application to Text Categorization." Lecture Notes-Monograph Series 54 (2007): 76-91. JSTOR. Web. 25 Oct. 2009.
Lavine, Michael, and Mike West. "A Bayesian Method for Classification and Discrimination." Canadian Journal of Statistics 20.4 (1992): 451-461. JSTOR. Web. 14 Jan. 2010.
