Document Filtering Social Web 3/17/2010 Jae-wook Ahn.


1 Document Filtering Social Web 3/17/2010 Jae-wook Ahn

2 Classification Problem
Put an item into a specific category
Spam filtering: spam or no-spam
Topic categorization
Recommendation: interested or not interested

3 Classification — Methods
Rule-based
E.g. lots of capital letters
Spammers learn too
Customizability
Statistical learning

4 Classification — Learning
[Diagram: training data (docs labeled Spam or Nospam) → Train/Learn → Classifier; target data (unlabeled docs) → Classify/Filter with the classifier]

5 Tokenization
Text to word
Word to dictionary
Why dictionary? Word to frequency (or occurrence) look-up
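The text-to-word and word-to-dictionary steps can be sketched as below. This is a minimal illustration, not the code from the slides: the splitting rule and the length filter (more than 2 and fewer than 20 characters) are assumptions.

```python
import re

def getfeatures(doc):
    """Return a dictionary of the distinct lowercase words in doc."""
    # Assumed tokenizer: split on runs of non-word characters.
    words = [w.lower() for w in re.split(r"\W+", doc) if 2 < len(w) < 20]
    # A dictionary makes "does this word occur?" a single look-up.
    return {w: 1 for w in words}

features = getfeatures("Cheap drugs! Buy cheap drugs online")
```

Repeated words collapse into one key, so `features` records occurrence rather than raw frequency.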

6 Training
“Train” classifier with example texts = calculate frequency distribution
class classifier: (pp. 119)
fc: feature → category (frequency)
cc: category → frequency
getfeatures: tokenizer (plug in any method)
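A sketch of the training step under the names used on the slide (fc, cc, getfeatures). The class body and the `simple_tokenizer` helper are illustrative assumptions, not the referenced code from pp. 119.

```python
from collections import defaultdict

class Classifier:
    def __init__(self, getfeatures):
        # fc: feature -> category -> number of docs containing the feature
        self.fc = defaultdict(lambda: defaultdict(int))
        # cc: category -> number of docs seen in that category
        self.cc = defaultdict(int)
        # Any tokenizer can be plugged in here.
        self.getfeatures = getfeatures

    def train(self, item, cat):
        """Update the frequency tables with one labeled example."""
        for feature in self.getfeatures(item):
            self.fc[feature][cat] += 1
        self.cc[cat] += 1

def simple_tokenizer(doc):
    # Hypothetical stand-in tokenizer: distinct lowercase words.
    return set(doc.lower().split())

cl = Classifier(simple_tokenizer)
cl.train("make money fast", "spam")
cl.train("money for nothing", "spam")
cl.train("meeting at noon", "nospam")
```

After these three calls, "money" has been counted twice under "spam" and never under "nospam".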

7 Probability Calculation
Pr(word|classification)
Ex. Pr(“drug”|spam) = 80 spam docs containing “drug” / 100 spam docs in total = 0.8
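The calculation above is a single division. The function name `fprob` and the zero-document guard are assumptions made for this sketch.

```python
def fprob(docs_with_feature, docs_in_category):
    """Pr(feature | category) as a simple relative frequency."""
    if docs_in_category == 0:
        return 0.0  # no training data for this category yet
    return docs_with_feature / docs_in_category

p_drug_given_spam = fprob(80, 100)
```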

8 Weighted Probability
Doc1[… money …](s), Doc2[… money …](s), Doc3[… money …](s), Doc4[……](s), Doc5[……](ns)
Pr(“money”|spam) = 3/4 = 0.75
Pr(“money”|no-spam) = 0/1 = 0
Pr = 0.5 (we don’t know) may be better than Pr = 0 (never)
Ex. After finding one spam instance
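One common way to blend the "we don't know" value of 0.5 with the observed frequency is a weighted average. The slide does not give the formula, so the one below, including the default weight of 1, is an assumption chosen for illustration.

```python
def weighted_prob(basic_prob, total_count, weight=1.0, assumed_prob=0.5):
    """Pull basic_prob toward assumed_prob when evidence is scarce.

    total_count is how many docs (in any category) contained the feature.
    """
    return (weight * assumed_prob + total_count * basic_prob) / (weight + total_count)

# "money" appeared in 3 docs overall (all spam) in the example above.
p_money_spam = weighted_prob(0.75, 3)
p_money_nospam = weighted_prob(0.0, 3)
p_unseen = weighted_prob(0.0, 0)
```

With these inputs the no-spam estimate becomes 0.125 instead of 0, and a feature never seen at all gets exactly 0.5.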

9 Naive Bayesian Classifier
Goal = Pr(Category|Document)
Ex. Pr(Spam|Doc1) = 0.001, Pr(No-spam|Doc1) = 0.5 → Doc1 = No-spam
What do we have? Pr(Feature|Category)
Process = Pr(Feature|Category) → Pr(Document|Category) → Pr(Category|Document)

10 Pr(Document|Category)
Pr(Document|Category) = Pr(Feature1|Cat) * Pr(Feature2|Cat) * Pr(Feature3|Cat) * … * Pr(FeatureN|Cat)
Pr(A ^ B) = Pr(A) * Pr(B)
Assumption: A and B are independent of each other
Not true: social vs. Web, social vs. probability
But still useful
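Under the independence assumption, Pr(Document|Category) is just the product of the per-feature probabilities. The function name and the sample values are illustrative.

```python
from math import prod

def doc_prob(feature_probs):
    """Pr(Document | Category) as the product of Pr(Feature_i | Category)."""
    return prod(feature_probs)

# Hypothetical per-feature probabilities for one document and one category.
p_doc_given_cat = doc_prob([0.5, 0.25, 0.5])
```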

11 Pr(Category|Document)
Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B) (Thomas Bayes)
Pr(Category|Document) = Pr(Document|Category) * Pr(Category) / Pr(Document)
Pr(Category) = # of docs in Cat / total # of docs
Pr(Document) = constant (the same for every category, so it can be ignored when comparing categories)
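The Bayes step can be sketched as follows. Function names and input values are assumptions. Because Pr(Document) is the same for every category, the sketch first computes unnormalized scores and only divides by their sum when true probabilities are wanted.

```python
def category_scores(doc_given_cat, docs_per_cat):
    """Pr(Document|Cat) * Pr(Cat) for each category (unnormalized)."""
    total_docs = sum(docs_per_cat.values())
    return {
        cat: doc_given_cat[cat] * docs_per_cat[cat] / total_docs
        for cat in docs_per_cat
    }

def normalize(scores):
    """Divide by Pr(Document), i.e. the sum of the scores over all categories."""
    z = sum(scores.values())
    return {cat: s / z for cat, s in scores.items()}

# Hypothetical inputs: 4 spam docs, 1 no-spam doc in the training data.
scores = category_scores({"spam": 0.25, "nospam": 0.0625}, {"spam": 4, "nospam": 1})
posteriors = normalize(scores)
```

Comparing `scores` is enough to pick a category; `posteriors` is only needed when the actual probability values matter.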

12 Choosing a Category
Take one with the highest probability
What if, Pr(Spam|Doc) = , Pr(No-spam|Doc) =
Answer may be “Not sure”

13 Choosing a Category
Thresholding
If Pr(Spam|Doc) > 3 * Pr(No-spam|Doc), then spam → which is more reasonable
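A sketch of the thresholding rule: the best category wins only if its score beats every other score by its threshold factor, otherwise a default answer is returned. The function name, the "unknown" default, and the score values are assumptions.

```python
def classify(scores, thresholds, default="unknown"):
    """Pick the top category unless another one is too close to it."""
    best = max(scores, key=scores.get)
    factor = thresholds.get(best, 1.0)  # categories without a threshold use 1
    for cat, score in scores.items():
        if cat != best and scores[best] <= factor * score:
            return default
    return best

thresholds = {"spam": 3.0}  # spam must be more than 3x as likely as no-spam
clear_spam = classify({"spam": 0.2, "nospam": 0.0125}, thresholds)
borderline = classify({"spam": 0.2, "nospam": 0.1}, thresholds)
clear_ham = classify({"spam": 0.1, "nospam": 0.2}, thresholds)
```

Setting a threshold only on "spam" makes the filter cautious in one direction: a document is flagged as spam only with strong evidence.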

14 Persisting Trained Classifier
Classifier so far: dictionaries in memory (fc, cc)
Disappears after quitting the Python interpreter
Should be saved to disk
MySQL: client/server RDBMS
SQLite: file-based RDBMS
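One way to persist the two dictionaries in SQLite, using Python's standard sqlite3 module. The table layout and the function names are assumptions for this sketch; the demo writes to a temporary directory.

```python
import os
import sqlite3
import tempfile

def save_counts(path, fc, cc):
    """Write the fc and cc dictionaries to a SQLite file."""
    con = sqlite3.connect(path)
    with con:  # commits on success
        con.execute("CREATE TABLE IF NOT EXISTS fc "
                    "(feature TEXT, category TEXT, count INTEGER, "
                    "PRIMARY KEY (feature, category))")
        con.execute("CREATE TABLE IF NOT EXISTS cc "
                    "(category TEXT PRIMARY KEY, count INTEGER)")
        con.executemany("INSERT OR REPLACE INTO fc VALUES (?, ?, ?)",
                        [(f, c, n) for f, cats in fc.items() for c, n in cats.items()])
        con.executemany("INSERT OR REPLACE INTO cc VALUES (?, ?)", list(cc.items()))
    con.close()

def load_counts(path):
    """Read the dictionaries back from the SQLite file."""
    con = sqlite3.connect(path)
    fc = {}
    for feature, category, count in con.execute("SELECT feature, category, count FROM fc"):
        fc.setdefault(feature, {})[category] = count
    cc = dict(con.execute("SELECT category, count FROM cc"))
    con.close()
    return fc, cc

db_path = os.path.join(tempfile.mkdtemp(), "classifier.db")
save_counts(db_path, {"money": {"spam": 3}}, {"spam": 4, "nospam": 1})
loaded_fc, loaded_cc = load_counts(db_path)
```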

15 Persisting Trained Classifier
Python shelve
Put/get any picklable Python object to and from disk files
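A minimal shelve round trip for the fc and cc dictionaries. The sample counts and the temporary file location are assumptions made for the demo.

```python
import os
import shelve
import tempfile

fc = {"money": {"spam": 3}}
cc = {"spam": 4, "nospam": 1}

path = os.path.join(tempfile.mkdtemp(), "classifier")

# Store: a shelf behaves like a dictionary backed by disk files.
with shelve.open(path) as db:
    db["fc"] = fc
    db["cc"] = cc

# Restore, e.g. in a later interpreter session.
with shelve.open(path) as db:
    restored_fc = db["fc"]
    restored_cc = db["cc"]
```

Values are pickled on the way in, so nested dictionaries survive unchanged.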

16 Persisting Trained Classifier
DBM, GDBM, BSDDB
Unix database interface and its successors
Disk-based dictionary
GDBM: GNU dbm
BSDDB: Berkeley DB (hash, B-tree)
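The disk-based dictionary idea can be tried with Python 3's generic dbm module, which picks whichever dbm backend is available on the system. This is a sketch with assumed keys and a temporary path; it does not target GDBM or Berkeley DB specifically.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "counts")

# "c" opens the database for reading and writing, creating it if needed.
with dbm.open(path, "c") as db:
    db["spam"] = "4"      # keys and values are stored as bytes
    db["nospam"] = "1"

with dbm.open(path, "r") as db:
    spam_docs = int(db["spam"])
    keys = sorted(k.decode() for k in db.keys())
```

Unlike shelve, a dbm file holds only byte strings, so counts have to be converted on the way in and out.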

17 Improved Features
So far, features = words
Phrases (n-gram): “social web”, “spam filter”, etc.
Attribute: Has_Many_Uppercases = True
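A sketch of a feature extractor that adds word n-grams and an uppercase attribute to the plain word features. The function name and the 30% uppercase cutoff are assumptions.

```python
def ngram_features(text, n=2):
    """Return words, word n-grams, and an uppercase flag as features."""
    raw_words = text.split()
    words = [w.lower() for w in raw_words]
    features = {w: 1 for w in words}
    # Phrases: every run of n consecutive words.
    for i in range(len(words) - n + 1):
        features[" ".join(words[i:i + n])] = 1
    # Attribute: assumed cutoff, more than 30% of words fully uppercase.
    shouting = [w for w in raw_words if w.isupper() and len(w) > 1]
    if raw_words and len(shouting) / len(raw_words) > 0.3:
        features["Has_Many_Uppercases"] = 1
    return features

plain = ngram_features("social web spam filter")
loud = ngram_features("BUY NOW cheap")
```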

18 Alternative Methods
Supervised learning methods: neural network, Support Vector Machine, decision tree
Software packages: Weka, R, SPSS Clementine, etc.

19 Weka Example
Example data: weather condition → to play or not to play?
4 attributes, 1 class variable

20 Weka Example

21 Weka Example

22 Weka Example

23 Parsing RSS Feeds
Problem: extract texts from the RSS structure
RSS feeds are XML
Parsers: SAX, DOM, out-of-box parser

24 SAX and DOM
SAX (Simple API for XML): serial access parser
Stream of XML data goes in
Event-driven parsing
DOM (Document Object Model)
Uses the hierarchical structure for parsing

25 SAX Example
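The original slide content is not available in this transcript. As a stand-in, here is a small event-driven sketch with Python's standard xml.sax module that collects item titles from an RSS string. The sample feed and the handler class are assumptions.

```python
import xml.sax

RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

class TitleHandler(xml.sax.ContentHandler):
    """Collect the text of every <title> found inside an <item>."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == "item":
            self.in_item = True
        elif name == "title" and self.in_item:
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "title" and self.in_title:
            self.titles.append("".join(self.buffer))
            self.in_title = False
        elif name == "item":
            self.in_item = False

handler = TitleHandler()
xml.sax.parseString(RSS.encode("utf-8"), handler)
item_titles = handler.titles
```

The parser never builds a tree; the handler has to track where it is in the document through its own flags.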

26 DOM Example
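The original slide content is not available in this transcript. As a stand-in, this sketch uses Python's standard xml.dom.minidom to load a whole RSS string into a tree and walk it. The sample feed is an assumption.

```python
from xml.dom import minidom

RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

dom = minidom.parseString(RSS)

titles = []
for item in dom.getElementsByTagName("item"):
    # Navigate the hierarchy: item -> first title element -> its text node.
    title = item.getElementsByTagName("title")[0]
    titles.append(title.firstChild.data)
```

The whole document sits in memory, which makes navigation simple at the cost of space for large feeds.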

27 Ready-made Parser
Universal Feed Parser

28 Universal Feedparser

29 Core Attributes
Follows RSS/ATOM syntax normalization
However, not always updated
/atom10:feed/atom10:updated
/atom03:feed/atom03:modified
/rss/channel/pubDate
/rss/channel/dc:date
/rdf:RDF/rdf:channel/dc:date
/rdf:RDF/rdf:channel/dcterms:modified
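To illustrate what normalizing one attribute across feed formats involves, the sketch below looks up a feed-level date from three of the locations listed above using the standard xml.etree.ElementTree module. It is not Universal Feed Parser's implementation; the function name, the candidate table, and the sample feeds are assumptions.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Root tag -> candidate paths, tried in order.
CANDIDATES = {
    ATOM + "feed": [ATOM + "updated"],
    "rss": ["channel/pubDate", "channel/" + DC + "date"],
}

def feed_updated(xml_text):
    """Return the raw feed-level date string, or None if none is found."""
    root = ET.fromstring(xml_text)
    for path in CANDIDATES.get(root.tag, []):
        node = root.find(path)
        if node is not None and node.text:
            return node.text.strip()
    return None

rss_pub = feed_updated(
    "<rss version='2.0'><channel>"
    "<pubDate>Wed, 17 Mar 2010 10:00:00 GMT</pubDate></channel></rss>")
rss_dc = feed_updated(
    "<rss version='2.0' xmlns:dc='http://purl.org/dc/elements/1.1/'><channel>"
    "<dc:date>2010-03-17T10:00:00Z</dc:date></channel></rss>")
atom = feed_updated(
    "<feed xmlns='http://www.w3.org/2005/Atom'>"
    "<updated>2010-03-17T10:00:00Z</updated></feed>")
```

A caller sees one value regardless of which element the feed actually used.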

30 Advanced Features
Date parsing
HTML sanitization
Content normalization
Namespace handling
and more...

31 Date Parsing
Parses various date formats to Python 9-tuples
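To show what a Python 9-tuple looks like, this sketch parses two date styles that appear in feeds using only the standard library. It is a stand-in, not Universal Feed Parser's date handling, which covers many more formats; the function name and the list of formats are assumptions.

```python
import time
from email.utils import parsedate

ISO_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d")

def parse_date(text):
    """Return a 9-tuple (year, month, day, hour, min, sec, wday, yday, isdst) or None."""
    # ISO 8601 style dates, as used in Atom feeds.
    for fmt in ISO_FORMATS:
        try:
            return tuple(time.strptime(text, fmt))
        except ValueError:
            continue
    # RFC 822 style dates, as used in RSS pubDate.
    parsed = parsedate(text)
    return tuple(parsed) if parsed is not None else None

iso = parse_date("2010-03-17T10:00:00Z")
rfc = parse_date("Wed, 17 Mar 2010 10:00:00 GMT")
bad = parse_date("not a date")
```

Only the first six fields are compared below, because email.utils.parsedate does not fill in usable weekday, day-of-year, or DST values.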

32 Summary
Document filtering: classification problem
Statistical learning-based methods
RSS parsing: XML parsers, RSS parsers

