Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.

1 Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09

2 Motivation
- Classify Flickr tags into broad categories such as what, where, when and who
- Easier indexing and navigation
- WordNet is usually used for classification but has limited coverage

3 Example

4 The ClassTag System

5 Classifying Wikipedia Articles
- Using only metadata (i.e. Categories and Templates) – high scalability
- Supervised classifier
  - Articles as objects
  - WordNet noun semantic categories as classification classes
  - Categories and Templates as features
  - Support Vector Machine (SVM) as classifier
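The feature setup on this slide (articles as objects, categories and templates as features) can be sketched as below. This is a minimal illustration, not the authors' code: the article titles, category names, template names, and the binary encoding are assumptions, and the SVM itself is left out.

```python
from typing import Dict, List

Metadata = Dict[str, List[str]]

def build_vocabulary(articles: Dict[str, Metadata]) -> Dict[str, int]:
    """Assign a column index to every distinct category/template feature."""
    vocab: Dict[str, int] = {}
    for meta in articles.values():
        for kind in ("categories", "templates"):
            for name in meta.get(kind, []):
                vocab.setdefault(f"{kind}:{name}", len(vocab))
    return vocab

def to_vector(meta: Metadata, vocab: Dict[str, int]) -> List[int]:
    """Binary vector: 1 if the article carries the category/template."""
    vec = [0] * len(vocab)
    for kind in ("categories", "templates"):
        for name in meta.get(kind, []):
            idx = vocab.get(f"{kind}:{name}")
            if idx is not None:
                vec[idx] = 1
    return vec

# Hypothetical toy metadata, for illustration only.
ARTICLES = {
    "London": {
        "categories": ["Capitals in Europe"],
        "templates": ["Infobox settlement"],
    },
    "George W. Bush": {
        "categories": ["Presidents of the United States"],
        "templates": ["Infobox officeholder"],
    },
}

VOCAB = build_vocabulary(ARTICLES)
VECTORS = {title: to_vector(meta, VOCAB) for title, meta in ARTICLES.items()}
```

Vectors of this shape, paired with WordNet noun category labels for the ground-truth articles, are what a supervised classifier such as an SVM would be trained on.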

6 Categories and Templates

7

8 Supervised Classification
- Ground truth
  - All Wikipedia articles that match WordNet nouns
- Data sparsity
  - WordNet categories underrepresented (10 out of 25)
  - Articles have very few features

9 Reducing Data Sparsity
- Using category and template network transclusion
- … but noise is added
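One way to picture how following the category network adds features (and noise) is a bounded breadth-first walk from an article's direct categories. The sketch below is an assumption about the mechanics, not the authors' implementation; the graph, the `max_arcs` parameter name, and the category names are hypothetical.

```python
from collections import deque
from typing import Dict, List

def ancestors_by_layer(
    start: List[str], parents: Dict[str, List[str]], max_arcs: int
) -> Dict[str, int]:
    """Map each reachable category to the fewest arcs needed to reach it.

    Direct categories of the article count as one arc away.
    """
    layer = {c: 1 for c in start}
    queue = deque(start)
    while queue:
        node = queue.popleft()
        if layer[node] >= max_arcs:
            continue  # do not expand beyond the arc limit
        for parent in parents.get(node, []):
            if parent not in layer:
                layer[parent] = layer[node] + 1
                queue.append(parent)
    return layer

# Hypothetical fragment of a category network.
PARENTS = {
    "Capitals in Europe": ["Capitals", "Cities in Europe"],
    "Capitals": ["Cities"],
    "Cities in Europe": ["Cities", "Europe"],
    "Cities": ["Human settlements"],
}

ONE_ARC = ancestors_by_layer(["Capitals in Europe"], PARENTS, 1)
THREE_ARCS = ancestors_by_layer(["Capitals in Europe"], PARENTS, 3)
```

With one arc the article has a single feature; with three arcs it has five, including very broad ones such as "Europe", which is where the added noise comes from.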

10 System Optimization
- Number of arcs traversed in
  - Category network
  - Template network
- Choice of weighting function
  - Term Frequency (tf)
  - Term Frequency – Inverse Document Frequency (tf-idf)
  - Term Frequency – Inverse Layer (tf-il)
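The three weighting functions can be compared side by side in a small sketch. Note the assumptions: tf and tf-idf follow their usual textbook forms, but the slides only name tf-il without defining it, so the version here (term frequency divided by the layer, i.e. the number of arcs at which the feature was reached) is a guess at its intent. All counts are hypothetical.

```python
import math
from typing import Dict

def tf(counts: Dict[str, int]) -> Dict[str, float]:
    """Raw term frequency."""
    return {f: float(c) for f, c in counts.items()}

def tf_idf(
    counts: Dict[str, int], doc_freq: Dict[str, int], n_docs: int
) -> Dict[str, float]:
    """Term frequency scaled by log inverse document frequency."""
    return {f: c * math.log(n_docs / doc_freq[f]) for f, c in counts.items()}

def tf_il(counts: Dict[str, int], layer: Dict[str, int]) -> Dict[str, float]:
    """ASSUMED form of tf-il: down-weight features found at deeper layers."""
    return {f: c / layer[f] for f, c in counts.items()}

# Hypothetical feature counts for one article.
COUNTS = {"Capitals in Europe": 1, "Cities": 2}
LAYER = {"Capitals in Europe": 1, "Cities": 3}      # arcs from the article
DOC_FREQ = {"Capitals in Europe": 2, "Cities": 50}  # articles with the feature
N_DOCS = 100

W_TF = tf(COUNTS)
W_TFIDF = tf_idf(COUNTS, DOC_FREQ, N_DOCS)
W_TFIL = tf_il(COUNTS, LAYER)
```

Under this assumed definition, a broad category reached three arcs away contributes less than a direct category, which is one plausible way to dampen the noise introduced by traversal.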

11 Example

12 Fine Tuning
- Partitioned the ground truth into training and test sets
- Criteria
  - At least 80% precision
  - Maximum possible recall
- Resulting optimal values
  - Category arcs: 3, Template arcs: 3, weighting: tf-il
  - Precision: 87%, F1-measure: 0.696
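The selection rule (keep configurations with at least 80% precision, then take the one with the highest recall) can be sketched as follows. The candidate list is hypothetical and only illustrates the rule. The slide does not report recall directly, but it follows from the reported precision and F1: with precision 0.87 and F1 0.696, recall works out to roughly 0.58.

```python
from typing import Dict, List, Optional

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def recall_from_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to recover R."""
    return f1 * precision / (2 * precision - f1)

def pick_config(results: List[Dict], min_precision: float = 0.8) -> Optional[Dict]:
    """Highest-recall configuration among those meeting the precision floor."""
    eligible = [r for r in results if r["precision"] >= min_precision]
    return max(eligible, key=lambda r: r["recall"]) if eligible else None

# Recall implied by the figures reported on the slide.
IMPLIED_RECALL = recall_from_f1(0.87, 0.696)

# Hypothetical candidate configurations, for illustration only.
CANDIDATES = [
    {"name": "a", "precision": 0.75, "recall": 0.70},
    {"name": "b", "precision": 0.87, "recall": 0.58},
    {"name": "c", "precision": 0.92, "recall": 0.41},
]
BEST = pick_config(CANDIDATES)
```

Configuration "a" has the best recall but fails the precision floor, so the rule picks "b".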

13 SVM Threshold
- SVM outputs confidence with which an article is correctly classified as a member of a category
- Training experiment with 250 Wikipedia articles (1 assessor)
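Thresholding the classifier's confidence is what makes the system tuneable between recall and precision. The sketch below assumes one confidence score per candidate category and a single global threshold; the scores, category names, and threshold values are hypothetical.

```python
from typing import Dict, Optional

def classify(confidences: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most confident category, or None if it is below the threshold."""
    if not confidences:
        return None
    best = max(confidences, key=confidences.get)
    return best if confidences[best] >= threshold else None

# Hypothetical per-category confidence scores for one article.
SCORES = {"location": 0.4, "person": -0.2}

LENIENT = classify(SCORES, threshold=0.0)  # favours recall
STRICT = classify(SCORES, threshold=0.5)   # favours precision
```

Raising the threshold leaves more articles unclassified but makes the remaining decisions more reliable, which is the trade-off between a recall-oriented and a precision-oriented setting.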

14 SVM Threshold

15

16 Summary
- Optimised for recall (ClassTag)
  - 39% of articles classified
  - 664,770 Wikipedia articles
- Optimised for precision (ClassTag+)
  - 21% of articles classified
  - 338,061 Wikipedia articles

17 Comparison with DBpedia
- Experimental setup
  - 300 pooled articles
  - 3 assessors
  - Blind assessments
  - 50 articles overlap
- Partial agreement: 86%
- Total agreement: 78%

18 Results

19 Classification of Flickr Tags
- Tag -> Anchor Text
  - String matching
- Anchor Text -> Wikipedia Article
  - Number of times an anchor refers to a Wikipedia article
- Wikipedia Article -> Category
  - Output of SVM decision
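The three mapping steps on this slide (tag to anchor text, anchor text to article, article to category) can be chained as in the sketch below. It is an illustration under assumptions: the normalisation rule, the link counts, and the category labels are hypothetical, and the article-to-category table stands in for the SVM's output.

```python
from collections import Counter
from typing import Dict, Optional

def normalise(text: str) -> str:
    """Lower-case and drop whitespace so 'George Bush' matches 'georgebush'."""
    return "".join(text.lower().split())

def classify_tag(
    tag: str,
    anchor_counts: Dict[str, Dict[str, int]],
    article_category: Dict[str, str],
) -> Optional[str]:
    key = normalise(tag)
    # Step 1: string-match the tag against anchor texts.
    merged: Counter = Counter()
    for anchor, targets in anchor_counts.items():
        if normalise(anchor) == key:
            merged.update(targets)
    if not merged:
        return None
    # Step 2: pick the article the anchor refers to most often.
    article, _ = merged.most_common(1)[0]
    # Step 3: look up the category assigned to that article.
    return article_category.get(article)

# Hypothetical link counts and classifier output.
ANCHOR_COUNTS = {
    "George Bush": {"George W. Bush": 120, "George Bush Senior": 45},
    "London": {"London": 900, "London, Ontario": 30},
}
ARTICLE_CATEGORY = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "London": "location",
    "London, Ontario": "location",
}

BUSH = classify_tag("georgebush", ANCHOR_COUNTS, ARTICLE_CATEGORY)
LONDON = classify_tag("London", ANCHOR_COUNTS, ARTICLE_CATEGORY)
UNKNOWN = classify_tag("xyzzy", ANCHOR_COUNTS, ARTICLE_CATEGORY)
```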

20 Ambiguity
- Tag -> Anchor Text
  - Some ambiguity because tags are often lower case with no white space
- Anchor Text -> Wikipedia Article
  - 13.4% of Anchor Text -> Wikipedia Article mappings are ambiguous
  - 4% of Anchor Text -> Category mappings are ambiguous
  - Example
    - George Bush -> George W. Bush, George Bush Senior
    - George Bush -> Person
- Wikipedia Article -> Category
  - 5.7% of classified articles result in multiple classifications
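The George Bush example shows why ambiguity drops when moving from articles to categories: an anchor can point to several articles that all share one category. A small sketch of how the two ambiguity rates could be measured follows; the anchors, targets, and category labels are hypothetical toy data, not the figures from the study.

```python
from typing import Dict, List, Tuple

def ambiguity_rates(
    anchor_targets: Dict[str, List[str]], article_category: Dict[str, str]
) -> Tuple[float, float]:
    """Fraction of anchors that are ambiguous at article and at category level."""
    n = len(anchor_targets)
    if n == 0:
        return 0.0, 0.0
    article_amb = 0
    category_amb = 0
    for targets in anchor_targets.values():
        if len(set(targets)) > 1:
            article_amb += 1
        categories = {article_category[a] for a in targets if a in article_category}
        if len(categories) > 1:
            category_amb += 1
    return article_amb / n, category_amb / n

# Hypothetical anchors and classifications.
ANCHOR_TARGETS = {
    "George Bush": ["George W. Bush", "George Bush Senior"],
    "Jaguar": ["Jaguar", "Jaguar Cars"],
    "Paris": ["Paris"],
    "Flickr": ["Flickr"],
}
CATEGORY = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "Jaguar": "animal",
    "Jaguar Cars": "group",
    "Paris": "location",
    "Flickr": "artifact",
}

ARTICLE_RATE, CATEGORY_RATE = ambiguity_rates(ANCHOR_TARGETS, CATEGORY)
```

In this toy set, "George Bush" is ambiguous between two articles but not between categories, so the category-level rate is lower than the article-level rate.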

21 Example

22 Evaluation
- Vocabulary coverage of the WordNet classification extended by 115%
- Taking tag frequency into account
  - ClassTag classified 69.2% of Flickr tags
  - 22% more than the WordNet baseline
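The slide reports coverage in two ways: over the distinct tag vocabulary and over tag occurrences (taking frequency into account). The difference can be sketched as below; the tag counts and the set of classified tags are hypothetical and unrelated to the reported figures.

```python
from typing import Dict, Set, Tuple

def coverage(tag_counts: Dict[str, int], classified: Set[str]) -> Tuple[float, float]:
    """Return (vocabulary coverage, frequency-weighted coverage)."""
    vocab_cov = sum(1 for t in tag_counts if t in classified) / len(tag_counts)
    total = sum(tag_counts.values())
    weighted_cov = sum(c for t, c in tag_counts.items() if t in classified) / total
    return vocab_cov, weighted_cov

# Hypothetical tag frequencies.
TAG_COUNTS = {"london": 500, "wedding": 300, "nikon": 150, "dsc0001": 30, "xyzzy": 20}
CLASSIFIED = {"london", "wedding", "nikon"}

VOCAB_COV, WEIGHTED_COV = coverage(TAG_COUNTS, CLASSIFIED)
```

Because popular tags are more likely to be classified, frequency-weighted coverage is typically higher than vocabulary coverage, as in this toy case (0.6 versus 0.95).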

23 Tag distribution

24 Multilanguage Classification
- 80% of tags are in English, 7% in German and 6% in Dutch
- Some of the unclassified tags may be non-English
- Possible alternate-language classification
  - Run ClassTag using an alternate-language Wikipedia and a corresponding lexicon
  - Translate the English classification using Wikipedia’s interlanguage links
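The second option, carrying the English classification over through interlanguage links, amounts to a dictionary join. The sketch below assumes a mapping from foreign-language article titles to English titles; the titles and categories are hypothetical examples.

```python
from typing import Dict

def translate_classification(
    english_classes: Dict[str, str], interlanguage_links: Dict[str, str]
) -> Dict[str, str]:
    """Give each foreign-language article the category of its linked English article."""
    return {
        foreign: english_classes[english]
        for foreign, english in interlanguage_links.items()
        if english in english_classes
    }

# Hypothetical English classification and German -> English links.
ENGLISH_CLASSES = {"Munich": "location", "Christmas": "time"}
DE_TO_EN = {"München": "Munich", "Weihnachten": "Christmas", "Fernweh": "Fernweh"}

GERMAN_CLASSES = translate_classification(ENGLISH_CLASSES, DE_TO_EN)
```

Articles whose English counterpart was never classified (here "Fernweh") simply stay unclassified.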

25 Contributions
- Classifying open content resources using their structural patterns
- Presenting ClassTag, a system for classifying tags
- ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia

26 Conclusion
- Tuneable system for classifying Wikipedia pages
- ClassTag: nearly 40% of articles classified with a precision of 72%
- ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement)
- Nearly 70% of Flickr tags matched to WordNet categories

