Classifying Tags Using Open Content Resources Simon Overell, Borkur Sigurbjornsson & Roelof van Zwol WSDM ‘09.

Motivation
- Classify tags in Flickr into broad categories such as what, where, when and who
- Easier indexing and navigation
- WordNet is usually used for classification but has limited coverage

Example

The ClassTag System

Classifying Wikipedia Articles
- Using only metadata (i.e. Categories and Templates), giving high scalability
- Supervised Classifier
  - Articles as objects
  - WordNet noun semantic categories as classification classes
  - Categories and Templates as features
  - Support Vector Machine (SVM) as classifier
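To make the "Categories and Templates as features" idea concrete, here is a minimal sketch of turning article metadata into binary feature vectors that any supervised classifier (the slides use an SVM) could consume. The article data, the `cat:`/`tpl:` prefixes and all function names are illustrative assumptions, not taken from the presentation.

```python
from typing import Dict, List

# Hypothetical toy metadata: only categories and templates, no article text.
ARTICLES = {
    "Eiffel Tower": {
        "categories": ["Towers in Paris"],
        "templates": ["Infobox building"],
    },
    "Albert Einstein": {
        "categories": ["German physicists"],
        "templates": ["Infobox scientist"],
    },
}


def feature_names(article: Dict[str, List[str]]) -> List[str]:
    """Prefix categories and templates so the two name spaces cannot collide."""
    names = {f"cat:{c}" for c in article["categories"]}
    names |= {f"tpl:{t}" for t in article["templates"]}
    return sorted(names)


def build_vocabulary(articles: Dict[str, Dict[str, List[str]]]) -> Dict[str, int]:
    """Map every feature seen in the collection to a column index."""
    all_names = sorted({n for a in articles.values() for n in feature_names(a)})
    return {name: i for i, name in enumerate(all_names)}


def to_vector(article: Dict[str, List[str]], vocab: Dict[str, int]) -> List[int]:
    """Binary bag-of-features vector; unseen features are ignored."""
    vec = [0] * len(vocab)
    for name in feature_names(article):
        if name in vocab:
            vec[vocab[name]] = 1
    return vec
```

Each vector would be paired with the WordNet noun category of the matching noun to form a training example; the classifier itself is left out of this sketch.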

Categories and Templates

Supervised Classification
- Ground Truth
  - All Wikipedia articles that match WordNet nouns
- Data Sparsity
  - WordNet categories underrepresented (10 out of 25)
  - Articles have very few features

Reducing Data Sparsity
- Using category and template network transclusion
- … but noise is added
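One way to picture the sparsity reduction is as a bounded walk up the category network: an article inherits the parents of its categories, up to a fixed number of arcs. The sketch below assumes a simple child-to-parents mapping and counts the article's own categories as layer 1; the graph and all names are hypothetical, and the slides do not say how the traversal was implemented.

```python
from collections import deque
from typing import Dict, Iterable, List

# Hypothetical fragment of a category network (child -> parent categories).
CATEGORY_PARENTS = {
    "Towers in Paris": ["Towers in France", "Buildings in Paris"],
    "Towers in France": ["Towers by country"],
    "Buildings in Paris": ["Paris"],
    "Towers by country": ["Towers"],
}


def expand_features(
    start: Iterable[str], parents: Dict[str, List[str]], max_arcs: int
) -> Dict[str, int]:
    """Breadth-first walk; returns {category: layer}, keeping the smallest layer."""
    layers: Dict[str, int] = {}
    queue = deque((c, 1) for c in start)
    while queue:
        node, layer = queue.popleft()
        if layer > max_arcs or node in layers:
            continue  # too far away, or already reached by a shorter path
        layers[node] = layer
        for parent in parents.get(node, []):
            queue.append((parent, layer + 1))
    return layers
```

A larger `max_arcs` gives each article more features but also pulls in very general categories, which matches the slide's remark that noise is added.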

System Optimization
- Number of arcs traversed in
  - Category network
  - Template network
- Choice of weighting function
  - Term Frequency (tf)
  - Term Frequency – Inverse Document Frequency (tf-idf)
  - Term Frequency – Inverse Layer (tf-il)
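The three weighting options can be sketched as small functions. tf and tf-idf follow their standard definitions. The slides do not define tf-il, so the version here (dividing the frequency by the number of arcs between the article and the feature) is only one plausible reading and should be treated as an assumption.

```python
import math


def tf(count: int) -> float:
    """Term frequency: how often the feature occurs for the article."""
    return float(count)


def tf_idf(count: int, n_articles: int, n_articles_with_feature: int) -> float:
    """Standard tf-idf: features shared by many articles are down-weighted."""
    return count * math.log(n_articles / n_articles_with_feature)


def tf_il(count: int, layer: int) -> float:
    """ASSUMED form of tf-il: features reached over more arcs count less.

    `layer` is the number of arcs from the article to the feature (>= 1).
    The presentation does not give the formula; this is an illustration only.
    """
    return count / layer
```

Under this reading, a category found three arcs away contributes a third of the weight of one attached directly to the article.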

Example

Fine Tuning
- Partitioned the ground truth into training and test sets
- Criteria
  - At least 80% precision
  - Maximum possible recall
- Resulting optimal values
  - Category arcs: 3, Template arcs: 3, TF-IL
  - Precision: 87%, F1-Measure: 0.696
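The selection rule (keep configurations with at least 80% precision, then take the one with the highest recall) can be sketched as below. The grid rows are made-up numbers for illustration, not results from the presentation. The helper `implied_recall` assumes the reported F1 is the usual harmonic mean; under that assumption, a precision of 0.87 and an F1 of 0.696 correspond to a recall of about 0.58.

```python
from typing import Dict, List, Optional


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1_score: float) -> float:
    """Solve F1 = 2PR / (P + R) for R."""
    return f1_score * precision / (2 * precision - f1_score)


def pick_config(results: List[Dict], min_precision: float = 0.80) -> Optional[Dict]:
    """Among configurations meeting the precision floor, maximise recall."""
    eligible = [r for r in results if r["precision"] >= min_precision]
    if not eligible:
        return None
    return max(eligible, key=lambda r: r["recall"])


# Hypothetical grid-search outcomes (NOT from the slides).
RESULTS = [
    {"category_arcs": 1, "template_arcs": 1, "weighting": "tf", "precision": 0.92, "recall": 0.35},
    {"category_arcs": 3, "template_arcs": 3, "weighting": "tf-il", "precision": 0.85, "recall": 0.55},
    {"category_arcs": 5, "template_arcs": 5, "weighting": "tf-idf", "precision": 0.78, "recall": 0.70},
]
```

In this toy grid the third row has the best recall but is rejected by the precision floor, so `pick_config(RESULTS)` returns the second row.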

SVM Threshold
- SVM outputs the confidence with which an article is correctly classified as a member of a category
- Training experiment with 250 Wikipedia articles (1 assessor)

SVM Threshold

Summary
- Optimised for Recall (ClassTag)
  - 39% of articles classified
  - 664,770 Wikipedia articles
- Optimised for Precision (ClassTag+)
  - 21% of articles classified
  - 338,061 Wikipedia articles

Comparison with DBpedia
- Experimental Setup
  - 300 pooled articles
  - 3 assessors
  - Blind assessments
  - 50 articles overlap
- Partial Agreement: 86%
- Total Agreement: 78%

Results

Classification of Flickr Tags
- Tag -> Anchor Text
  - String matching
- Anchor Text -> Wikipedia Article
  - Number of times an anchor refers to a Wikipedia article
- Wikipedia Article -> Category
  - Output of SVM decision
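The three mapping steps can be chained in a few lines. This sketch assumes two precomputed lookup tables: how often each anchor text links to each article, and the category already assigned to each article. The table contents, the normalisation rule and the function names are illustrative assumptions.

```python
from typing import Dict, Optional

# Hypothetical counts: anchor text -> {article: number of links using that anchor}.
ANCHOR_COUNTS = {
    "George Bush": {"George W. Bush": 120, "George Bush Senior": 45},
    "Paris": {"Paris": 900},
}

# Hypothetical classifier output: article -> category.
ARTICLE_CATEGORY = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "Paris": "location",
}


def normalise(text: str) -> str:
    """Lower-case and drop non-alphanumerics, since tags often lack spaces."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def classify_tag(
    tag: str,
    anchor_counts: Dict[str, Dict[str, int]],
    article_category: Dict[str, str],
) -> Optional[str]:
    key = normalise(tag)
    for anchor, counts in anchor_counts.items():
        if normalise(anchor) == key:                  # Tag -> Anchor Text
            article = max(counts, key=counts.get)     # Anchor Text -> Article
            return article_category.get(article)      # Article -> Category
    return None
```

For example, the tag `georgebush` matches the anchor "George Bush", resolves to the most frequently linked article, and comes back as `person`; a tag with no matching anchor returns `None`.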

Ambiguity
- Tag -> Anchor Text
  - Some ambiguity because tags are often lower case with no white space
- Anchor Text -> Wikipedia Article
  - 13.4% of Anchor Text -> Wikipedia Article mappings are ambiguous
  - 4% of Anchor Text -> Category mappings are ambiguous
  - Example
    - George Bush -> George W. Bush, George Bush Senior
    - George Bush -> Person
- Wikipedia Article -> Category
  - 5.7% of classified articles result in multiple classifications
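The George Bush example shows why category-level ambiguity is lower than article-level ambiguity: several candidate articles can still share one category. A minimal sketch of measuring both rates follows, assuming an anchor counts as ambiguous when it maps to more than one distinct target; the data and names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical anchor -> candidate articles.
ANCHOR_TO_ARTICLES = {
    "George Bush": ["George W. Bush", "George Bush Senior"],
    "Paris": ["Paris", "Paris Hilton"],
    "London": ["London"],
    "Flickr": ["Flickr"],
}

# Hypothetical article -> category.
CATEGORY_OF = {
    "George W. Bush": "person",
    "George Bush Senior": "person",
    "Paris": "location",
    "Paris Hilton": "person",
    "London": "location",
    "Flickr": "group",
}


def ambiguity_rates(
    anchor_to_articles: Dict[str, List[str]], category_of: Dict[str, str]
) -> Tuple[float, float]:
    """Return (share of anchors with >1 article, share with >1 category)."""
    n = len(anchor_to_articles)
    if n == 0:
        return 0.0, 0.0
    article_ambiguous = 0
    category_ambiguous = 0
    for articles in anchor_to_articles.values():
        if len(set(articles)) > 1:
            article_ambiguous += 1
        categories = {category_of[a] for a in articles if a in category_of}
        if len(categories) > 1:
            category_ambiguous += 1
    return article_ambiguous / n, category_ambiguous / n
```

In this toy data half of the anchors are ambiguous at the article level, but only "Paris" remains ambiguous once articles are collapsed to categories.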

Example

Evaluation
- Vocabulary coverage of the WordNet classification extended by 115%
- Taking tag frequency into account
  - ClassTag classified 69.2% of Flickr tags
  - 22% more than the WordNet baseline
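The difference between vocabulary coverage and frequency-weighted coverage can be illustrated with a short sketch: the first counts each distinct tag once, the second counts every use of a tag. The tag counts below are invented for illustration and the function name is an assumption.

```python
from typing import Dict, Set, Tuple


def coverage(tag_counts: Dict[str, int], classified: Set[str]) -> Tuple[float, float]:
    """Return (share of distinct tags classified, share of tag uses classified)."""
    if not tag_counts:
        return 0.0, 0.0
    distinct = sum(1 for t in tag_counts if t in classified) / len(tag_counts)
    total_uses = sum(tag_counts.values())
    weighted = sum(c for t, c in tag_counts.items() if t in classified) / total_uses
    return distinct, weighted


# Hypothetical tag frequencies: a few popular tags and a long tail.
TAG_COUNTS = {"london": 500, "wedding": 300, "dsc0042": 150, "xyzzy": 50}
CLASSIFIED = {"london", "wedding"}
```

Here only half of the distinct tags are classified, yet they account for 80% of tag uses, because popular tags are more likely to be covered.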

Tag distribution

Multilanguage Classification
- 80% of tags are in English, 7% in German and 6% in Dutch
- A portion of the unclassified tags may fall into the non-English group
- Possible alternate language classification
  - Run ClassTag using an alternate-language Wikipedia and a corresponding lexicon
  - Translate the English classification using Wikipedia's interlanguage links
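The second option, carrying the English classification over through interlanguage links, amounts to a dictionary lookup per article. The sketch below assumes the links are available as a mapping from English title to titles in other languages; the sample links, categories and function name are illustrative only.

```python
from typing import Dict

# Hypothetical interlanguage links: English title -> {language code: title}.
INTERLANGUAGE_LINKS = {
    "Munich": {"de": "München", "nl": "München"},
    "Cologne": {"de": "Köln", "nl": "Keulen"},
    "Christmas": {"de": "Weihnachten"},
}

# Hypothetical English classification: title -> category.
ENGLISH_CLASSIFICATION = {
    "Munich": "location",
    "Cologne": "location",
    "Christmas": "time",
    "Flickr": "group",  # no interlanguage links in this toy data
}


def translate_classification(
    classification: Dict[str, str],
    links: Dict[str, Dict[str, str]],
    lang: str,
) -> Dict[str, str]:
    """Copy each category to the linked title in `lang`, where a link exists."""
    translated: Dict[str, str] = {}
    for title, category in classification.items():
        target = links.get(title, {}).get(lang)
        if target is not None:
            translated[target] = category
    return translated
```

Articles without a link in the target language are simply skipped, so coverage in the other language can only be as good as the interlanguage links allow.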

Contributions
- Classifying open content resources using their structural patterns
- Presenting ClassTag, a system for classifying tags
- ClassTag extends the WordNet lexicon using the structural patterns of Wikipedia

Conclusion
- Tuneable system for classifying Wikipedia pages
- ClassTag: nearly 40% of articles classified with a precision of 72%
- ClassTag+: 21% of articles classified with a precision of 86% (equal to assessor agreement)
- Nearly 70% of Flickr tags matched to WordNet categories