Improving the Classification of Unknown Documents by Concept Graph. Morteza Mohagheghi, Reza Soltanpour

Improving the Classification of Unknown Documents by Concept Graph
Morteza Mohagheghi, Reza Soltanpour, Azadeh Shakeri
ECE Department, University of Tehran, Tehran, Iran

Agenda
- Problem Definition
- Introduction to Concept Graph
- C.G. Aided Classification
- A Sample Implementation
- Assessment
- Conclusion

Problem Definition
Automatic classification: feature selection on the training set yields representative words for each class (c1, c2, ..., cn), and these words are used to classify the test set.
Implicit assumption: training set ~ test set.
Consequence: the classifier is dependent on the training set.

An Overview of the Solution
Automatic classification: feature selection on the training set yields representative words for each class (c1, c2, ..., cn), and these words are used to classify the test set.
Our assumption: training set ≠ test set.
Feature enrichment: a concept graph is used to enrich the representative words before the test set is classified.

Concept Graph: Definition
Definition: a weighted graph in which the nodes are terms and the edges are the semantic relationships between the terms.
Application: keyword suggestion, query expansion.
Representative vector: the list of the words most related to a specific term in the concept graph.

Representative vector for "Player":
Term          Weight
Coach         .0102
Playground    .0077
Football      .0069
Newspaper     .0056
Club          .0052
Team          .0046
…             …
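The definition above can be illustrated with a small sketch. The class name `ConceptGraph`, its methods, and the choice of an undirected graph are assumptions made for illustration; the slides do not say how the graph is stored. The weights are the ones listed for "Player".

```python
from collections import defaultdict


class ConceptGraph:
    """Weighted graph whose nodes are terms (hypothetical helper class)."""

    def __init__(self):
        # term -> {related term -> weight}
        self._edges = defaultdict(dict)

    def add_edge(self, a, b, weight):
        # Assumption: relatedness is symmetric, so the edge is stored both ways.
        self._edges[a][b] = weight
        self._edges[b][a] = weight

    def representative_vector(self, term, k=5):
        """Return the k terms most related to `term`, strongest first."""
        neighbours = self._edges.get(term, {})
        ranked = sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


graph = ConceptGraph()
for other, weight in [
    ("Coach", 0.0102), ("Playground", 0.0077), ("Football", 0.0069),
    ("Newspaper", 0.0056), ("Club", 0.0052), ("Team", 0.0046),
]:
    graph.add_edge("Player", other, weight)

top3 = graph.representative_vector("Player", k=3)
```

A term that is not in the graph simply yields an empty representative vector.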

Concept Graph: Construction method
- NLP-based methods: accurate but costly
- Statistical methods: language independent and computationally efficient
- Recursive vector creation method: built on a rich corpus, e.g. Wikipedia

Concept Graph Aided method
Training phase:
1. Select the features from the training set (the "base set")
2. Select the top n features for each class
3. Create a rep. vector for each of those top n features
4. Extract the most frequent terms in the vectors
5. Normalize the terms from step 4 and add them to the "base set"
6. Classify the documents from a new resource

A Sample Implementation
- Training set: Hamshahri (166,000 documents)
- Concept graph resource: ISNA (500,000 documents)
- Test set: Keyhan (3700 documents)
- 4 classes: Sports, Economy, Politics, Science

Step 1. Feature Selection
Mutual Information (MI) measures how much information the presence or absence of a term contributes to making the correct classification decision on class c.
Features are selected from the training set (Hamshahri, 166,000 docs), giving one feature list per class: Sports features, Economy features, Politics features, Science features.
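A minimal sketch of MI-based feature scoring, assuming the standard document-count form of mutual information for a term and a class. The function names and the toy counts are hypothetical; the slides do not show how the authors computed MI.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """MI between term presence and class membership, from document counts.

    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class that lack the term
    n00: docs outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00

    def cell(joint, term_marginal, class_marginal):
        # An empty cell contributes nothing (limit of x * log x as x -> 0).
        if joint == 0:
            return 0.0
        return (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))

    with_term, without_term = n11 + n10, n01 + n00
    in_class, out_class = n11 + n01, n10 + n00
    return (cell(n11, with_term, in_class) + cell(n10, with_term, out_class)
            + cell(n01, without_term, in_class) + cell(n00, without_term, out_class))


def top_features(counts_by_term, n):
    """Rank terms by MI for one class and keep the n best."""
    scores = {t: mutual_information(*c) for t, c in counts_by_term.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy counts (hypothetical): "goal" separates the class perfectly, "the" not at all.
toy_counts = {"goal": (50, 0, 0, 50), "the": (25, 25, 25, 25)}
best = top_features(toy_counts, n=1)
```

A term that appears equally often inside and outside the class scores 0, while a term that appears only inside the class scores highest.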

Step 2, 3. Rep. Vector Construction
Step 2: select the top 10 features for each class.
Step 3: extract the representative vector for each term.

Economy features (top 10): Price change, Rena, ChimyDaroo, Chokopars, carton, Document, Sepanta, DarooPakhsh, tire, Lamiran

Economy feature    Candidate words
Price change       Capital, Iran, National, Income, …
Rena               Income, Country, Capital, Iran, …
Document           Income, Country, Capital, Iran, …
Chokopars          Income, Country, Capital, Iran, …
…                  …

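Step 4. Refine the Rep Vectors
Indicator of whether the representative vector of feature f contains term t:

$$I(t, \mathrm{vector}_f) = \begin{cases} 1 & \text{if } \mathrm{vector}_f \text{ contains } t \\ 0 & \text{otherwise} \end{cases}$$

Term frequency in vectors (tfv_t), the number of representative vectors that contain t:

$$\mathrm{tfv}_t = \sum_{f} I(t, \mathrm{vector}_f)$$

Most frequent words in the vectors:
tfv    Term
7      Capital
6      Iran
5      development
4      company
4      industrial
4      economic
3      strategy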
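The counting in step 4 can be sketched as follows. The function name is hypothetical, and the input uses only the candidate words that are visible in the step 3 example, so the resulting counts are illustrative and will not match the slide's table, which was computed over the full vectors.

```python
from collections import Counter


def term_frequency_in_vectors(rep_vectors):
    """Count, for each term, how many representative vectors contain it."""
    tfv = Counter()
    for terms in rep_vectors.values():
        # set(): a vector contributes at most 1 per term (the indicator I).
        tfv.update(set(terms))
    return tfv


# Truncated vectors as visible in the step 3 example (the rest is elided there).
rep_vectors = {
    "Price change": ["Capital", "Iran", "National", "Income"],
    "Rena": ["Income", "Country", "Capital", "Iran"],
    "Document": ["Income", "Country", "Capital", "Iran"],
    "Chokopars": ["Income", "Country", "Capital", "Iran"],
}

tfv = term_frequency_in_vectors(rep_vectors)
# Keep only terms shared by at least 3 vectors (the threshold is an assumption).
frequent_terms = sorted(t for t, count in tfv.items() if count >= 3)
```

Terms that recur across many feature vectors of a class are the candidates that get added to the base set in step 5.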

Step 5, 6. Feature Normalization, Classification
Step 5: normalize the terms from step 4 and add them to the "base set".
Step 6: classify the documents from a new resource.
Multinomial Naive Bayes is the base classifier; P(t_k | c) is the conditional probability of term t_k occurring in class c.
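A minimal sketch of a multinomial Naive Bayes classifier restricted to a feature vocabulary, such as the enriched base set. The function names, the toy documents, and the use of add-one smoothing are assumptions; the slides do not state which smoothing or implementation the authors used.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, vocabulary):
    """docs: list of (tokens, label). Only tokens in `vocabulary` are counted."""
    docs_per_class = Counter(label for _, label in docs)
    term_counts = defaultdict(Counter)
    for tokens, label in docs:
        term_counts[label].update(t for t in tokens if t in vocabulary)

    log_prior = {c: math.log(k / len(docs)) for c, k in docs_per_class.items()}
    log_cond = {}
    for c in docs_per_class:
        # Add-one (Laplace) smoothing, an assumption made for this sketch.
        total = sum(term_counts[c].values()) + len(vocabulary)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / total)
                       for t in vocabulary}
    return log_prior, log_cond


def classify(tokens, log_prior, log_cond):
    """Return the class with the highest log P(c) + sum_k log P(t_k | c)."""
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in tokens if t in log_cond[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy training data (hypothetical).
docs = [
    (["goal", "team", "coach"], "Sports"),
    (["team", "match"], "Sports"),
    (["price", "capital", "income"], "Economy"),
    (["capital", "market"], "Economy"),
]
vocabulary = {"goal", "team", "coach", "match", "price", "capital", "income", "market"}
log_prior, log_cond = train_multinomial_nb(docs, vocabulary)

label_a = classify(["capital", "income", "team"], log_prior, log_cond)
label_b = classify(["coach", "goal"], log_prior, log_cond)
```

Tokens outside the vocabulary are ignored at classification time, which is why enlarging the vocabulary with concept-graph terms can let more test documents be scored.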

Assessment
Performance, without and with enrichment: total recall, total precision, avg. recall, avg. precision.

Recall: unclassified documents
Without enrichment    1219
With enrichment       680
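The effect of enrichment on the unclassified counts can be worked out directly. The sketch below assumes the counts are taken over the 3700-document Keyhan test set described in the sample implementation; the slides do not state the denominator explicitly.

```python
TEST_SET_SIZE = 3700  # assumption: counts are over the Keyhan test set

unclassified_without = 1219
unclassified_with = 680

# Documents that become classifiable once the features are enriched.
newly_classified = unclassified_without - unclassified_with

share_without = round(100 * unclassified_without / TEST_SET_SIZE, 1)
share_with = round(100 * unclassified_with / TEST_SET_SIZE, 1)
relative_reduction = round(100 * newly_classified / unclassified_without, 1)

print(newly_classified, share_without, share_with, relative_reduction)
```

Under that assumption, enrichment cuts the unclassified share of the test set from about 32.9% to about 18.4%, a relative reduction of about 44.2%.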

Assessment
Performance comparison with a Persian classifier (4-gram versus with enrichment): total recall, total precision.

Conclusion and future work
We proposed a classification method that:
- is not dependent on the training set
- improves the classification recall
- has little impact on the performance
- is to some extent language independent

Conclusion and future work
However, there are some subtleties:
- The concept graph suggests very general words
- The normalization phase must be done precisely
- This version of the concept graph works only with single words (e.g. "economic development" is treated as two separate terms)

Conclusion and future work
Future work:
- Implementing the method with several classification and feature selection algorithms
- Studying the negative impact of Farsi language issues on the method (we believe it is small)
- Using a richer corpus (e.g. Farsi Wikipedia) for C.G. construction

Discussion & Questions

Basic Classification Algorithm
Finding the best class for a given document.
Multinomial Naive Bayes is the base classifier; P(t_k | c) is the conditional probability of term t_k occurring in class c.
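The slide's formula did not survive in the transcript. For reference, the standard multinomial Naive Bayes decision rule, written in log form, is shown below. This is the textbook formulation and is assumed, not confirmed, to be the one on the slide; here n_d is the number of tokens in document d and C is the set of classes.

```latex
c_{\mathrm{map}} = \operatorname*{arg\,max}_{c \in C}
  \left[ \log \hat{P}(c) + \sum_{k=1}^{n_d} \log \hat{P}(t_k \mid c) \right]
```

The log form avoids floating-point underflow when many small probabilities are multiplied.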

Feature selection
The features are extracted with mutual information (MI).
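The MI formula itself was not preserved in the transcript. The standard definition for a term t and a class c is given below, where U indicates whether a document contains t and C indicates whether it belongs to c. This is assumed to match the slide, since the wording of step 1 follows the usual textbook definition.

```latex
I(U; C) = \sum_{e_t \in \{1, 0\}} \sum_{e_c \in \{1, 0\}}
  P(U = e_t, C = e_c) \,
  \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}
```

In practice the probabilities are estimated from document counts in the training set.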