Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

1 Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi (m.mohaqeqi@ece.ut.ac.ir) Reza Soltanpour (rsoltanpoor@yahoo.com) Azadeh Shakeri (shakery@ece.ut.ac.ir) ECE Department, University of Tehran, Tehran, Iran.

2 Agenda
- Problem Definition
- Introduction to Concept Graph
- C.G. Aided Classification
- A Sample Implementation
- Assessment
- Conclusion

3 Problem Definition
Diagram components: Training set; Feature selection; Representative words for each class (c1, c2, ..., cn); Automatic classification; Test set.
Implicit assumption: Training set ~ Test set.
Diagram note: dependent on the training set.

4 An Overview of the Solution
Diagram components: Training set; Feature selection; Representative words for each class (c1, c2, ..., cn); Automatic classification; Test set; Concept Graph; Feature Enrichment.
Our assumption: Training set ≠ Test set.

5 Agenda: Problem Definition, Concept Graph, C.G. Aided Classification, A Sample Implementation, Assessment, Conclusion

6 Concept Graph: Definition
Definition: A weighted graph in which the nodes are terms and the edges are the semantic relationships between the terms.
Application: keyword suggestion, query expansion.
Representative Vector: The list of the words most related to a specific term in the concept graph.
Example (representative vector for "Player"):
Term          Weight
Coach         0.0102
Playground    0.0077
Football      0.0069
Newspaper     0.0056
Club          0.0052
Team          0.0046
…             …
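A minimal Python sketch of the representative-vector idea, assuming the concept graph is stored as a dict of dicts that maps each term to its weighted neighbours. The storage layout and the helper name `representative_vector` are hypothetical; only the weights for "player" come from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical storage: term -> {neighbour term: edge weight}.
ConceptGraph = Dict[str, Dict[str, float]]


def representative_vector(
    graph: ConceptGraph, term: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k neighbours of `term` with the largest edge weights."""
    neighbours = graph.get(term, {})
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Weights taken from the "Player" example on the slide.
graph: ConceptGraph = {
    "player": {
        "coach": 0.0102,
        "playground": 0.0077,
        "football": 0.0069,
        "newspaper": 0.0056,
        "club": 0.0052,
        "team": 0.0046,
    }
}

top3 = representative_vector(graph, "player", k=3)
print(top3)
```

A term that is absent from the graph simply yields an empty vector, which keeps later steps from failing on unseen features.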

7 Concept Graph: Construction Method
- NLP-based methods: accurate but costly
- Statistical methods: language independent, computationally efficient
- Recursive vector creation method: built on a rich corpus, e.g. Wikipedia

8 Agenda: Problem Definition, Concept Graph, C.G. Aided Classification, A Sample Implementation, Assessment, Conclusion

9 An Overview of the Solution
Diagram components: Training set; Feature selection; Representative words for each class (c1, c2, ..., cn); Automatic classification; Test set; Concept Graph; Feature Enrichment.
Our assumption: Training set ≠ Test set.

10 Concept Graph Aided Method
Steps (the diagram labels part of this pipeline "Training Phase"):
1. Select the features from the training set (the "base set").
2. Select the top n features for each class.
3. Create a representative vector for each of those top n features.
4. Extract the most frequent terms in the vectors.
5. Normalize the terms from step 4 and add them to the "base set".
6. Classify the documents from a new resource.

11 A Sample Implementation
- Training set: Hamshahri, 1997-2002 (166,000 documents)
- Concept Graph resource: ISNA, 1997-2007 (500,000 documents)
- Test set: Keyhan, 2007-2008 (3,700 documents)
- 4 classes: Sports, Economy, Politics, Science

12 Step 1. Feature Selection
Mutual Information (MI) measures how much information the presence or absence of a term contributes to making the correct classification decision on a class c.
Feature selection from the training set: Hamshahri, 1997-2002 (166,000 docs).
Selected features: Sports features, Economy features, Politics features, Science features.
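The slide describes MI only in words, so the following Python sketch uses the standard textbook estimate from document counts; that it matches the authors' exact computation is an assumption. The function name and the count parameters (`n11`, `n10`, `n01`, `n00`) are illustrative.

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """MI (in bits) between term presence and class membership.

    n11: documents that contain the term and belong to the class
    n10: documents that contain the term but are outside the class
    n01: documents without the term that belong to the class
    n00: documents without the term and outside the class
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term-marginal count, class-marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
            mi += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return mi


# A term spread evenly across classes carries no information;
# a term that occurs in exactly the documents of one class carries one bit here.
independent = mutual_information(25, 25, 25, 25)
perfect = mutual_information(50, 0, 0, 50)
print(independent, perfect)
```

Ranking all terms of a class by this score and keeping the highest ones gives a per-class feature list of the kind the slide refers to.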

13 Steps 2 and 3. Rep. Vector Construction
Step 2: Select the top 10 features for each class.
Economy features: Price change, Rena, ChimyDaroo, Chokopars, carton, Document, Sepanta, DarooPakhsh, tire, Lamiran.
Step 3: Extract the representative vector for each term (candidate words):
Price change: Capital, Iran, National, Income, …
Rena: Income, Country, Capital, Iran, …
Document: Income, Country, Capital, Iran, …
Chokopars: Income, Country, Capital, Iran, …
…

14 Step 4. Refine the Rep. Vectors
Indicator function: $I(t, \text{vector}_f) = \begin{cases} 1 & \text{if vector } f \text{ contains } t \\ 0 & \text{otherwise} \end{cases}$
Term frequency in vectors ($tfv_t$): (formula not captured in this transcript)
Most frequent words in the vectors:
tfv   term
7     Capital
6     Iran
5     development
4     company
4     industrial
4     economic
3     strategy
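The formula for the term frequency in vectors is not in the transcript. A natural reading of the indicator I(t, vector f) is that tfv for a term t is the sum of I(t, vector f) over all representative vectors f, i.e. the number of vectors that contain t. The Python sketch below implements that reading as an assumption; the helper name, the toy vectors, and the threshold of 2 are illustrative and not taken from the slides.

```python
from collections import Counter
from typing import Dict, Iterable


def term_frequency_in_vectors(vectors: Dict[str, Iterable[str]]) -> Counter:
    """Count, for each term, how many representative vectors contain it."""
    tfv: Counter = Counter()
    for terms in vectors.values():
        # set(): a vector contributes at most 1 per term, like the indicator I.
        tfv.update(set(terms))
    return tfv


# Toy vectors loosely modelled on the Economy example; contents are illustrative.
vectors = {
    "price change": ["capital", "iran", "national", "income"],
    "rena": ["income", "country", "capital", "iran"],
    "document": ["income", "country", "capital", "iran"],
}

tfv = term_frequency_in_vectors(vectors)
# Keep terms shared by at least two vectors as enrichment candidates.
candidates = sorted(t for t, count in tfv.items() if count >= 2)
print(tfv)
print(candidates)
```

Terms that recur across many vectors of the same class are the ones worth adding, since a word related to only one seed feature is more likely to be noise.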

15 Steps 5 and 6. Feature Normalization, Classification
Step 5: Normalize the terms from step 4 and add them to the "base set".
Step 6: Classify the documents from a new resource.
Base classifier: Multinomial Naive Bayes, in which $P(t_k \mid c)$ is the conditional probability of occurrence of term $t_k$ in class $c$.
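A minimal Python sketch of a Multinomial Naive Bayes classifier restricted to a feature vocabulary, to illustrate where an enriched "base set" would plug in. Add-one smoothing, the function names, and the toy documents are assumptions; the slides do not say how the authors estimated P(t_k | c).

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def train_mnb(docs: Iterable[Tuple[str, List[str]]], vocab: Set[str]):
    """Estimate log priors and add-one-smoothed log P(t|c) over `vocab`."""
    class_docs: Counter = Counter()
    term_counts: Dict[str, Counter] = defaultdict(Counter)
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(t for t in tokens if t in vocab)
    total_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / total_docs) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        denom = sum(term_counts[c].values()) + len(vocab)
        log_cond[c] = {t: math.log((term_counts[c][t] + 1) / denom) for t in vocab}
    return log_prior, log_cond


def classify(tokens: List[str], log_prior, log_cond) -> str:
    """Return the class maximizing log P(c) + sum_k log P(t_k|c)."""
    def score(c: str) -> float:
        # Tokens outside the feature vocabulary are ignored.
        return log_prior[c] + sum(log_cond[c][t] for t in tokens if t in log_cond[c])
    return max(log_prior, key=score)


# Toy training data; the vocabulary plays the role of the (enriched) base set.
train = [
    ("sports", ["team", "coach", "club"]),
    ("sports", ["player", "team"]),
    ("economy", ["capital", "income", "company"]),
    ("economy", ["capital", "industrial"]),
]
vocab = {t for _, tokens in train for t in tokens}
log_prior, log_cond = train_mnb(train, vocab)
print(classify(["coach", "team"], log_prior, log_cond))
```

Because tokens outside the vocabulary are skipped, a test document that shares no features with the training set gets no evidence at all, which is the gap that adding concept-graph terms to the vocabulary is meant to narrow.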

16 Assessment
Performance:
                     Total recall   Total precision   Avg. recall   Avg. precision
Without enrichment   0.52           0.78              0.49          0.70
With enrichment      0.64           0.78              0.57          0.71
Recall (unclassified documents):
Without enrichment   1219
With enrichment      680
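To put the unclassified-document counts in proportion, the short Python calculation below relates them to the size of the test set. It assumes the counts refer to the 3,700-document Keyhan test set from the sample implementation, which the slide does not state explicitly; the variable names are illustrative.

```python
# Figures from the slides. Using the 3,700 Keyhan documents as the
# denominator for the unclassified counts is an assumption.
TEST_SET_SIZE = 3700
unclassified_without = 1219
unclassified_with = 680

share_without = unclassified_without / TEST_SET_SIZE
share_with = unclassified_with / TEST_SET_SIZE
relative_drop = (unclassified_without - unclassified_with) / unclassified_without

print(f"unclassified without enrichment: {share_without:.1%}")
print(f"unclassified with enrichment:    {share_with:.1%}")
print(f"relative reduction:              {relative_drop:.1%}")
```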

17 Assessment
Performance comparison with a Persian classifier:
                  Total recall   Total precision
4-gram            0.78           0.68
With enrichment   0.64           0.78

18 Conclusion and future work
We proposed a classification method that:
- is not dependent on the training set
- improves the classification recall
- has little impact on the performance
- is somewhat language independent

19 Conclusion and future work
However, there are some subtleties:
- The concept graph suggests very general words.
- The normalization phase must be done precisely.
- This version of the concept graph works only with single words (e.g. "economic development" is treated as two separate terms).

20 Conclusion and future work
Future work:
- Implementing the method with several classification and feature selection algorithms
- Studying the negative impact of Farsi language problems on the method (we believe this impact is small)
- Using a richer corpus (e.g. Farsi Wikipedia) for C.G. construction

21 Discussion & Question

22 Basic Classification Algorithm
Finding the best class for a given document.
Multinomial Naive Bayes as the base, in which $P(t_k \mid c)$ is the conditional probability of occurrence of term $t_k$ in class $c$.
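The decision rule itself is not in the transcript. For reference, the usual log-space form of the Multinomial Naive Bayes rule is given below, on the assumption that the slide showed this standard formulation. Here $C$ is the set of classes, $n_d$ is the number of tokens in the document, and the hats mark estimates from the training set; these symbols are not taken from the slide.

```latex
c_{\text{map}} = \arg\max_{c \in C} \left[ \log \hat{P}(c) + \sum_{k=1}^{n_d} \log \hat{P}(t_k \mid c) \right]
```

Working with sums of logarithms rather than products of probabilities avoids floating-point underflow on long documents and does not change which class wins.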

23 Feature Selection
Extracting the features using MI (mutual information).
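The MI formula is not in the transcript. The standard definition for a term and a class is reproduced below, on the assumption that it matches what the slide showed. $U$ is a random variable that is 1 when the document contains the term and 0 otherwise, and $C$ is 1 when the document belongs to the class and 0 otherwise; the symbol names are not taken from the slide.

```latex
I(U; C) = \sum_{e_t \in \{1, 0\}} \sum_{e_c \in \{1, 0\}}
  P(U = e_t, C = e_c) \,
  \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)}
```

The value is zero when the term's distribution is the same inside and outside the class, and it grows as the term's presence or absence becomes more predictive of class membership.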
