Evgeniy Gabrilovich and Shaul Markovitch. American Association for Artificial Intelligence, 2006. Prepared by Qi Li.

1 Evgeniy Gabrilovich and Shaul Markovitch. American Association for Artificial Intelligence, 2006. Prepared by Qi Li

2
• Introduction
• Feature Generation with Wikipedia
  ◦ Wikipedia as a knowledge repository
  ◦ Feature construction
  ◦ Feature generator design
  ◦ Using the link structure
• Empirical Evaluation
  ◦ Implementation details
  ◦ Experimental methodology
  ◦ The effect of feature generation
  ◦ Classifying short documents
• Conclusions and Future Work

3
• Text categorization
  ◦ Deals with the automatic assignment of category labels to natural language documents
  ◦ Documents are represented as bags of words (BOW)
  ◦ Features are derived from words
  ◦ Categorization is based on these features
  ◦ Limitation of BOW:
    ▪ Relies on individual word occurrences in the training set
    ▪ "Wal-Mart supply chain goes real time"
    ▪ "Wal-Mart manages its stock with RFID technology"
• Effective for categorization tasks of medium difficulty, but performs poorly on small categories or short documents
• Use an encyclopedia to endow the machine with the breadth of knowledge available to humans
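The BOW limitation on this slide can be illustrated with a minimal Python sketch. It assumes the two Wal-Mart headlines are meant as related documents that share almost no vocabulary; the tiny stop-word list and the whitespace tokenizer are illustrative assumptions, not part of the original presentation.

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"its", "with", "the", "a", "of"}


def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, and count tokens."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


doc_a = bag_of_words("Wal-Mart supply chain goes real time")
doc_b = bag_of_words("Wal-Mart manages its stock with RFID technology")

# Words the two bags have in common, and their Jaccard overlap.
shared = set(doc_a) & set(doc_b)
jaccard = len(shared) / len(set(doc_a) | set(doc_b))

print("shared words:", shared)
print("Jaccard overlap:", jaccard)
```

Apart from the company name, the two bags are disjoint, so a classifier that sees only word occurrences has little evidence that both headlines concern the same topic.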

4
• Auxiliary text classifier:
  ◦ Matches documents with the most relevant Wikipedia articles
  ◦ Conventional bag of words + new features
• Example of the auxiliary text classifier idea:
  ◦ “Bernanke takes charge”
  ◦ BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …
• Using Wikipedia
  ◦ Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document
  ◦ Leverage the knowledge gained from these articles

5
• Extend the representation of documents for text categorization with knowledge concepts relevant to the document text.
• Wikipedia
  ◦ Largest knowledge repository
  ◦ Large-scale hierarchies
  ◦ High-quality, standard written English
  ◦ …

6
• Receive a text fragment and map it to the most relevant Wikipedia articles
  ◦ E.g. "Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge"
  ◦ ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS
• Training documents -> features -> Wikipedia concepts -> augmented bag of words

7
• Unit for feature generation?
  ◦ Word, sentence, paragraph, document?
• Multi-resolution approach
  ◦ Features are generated for:
    ▪ Individual words
    ▪ Sentences
    ▪ Paragraphs
    ▪ The entire document
  ◦ Polysemous words are mapped to the concepts that correspond to the sense shared by the context words
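A rough sketch of how the contexts for the multi-resolution approach might be enumerated is given here. The splitting rules (blank lines for paragraphs, end punctuation for sentences) and the function name `multi_resolution_contexts` are assumptions made for illustration; the slides do not specify how the text is segmented.

```python
import re


def multi_resolution_contexts(document):
    """Split a document into words, sentences, paragraphs, and the whole text.

    Each returned unit would be passed separately to the feature generator.
    """
    # Paragraphs: assumed to be separated by blank lines.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    # Sentences: naive split after '.', '!' or '?' followed by whitespace.
    sentences = [
        s.strip()
        for p in paragraphs
        for s in re.split(r"(?<=[.!?])\s+", p)
        if s.strip()
    ]
    # Words: runs of letters, digits, apostrophes, or hyphens.
    words = [w for s in sentences for w in re.findall(r"[A-Za-z0-9'-]+", s)]
    return {
        "words": words,
        "sentences": sentences,
        "paragraphs": paragraphs,
        "document": [document.strip()],
    }


text = "Jaguar unveiled a car. It has a V12 engine.\n\nThe jaguar is a big cat."
contexts = multi_resolution_contexts(text)
for level, units in contexts.items():
    print(level, len(units))
```

Generating features at several levels lets a short context such as a single sentence pick the right sense of an ambiguous word, while the whole document still contributes its overall topic.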

8
• For “jaguar car models”, the Wikipedia-based feature generator returns:
  ◦ JAGUAR (CAR)
  ◦ DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar)
  ◦ V12 (Jaguar’s engine)
  ◦ JAGUAR E-TYPE
  ◦ JAGUAR XJ
• For “jaguar Panthera onca”, it returns:
  ◦ JAGUAR
  ◦ FELIDAE (feline species family)
  ◦ Related felines such as LEOPARD, PUMA and BLACK PANTHER, as well as KINKAJOU
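The following Python sketch shows one way a text fragment could be matched to concepts so that context words disambiguate "jaguar". It is a toy under stated assumptions: the three-article `ARTICLES` dictionary is invented, and the smoothed IDF and cosine ranking are generic choices rather than the authors' exact feature generator.

```python
import math
from collections import Counter

# Hypothetical miniature "encyclopedia": concept title -> article text.
ARTICLES = {
    "JAGUAR (CAR)": "jaguar british car maker luxury car models engine",
    "JAGUAR": "jaguar big cat panthera onca feline americas",
    "LEOPARD": "leopard big cat panthera pardus feline africa asia",
}


def build_concept_vectors(articles):
    """Represent each concept as a TF-IDF weighted term vector."""
    counts = {c: Counter(text.split()) for c, text in articles.items()}
    doc_freq = Counter(term for tf in counts.values() for term in tf)
    n = len(articles)
    # Smoothed IDF so that terms found in every article keep a positive weight.
    idf = {term: math.log(n / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {
        c: {t: f * idf[t] for t, f in tf.items()} for c, tf in counts.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def generate_features(fragment, vectors, idf, top_k=2):
    """Return the top_k concept titles most similar to the text fragment."""
    tf = Counter(fragment.lower().split())
    query = {t: f * idf[t] for t, f in tf.items() if t in idf}
    ranked = sorted(
        vectors, key=lambda c: cosine(query, vectors[c]), reverse=True
    )
    return ranked[:top_k]


vectors, idf = build_concept_vectors(ARTICLES)
car_concepts = generate_features("jaguar car models", vectors, idf)
cat_concepts = generate_features("jaguar Panthera onca", vectors, idf)
print(car_concepts)
print(cat_concepts)

# Augmented representation: original words plus the generated concepts.
augmented = Counter("jaguar car models".split()) + Counter(car_concepts)
```

In this toy setup the same word "jaguar" is mapped to the car article or the animal article depending on the surrounding words, and the returned concept titles are simply added to the bag of words as extra features.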

9
• A set of simple heuristics for pruning the set of Wikipedia concepts:
  ◦ Discard articles:
    ▪ With fewer than 100 non-stop words (too short)
    ▪ With fewer than 5 incoming and outgoing links
    ▪ That are disambiguation pages
  ◦ Each concept is represented as an attribute vector whose entries are weighted using TF.IDF
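The pruning heuristics on this slide could be written as a simple filter, sketched below. The `Article` record and its field names are hypothetical, and the link threshold is interpreted here as the combined number of incoming and outgoing links, which is an assumption because the slide wording is ambiguous.

```python
from dataclasses import dataclass

MIN_NON_STOP_WORDS = 100  # threshold stated on the slide
MIN_LINKS = 5             # threshold stated on the slide


@dataclass
class Article:
    """Hypothetical record holding the statistics the heuristics need."""
    title: str
    non_stop_words: int
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool = False


def keep_concept(article):
    """Return True if the article survives all three pruning heuristics."""
    if article.is_disambiguation:
        return False
    if article.non_stop_words < MIN_NON_STOP_WORDS:
        return False
    # Assumption: the link threshold applies to incoming + outgoing links.
    if article.incoming_links + article.outgoing_links < MIN_LINKS:
        return False
    return True


candidates = [
    Article("JAGUAR", 850, 120, 60),
    Article("JAGUAR (DISAMBIGUATION)", 140, 30, 25, is_disambiguation=True),
    Article("TINY STUB", 40, 10, 3),
    Article("ORPHAN PAGE", 300, 1, 2),
]
kept = [a.title for a in candidates if keep_concept(a)]
print(kept)
```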

10
• Link anchor text:
  ◦ May be identical to the canonical name of the target article
  ◦ Different anchor texts can refer to the same article: alternative names, variant spellings, and related phrases
  ◦ Incoming links: indicate the significance of an article
  ◦ Problem: taking all articles pointed to from a concept is ill-advised, as it brings in a lot of weakly related material
  ◦ This direction is left for future work

11
• Wikipedia snapshot: November 5, 2005
• 1.8 GB of text in 910,989 articles
  ◦ After removing small and overly specific concepts, 171,332 articles remain
  ◦ Stop words and rare words removed
  ◦ Remaining words stemmed
  ◦ 296,157 distinct terms used to represent concepts

12
• 1. Reuters-21578
• 2. Reuters Corpus Volume I (RCV1)
• 3. OHSUMED
• 4. 20 Newsgroups (20NG)
• 5. Movie Reviews (Movies)
• Method: SVM with a linear kernel
• Metrics:
  ◦ Precision-recall break-even point (BEP)
  ◦ Reuters and OHSUMED: micro- and macro-averaged BEP
  ◦ 20NG and Movies: 4-fold cross-validation
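To make the metrics concrete, here is a small Python sketch of the precision-recall break-even point and its micro- and macro-averages. It assumes the common shortcut of cutting each ranked list at the number of true positives, where precision and recall coincide; the scores and labels are invented and the function names are illustrative, not taken from the paper.

```python
def true_positives_at_break_even(scores, labels):
    """Count true positives among the top-R documents, R = number of positives.

    With exactly R documents predicted positive, precision equals recall.
    """
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    return sum(label for _, label in ranked[:n_pos]), n_pos


def micro_macro_bep(per_category):
    """Return (micro-averaged BEP, macro-averaged BEP) over categories."""
    tp_total, pos_total, beps = 0, 0, []
    for scores, labels in per_category.values():
        tp, n_pos = true_positives_at_break_even(scores, labels)
        tp_total += tp
        pos_total += n_pos
        beps.append(tp / n_pos if n_pos else 0.0)
    micro = tp_total / pos_total if pos_total else 0.0  # pools all decisions
    macro = sum(beps) / len(beps) if beps else 0.0      # mean over categories
    return micro, macro


# Invented classifier scores and gold labels (1 = document is in the category).
results = {
    "large_category": ([0.9, 0.8, 0.7, 0.6, 0.2], [1, 1, 0, 1, 0]),
    "small_category": ([0.9, 0.4, 0.3, 0.2, 0.1], [0, 1, 0, 0, 0]),
}
micro, macro = micro_macro_bep(results)
print("micro BEP:", micro)
print("macro BEP:", macro)
```

Macro-averaging gives every category equal weight, so a failure on a small category lowers it more than it lowers the micro-average, which is dominated by the larger categories.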

13 Larger improvement; more effective in small categories

14 Only the titles of the articles are used for classification

15
• Feature generator:
  ◦ Identifies the most relevant encyclopedia articles
  ◦ Creates new features
• Adds semantics to the conventional BOW
  ◦ Latent semantic indexing (LSI)
  ◦ LSI + SVM: not good
  ◦ Wikipedia + SVM: improves results
• Information retrieval
