Extracting Keyphrases from Books using Language Modeling Approaches
Rohini U, AOL India R&D, Bangalore, India


1 Extracting Keyphrases from Books using Language Modeling Approaches
Rohini U, AOL India R&D, Bangalore, India (Rohini.uppuluri@corp.aol.com)
Vamshi Ambati, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA (vamshi@cs.cmu.edu)

2 Agenda
- Keyphrase Extraction
- Value addition to Digital Libraries
- Methods of Keyphrase Extraction
- Related Work
- Our Solution

3 What are Keyphrases?
- Keyphrases
  - (Give example)
- Where used?
  - Cataloguing in libraries for IR purposes
  - Quick summarization of documents

4 Why important to ULIB?
- Vast growth in digital content
- More than a million books!
- Short metadata description, useful to the user while reading
- For further processing of books, such as summarization, IR, etc.

5 How do we extract KPs?
- Manual entry
  - Reliable, high-quality outcome
  - But time-consuming and expensive
- Automatic
  - Fast extraction but less reliable
  - No expense at all

6 Automatic techniques for KPE
- Rule based methods
- Heuristics (paragraph beginning, headline, etc.)
- Krulwich & Burkey, etc.
- Using linguistic tools
- Statistical techniques
- Term counts and weighting based methods
- Learn model from training data
- Turney [5], KEA [6], KPSpotter [3], etc.

7 Requirements for a KPE for ULIB
- Automatic identification of keyphrases from chapters of books
- Language independent
- Easily adaptable for different domains
- No training data to learn from
  - Most books in ULIB do not have keywords as part of the metadata

8 Solution Outline
- Language modeling based
- Given n-grams
- Measure informativeness and phraseness
- Score n-grams based on the above measures
- Pick the top K phrases as keyphrases

9 Extracting Keyphrases from Books
Pipeline: Text -> Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases

10 Extracting Keyphrases from Books
Pipeline: Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
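The cleaning step on this slide can be sketched roughly as follows. The slide does not say how cleaning is done, so this is only an assumption: lowercase the text, keep alphabetic tokens, and drop stopwords. The `STOPWORDS` set and the `clean` function are hypothetical, chosen only so that the slide's example is reproduced.

```python
import re

# Hypothetical stopword list (an assumption): just enough words to
# reproduce the example on the slide, not the authors' actual list.
STOPWORDS = {"are", "also", "used", "to", "via", "of", "or"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


sentence = ("Topics are also used to construct user profiles via explicit "
            "specification of interests or automatic analysis of Web pages visited")
print(" ".join(clean(sentence)))
```

A real system would need a much larger stopword list, and a language-independent one to meet the requirements listed earlier.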

11 Extracting Keyphrases from Books
Pipeline: Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Candidate keyphrases: {topics construct user, construct user profiles, user profiles explicit, profiles explicit specification, explicit specification interests, specification interests automatic, interests automatic analysis, automatic analysis web, analysis web pages, web pages visited}
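The candidate set on this slide looks like every window of three consecutive cleaned tokens. A minimal sketch under that assumption (the function name `ngrams` and the default `n=3` are illustrative, not taken from the slides):

```python
def ngrams(tokens, n=3):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


cleaned = ("topics construct user profiles explicit specification "
           "interests automatic analysis web pages visited").split()
candidates = ngrams(cleaned)
for phrase in candidates:
    print(phrase)
```

Twelve cleaned tokens give ten trigrams, which matches the ten scored candidates on the next slide.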

12 Extracting Keyphrases from Books
Pipeline: Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Scored candidates:
profiles explicit specification : 0.0281
explicit specification interests : 0.0281
specification interests automatic : 0.0272
user profiles explicit : 0.0260
construct user profiles : 0.0260
interests automatic analysis : 0.0255
topics construct user : 0.0243
automatic analysis web : 0.0227
web pages visited : 0.0226
analysis web pages : 0.0217
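Picking the top K phrases from a scored list like the one above amounts to a sort by descending score. The sketch below uses the scores shown on this slide as data; the helper name `top_k` and the tie-breaking rule (ties keep their original order) are assumptions.

```python
# Scores copied from the slide.
scores = {
    "profiles explicit specification": 0.0281,
    "explicit specification interests": 0.0281,
    "specification interests automatic": 0.0272,
    "user profiles explicit": 0.0260,
    "construct user profiles": 0.0260,
    "interests automatic analysis": 0.0255,
    "topics construct user": 0.0243,
    "automatic analysis web": 0.0227,
    "web pages visited": 0.0226,
    "analysis web pages": 0.0217,
}


def top_k(scored, k):
    """Return the k highest-scoring phrases; ties keep insertion order."""
    return sorted(scored, key=scored.get, reverse=True)[:k]


print(top_k(scores, 3))
```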

13 Scoring
- Phraseness
  - Measures the degree to which a given n-gram can be considered a phrase
  - Based on co-occurrence of words
  - Example..
- Informativeness
  - Measures how informative a given n-gram is
  - Uninformative n-grams: "there is a", "a lot of", etc.
  - Compares co-occurrence in a general corpus vs. the given text (book)
- Total score
  - Phraseness score + Informativeness score
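Written in the notation of the two scoring slides that follow, the total score of a candidate n-gram w is the unweighted sum of the two pointwise KL terms. This only restates the sum given on the slide; the name score(w) is introduced here for illustration.

```latex
\mathrm{score}(w) =
  \underbrace{\delta_w\!\left(LM_{fg}^{N} \,\|\, LM_{fg}^{1}\right)}_{\text{phraseness}}
  +
  \underbrace{\delta_w\!\left(LM_{fg}^{1} \,\|\, LM_{bg}^{1}\right)}_{\text{informativeness}}
```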

14 Scoring - Phraseness
- Computed by measuring the distance between the unigram model and the N-gram model
- Pointwise KL-divergence (Tomokiyo and Hurst, 2003 [4]): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
- Phraseness measure: $\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
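As a rough illustration of the phraseness measure, the sketch below estimates both foreground models by raw relative frequency from a token list. That estimation choice is an assumption, as are the function names; the slides do not say how the language models are built or smoothed. Under the unigram model, the probability of an n-gram is taken to be the product of its words' unigram probabilities.

```python
import math
from collections import Counter


def pointwise_kl(p, q):
    """delta_w(p || q) = p(w) * log(p(w) / q(w)) for a single item w."""
    return p * math.log(p / q)


def phraseness(ngram, tokens):
    """Pointwise KL between the n-gram model and the unigram model.

    Both models are unsmoothed relative frequencies (an assumption).
    """
    words = ngram.split()
    n = len(words)
    ngram_counts = Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    total = sum(ngram_counts.values())
    if total == 0 or ngram_counts[ngram] == 0:
        return 0.0  # unseen n-gram: p(w) = 0 contributes nothing
    unigram_counts = Counter(tokens)
    p_ngram = ngram_counts[ngram] / total
    p_independent = math.prod(unigram_counts[w] / len(tokens) for w in words)
    return pointwise_kl(p_ngram, p_independent)


toy = "new york is big and new york is old".split()
print(phraseness("new york", toy), phraseness("big and", toy))
```

In this toy text the repeated bigram "new york" scores higher than the one-off "big and", which is the behaviour the measure is meant to capture.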

15 Scoring - Informativeness
- Computed by measuring the distance between the n-gram model from the given data and the n-gram model from general data
- Pointwise KL-divergence (Tomokiyo and Hurst, 2003 [4]): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
- Informativeness measure: $\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
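A comparable sketch for the informativeness measure, using the unigram form shown in the formula: the foreground model comes from the book text and the background model from a general corpus. Add-one smoothing is an assumption made here so that words missing from the background corpus do not cause a division by zero; the function names and the toy corpora are likewise illustrative.

```python
import math
from collections import Counter


def smoothed_prob(word, counts, total, vocab_size):
    """Add-one smoothed unigram probability (smoothing is an assumption)."""
    return (counts[word] + 1) / (total + vocab_size)


def informativeness(ngram, fg_tokens, bg_tokens):
    """Pointwise KL between foreground and background unigram models."""
    fg, bg = Counter(fg_tokens), Counter(bg_tokens)
    vocab_size = len(set(fg) | set(bg))
    words = ngram.split()
    p = math.prod(smoothed_prob(w, fg, len(fg_tokens), vocab_size) for w in words)
    q = math.prod(smoothed_prob(w, bg, len(bg_tokens), vocab_size) for w in words)
    return p * math.log(p / q)


book = "keyphrase extraction uses language models for keyphrase extraction".split()
general = "there is a lot of text and there is a lot of noise".split()
print(informativeness("keyphrase extraction", book, general))
print(informativeness("a lot", book, general))
```

A phrase that is frequent in the book but rare in general text gets a positive score, while a phrase like "a lot", common in general text, scores below zero.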

16 Extracting Keyphrases from Books
Pipeline: Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Scored candidates:
profiles explicit specification : 0.0281
explicit specification interests : 0.0281
specification interests automatic : 0.0272
user profiles explicit : 0.0260
construct user profiles : 0.0260
interests automatic analysis : 0.0255
topics construct user : 0.0243
automatic analysis web : 0.0227
web pages visited : 0.0226
analysis web pages : 0.0217

17 Extracting Keyphrases from Books
Pipeline: Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases
Input text: Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited
After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited
Ranked keyphrases:
profiles explicit specification
explicit specification interests
specification interests automatic
user profiles explicit
construct user profiles
interests automatic analysis
topics construct user
automatic analysis web
web pages visited
analysis web pages

18

19 Conclusions and Future Work
- Discussed benefits of keyphrases in the ULIB context
- Demonstrated the building of a KPE that works for books
- Robust evaluation
  - Building a test set from books in ULIB for generic, robust evaluation of KPE tools
- Are chapters really independent in a book?
  - Revisit the assumption

20 Thank you

21 References
1. Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29(4):433-447, 1993.
2. S.T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management, pages 148-155. ACM Press, 1998.
3. Min Song, Il-Yeol Song, and Xiaohua Hu. KPSpotter: a flexible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 50-53, New York, NY, USA, 2003. ACM Press.
4. Takashi Tomokiyo and Matthew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 33-40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
5. P.D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4):303-336, 2000.
6. I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In E.A. Fox and N. Rowe, editors, Proceedings of Digital Libraries 99: The Fourth ACM Conference on Digital Libraries, pages 254-255. ACM Press, 1999.
7. Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001.

