By: Asef Poormasoomi. Supervisor: Dr. Kahani. Autumn 2010, Ferdowsi University of Mashhad.

Introduction
Summary: a brief but accurate representation of the contents of a document.

Is this the best we can do?

Motivation
- Abstracts for scientific and other articles
- News summarization (mostly multi-document summarization)
- Classification of articles and other written data
- Web pages for search engines
- Web access from PDAs and cell phones
- Question answering and data gathering

Genres
- Extract vs. abstract: lists fragments of text vs. re-phrases content coherently. Example: "He ate banana, orange and apple" => "He ate fruit".
- Generic vs. query-oriented: provides the author's view vs. reflects the user's interest. Example: a question answering system.
- Personal vs. general: considers the reader's prior knowledge vs. addresses a general audience.
- Single-document vs. multi-document source: based on one text vs. fuses together many texts.
- Input: text, video, image, map

Methods
- Statistical scoring methods (pseudo)
- Higher semantic/syntactic structures
- Network (graph) based methods
- Semantic based methods (LSA, ontology, WordNet)
- Other methods (rhetorical analysis, lexical chains, co-reference chains)
- AI methods

Statistical scoring (pseudo)
General method:
1. Score each entity (sentence, word).
2. Combine the scores.
3. Choose the best sentence(s).
Scoring techniques:
- Word frequencies throughout the text (Luhn 58)
- Position in the text (Edmundson 69, Lin & Hovy 97)
- Title method (Edmundson 69)
- Cue phrases in sentences (Edmundson 69)
- Bayesian classifier (Kupiec et al. 95)
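
The three-step general method can be sketched in a few lines of Python. This is only an illustration, not the scoring used in any of the cited papers: the stop-word list, the weights `w_freq` and `w_pos`, and the linear position score are all assumptions chosen for brevity, and only two of the listed features (word frequency and position) are combined.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "it", "on", "for"}

def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def score_sentences(sentences, w_freq=1.0, w_pos=0.5):
    """Steps 1 and 2: score each sentence per feature, then combine linearly."""
    freqs = Counter(w for s in sentences for w in tokenize(s))
    n = len(sentences)
    scores = []
    for i, s in enumerate(sentences):
        words = tokenize(s)
        # Word-frequency feature: average document frequency of the words.
        freq_score = sum(freqs[w] for w in words) / len(words) if words else 0.0
        # Position feature: earlier sentences score higher.
        pos_score = 1.0 - i / n
        scores.append(w_freq * freq_score + w_pos * pos_score)
    return scores

def summarize(sentences, k=1):
    """Step 3: pick the k best sentences, returned in document order."""
    scores = score_sentences(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

A title or cue-phrase feature would be added the same way, as one more weighted term in the combined score.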

Methods
- Statistical scoring methods. Problems:
  - Synonymy: one concept can be expressed by different words. Example: "cycle" and "bicycle" refer to the same kind of vehicle.
  - Polysemy: one word or concept can have several meanings. Example: "cycle" could mean "life cycle" or "bicycle".
  - Phrases: a phrase may have a meaning different from the words in it. Example: an alleged murderer is not a murderer (Lin and Hovy 1997).
- Higher semantic/syntactic structures
- Network (graph) based methods
- Other methods (rhetorical analysis, lexical chains, co-reference chains)
- AI methods

LSI based summarization (Gong, 2001)
- Build the term-sentence matrix
- Apply SVD to the term-sentence matrix
Problem: TF-ISF weighting cannot represent context and relations correctly.
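
The two steps can be illustrated with NumPy (a third-party library, assumed to be installed). The sketch below uses raw term counts rather than TF-ISF weights, and selects, for each of the top `k` right singular vectors, the sentence with the largest entry. Taking the absolute value of the entries is an assumption made here to sidestep the sign ambiguity of the SVD; the function names are hypothetical.

```python
import numpy as np

def term_sentence_matrix(sentences):
    """Build a term-by-sentence count matrix (rows: terms, columns: sentences)."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(sentences)))
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            A[index[w], j] += 1.0
    return A, vocab

def lsa_select(sentences, k=2):
    """Pick one sentence per leading latent concept."""
    A, _ = term_sentence_matrix(sentences)
    # Rows of Vt are right singular vectors: one weight per sentence.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    chosen = []
    for row in Vt[:k]:
        # Assumption: use |entry| because singular vectors are sign-ambiguous.
        for j in np.argsort(-np.abs(row)):
            if int(j) not in chosen:
                chosen.append(int(j))
                break
    return [sentences[j] for j in sorted(chosen)]
```

With two clearly separated topics in the input, `lsa_select(sentences, k=2)` returns one sentence from each topic.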

Proposed Approach
- Preprocessing (tokenizing, stop-word removal, stemming)
- Extract context (use LSA on the term-document matrix)
- Extract perspective (SRL and WordNet)
- Summary generation
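
A minimal sketch of the preprocessing stage follows. The stop-word list and `naive_stem` are deliberately crude stand-ins (assumptions for illustration only); a real system would use a full stop-word list and a proper stemmer such as the Porter stemmer.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, far shorter than a real one).
STOPWORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for", "on"}

def naive_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, remove stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

def term_document_matrix(documents):
    """Return (vocab, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(preprocess(d)) for d in documents]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix
```

The resulting matrix plays the role of the term-document matrix that the later LSA step factorizes.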

Proposed Approach
Preprocessing
- Tokenize and remove stop words
- Stem and build the term-document matrix A
Extract context
- Apply SVD to A and use the matrix U (term-concept)
- Calculate the cosine distance between concepts and documents
- Calculate the cosine distance between sentences and the concept of each topic
- Rank sentences
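
One possible reading of the context-extraction steps is sketched below with NumPy (third-party, assumed installed). The slides do not give the exact procedure, so several choices here are assumptions: the topic is represented by the sum of its document vectors, the concept is the column of U closest to that vector, and absolute cosine values are used because singular vectors are sign-ambiguous. The function names are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def rank_sentences(A, S):
    """A: term-by-document counts for one topic.
    S: term-by-sentence counts using the same term order as A.
    Returns (sentence indices from best to worst, per-sentence scores)."""
    # Columns of U are concepts expressed in term space.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    # Assumption: represent the topic by the sum of its document vectors.
    topic = A.sum(axis=1)
    # Concept closest to the topic's documents.
    best = max(range(U.shape[1]), key=lambda i: abs(cosine(U[:, i], topic)))
    concept = U[:, best]
    # Score each sentence by its similarity to that concept.
    scores = [abs(cosine(concept, S[:, j])) for j in range(S.shape[1])]
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order, scores
```

Sentences that use the topic's dominant terms rank first, while a sentence with no terms from the topic's concept scores close to zero.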

Proposed Approach
Extract perspective
- Use SRL and WordNet for sentence similarity
- Cosine distance problem:
  S1 = United States Army, successfully tested an anti-missile defense system.
  S2 = U.S. military projectile interceptor, streaked into space and hit the target.
  S3 = Iran's weekend test of a long-range missile underscored the need for a U.S. national missile defense system.
- Semantic similarity, using the semantic roles of S1:
  United States Army = subject; successfully = AM-MNR; tested = verb; an anti-missile defense system = object
Summary generation
- Remove redundancy and rank sentences
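
The cosine distance problem can be made concrete with a plain bag-of-words comparison of the three example sentences. This sketch assumes the point of the slide is that word overlap misjudges these pairs; the tokenization (lowercased alphabetic runs, no stemming or stop-word removal) is a simplification chosen for illustration.

```python
import math
import re
from collections import Counter

S1 = "United States Army, successfully tested an anti-missile defense system."
S2 = "U.S. military projectile interceptor, streaked into space and hit the target."
S3 = ("Iran's weekend test of a long-range missile underscored the need "
      "for a U.S. national missile defense system.")

def bag_of_words(sentence):
    """Count lowercased alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

sim_12 = cosine(bag_of_words(S1), bag_of_words(S2))  # no shared words
sim_13 = cosine(bag_of_words(S1), bag_of_words(S3))  # shares missile, defense, system
```

S1 and S2 share no words, so their cosine similarity is zero, while S1 and S3 overlap on "missile", "defense" and "system" and therefore score higher. A similarity measure based on semantic roles and WordNet relations is meant to close this gap.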

Evaluation Tools & Summarization Systems
- ROUGE: Recall-Oriented Understudy for Gisting Evaluation
  Types: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, ROUGE-SU
- MEAD: http://www.summarization.com/mead (Chinese, English, Japanese, Dutch)
- DMSumm: http://www.icmc.usp.br/~taspardo/DMSumm.htm (Portuguese, English)
- SweSum (Martin Hassel): http://swesum.nada.kth.se/index-eng.html (English, German, Italian, Spanish, Greek, ...)
- FarsiSum (Nima Mazdak, Martin Hassel): http://swesum.nada.kth.se/index-eng.html
- Other systems: SUMMARIST, PERSIVAL, GLEANS, SumUM, RIPTIDES, NTT, GISTSumm, GISTexter, DiaSumm, NeATS
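
To show what ROUGE-N recall measures, here is a small sketch of the basic definition: the number of reference n-grams that also occur in the candidate summary (with clipped counts), divided by the total number of reference n-grams. It is not the official ROUGE toolkit, and it omits stemming, stop-word handling and jackknifing; whitespace tokenization is an assumption.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=2):
    """Matched reference n-grams over total reference n-grams."""
    cand = ngrams(candidate.lower().split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.lower().split(), n)
        # Clip each match by how often the n-gram occurs in the candidate.
        matched += sum(min(c, cand[g]) for g, c in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0
```

For example, the candidate "the army tested a missile defense system" covers 4 of the 5 bigrams of the reference "the army tested a defense system", giving a ROUGE-2 recall of 0.8.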

References
[1] I. Mani. Automatic summarization. John Benjamins Publishing Company, 2001.
[2] Yeh, J. Y., Ke, H. R., Yang, W. P., & Meng, I. H. Text summarization using a trainable summarizer and latent semantic analysis. Information Processing and Management, 41, 75-95, 2005.
[3] Gong, Y., & Liu, X. Generic text summarization using relevance measure and latent semantic analysis. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR'01, New Orleans, 2001.
[4] Steinberger, J., Kabadjov, M. A., Poesio, M., & Sanchez-Graillet, O. Improving LSA-based summarization with anaphora resolution. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005.
[5] Yu, H. News summarization based on semantic similarity measure. Ninth International Conference on Hybrid Intelligent Systems, vol. 1, pp. 180-183, 2009.
[6] C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. Latent semantic indexing: A probabilistic analysis. J. Comput. Syst. Sci., 61(2):217-235, 2000.
[7] C.-Y. Lin and E. Hovy. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of HLT-NAACL, 2003.
[8] Nomoto, T., & Matsumoto, Y. A new approach to unsupervised text summarization. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval, SIGIR'01, New Orleans, Louisiana, United States, 2001.
[9] J. Lee, S. Park, C. Ahn, D. Kim. Automatic generic document summarization based on non-negative matrix factorization. Information Processing and Management, 2008.
[10] Steinberger, J., Poesio, M., Kabadjov, M. A., & Ježek, K. Two uses of anaphora resolution in summarization. Information Processing and Management: an International Journal, vol. 43, November 2007.
[11] D. Wang, T. Li, S. Zhu, C. Ding. Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization. SIGIR'08, July 2008, Singapore.
[12] V. Gupta, G. S. Lehal. A Survey of Text Summarization Extractive Techniques. Journal of Emerging Technologies in Web Intelligence, August 2010.

thanks

Dataset Specifications
- Source: Document Understanding Conferences (DUC 2007 main task)
- Corpus: AQUAINT corpus, consisting of Associated Press and New York Times (1998-2000) and Xinhua News Agency (1996-2000)
- 45 topics, 25 documents in each topic, 1125 documents in total
- Terms: 531174 terms; 262225 terms without stop words; 20057 terms after stemming and without stop words
- Evaluation measures: ROUGE-2 and ROUGE-SU4
- 32 summarization systems
- Each topic has 4 human summaries. Ten NIST assessors wrote summaries for the 45 topics in the DUC 2007 main task.

Experimental Result: recall on ROUGE-2 (average result on 3 topics)

Experimental Result: recall on ROUGE-SU4 (average result on 3 topics)

The Best ...
Topic: US missile defense system

Words  Result   Systems                    Evaluation
288    0.14273  1-2-3-4-8-9-16-20-21-29    ROUGE-2
250    0.13599  15                         ROUGE-2
328    0.21631  1-3-4-8-9-14-16-22-28-29   ROUGE-SU4
250    0.18899  15                         ROUGE-SU4
