Authors: Yutaka Matsuo & Mitsuru Ishizuka. Designed by CProDM Team.

Presentation transcript:


Outline
- Introduction
- Study Algorithm
- Algorithm Implement
- Evaluation

Introduction
Discard stop words -> Stem -> Extract frequency -> Select frequent terms -> Clustering -> Expected probability -> Calculate χ'² value -> Output

Study Algorithm: Preprocessing
Goal:
- Remove unnecessary words from the document.
- Get terms that are candidate keywords.
Stop words: function words such as "and", "the", and "of", or other words with minimal lexical meaning.
Stem: remove suffixes from words.

Study Algorithm: Preprocessing (discard stop words)
Original text: It might be urged that when playing the "imitation game" the best strategy for the machine may possibly be something other than imitation of the behaviour of a man. This may be, but I think it is unlikely that there is any great effect of this kind. In any case there is no intention to investigate here the theory of the game, and it will be assumed that the best strategy is to try to provide answers that would naturally be given by a man.
After discarding stop words: urged playing "imitation game" best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.

Study Algorithm: Preprocessing (stem)
Before stemming: urged playing "imitation game" best strategy machine possibly imitation behaviour man think unlikely great effect kind. case intention investigate theory game, assumed best strategy try provide answers naturally given man.
After stemming: urge play "imitation game" best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.

Study Algorithm: Preprocessing (extract frequency)
Stemmed text: urge play "imitation game" best strategi machine possible imitation behaviour man think unlike great effect kind. case intention investigate theory game, assum best strategi try provide answers natural give man.
Frequent terms: imitation, best, strategi, man.
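The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a tiny assumed sample, and `naive_stem` is a hypothetical suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "is", "to", "it", "that", "be", "for", "in"}

def naive_stem(word):
    """Strip a few common suffixes (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Lowercase, tokenize, discard stop words, then stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def term_frequencies(sentences):
    """Count how often each stemmed term occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(preprocess(sentence))
    return counts
```

Calling `term_frequencies` on the sentences of a document gives the counts from which the frequent terms are later selected.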

Study Algorithm: Term Co-occurrence and Importance
The top ten frequent terms (denoted as G) and their probability of occurrence, normalized so that the sum is 1.

Study Algorithm: Term Co-occurrence and Importance
Two terms in a sentence are considered to co-occur once.

Study Algorithm: Term Co-occurrence and Importance
Co-occurrence probability distribution of some terms and the frequent terms.
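A minimal sketch of sentence-level co-occurrence counting, assuming each sentence has already been reduced to a list of stemmed terms. The function names are illustrative. Each pair of distinct terms is counted once per sentence, even if a term is repeated within that sentence.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """For each unordered pair of distinct terms, count the sentences containing both."""
    pairs = Counter()
    for terms in sentences:
        # set() ensures a pair co-occurs at most once per sentence.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs

def freq(pairs, w, g):
    """Co-occurrence frequency of w and g, independent of argument order."""
    return pairs[tuple(sorted((w, g)))]
```

Dividing `freq(pairs, w, g)` by the sum over all frequent terms g gives the co-occurrence probability distribution of w.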

Study Algorithm: Term Co-occurrence and Importance
The statistical value of χ² is defined as
$$\chi^2(w) = \sum_{g \in G} \frac{\left(\mathrm{freq}(w, g) - n_w p_g\right)^2}{n_w p_g}$$
where
- p_g: unconditional probability of a frequent term g ∈ G (the expected probability)
- n_w: the total number of co-occurrences of term w with the frequent terms G
- freq(w, g): frequency of co-occurrence of term w and term g
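A short sketch of the χ² computation, assuming the usual form of the statistic: observed frequency freq(w, g) compared against expected frequency n_w · p_g, summed over the frequent terms. The dictionary-based inputs are an illustrative choice.

```python
def chi_square(freq_w, p, n_w):
    """Chi-square bias of term w against the frequent terms.

    freq_w: dict mapping g -> freq(w, g)
    p:      dict mapping g -> p_g (expected probability)
    n_w:    total number of co-occurrences of w with the frequent terms
    """
    total = 0.0
    for g, p_g in p.items():
        expected = n_w * p_g
        if expected == 0:
            continue  # skip terms with no expected co-occurrence
        observed = freq_w.get(g, 0)
        total += (observed - expected) ** 2 / expected
    return total
```

A term whose co-occurrence matches the expected distribution scores near 0, while a term biased toward particular frequent terms scores high.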

Study Algorithm Term Co-occurrence and Importance

Study Algorithm: Algorithm improvement
If a term appears in a long sentence, it is likely to co-occur with many terms; if a term appears in a short sentence, it is less likely to co-occur with other terms. We consider the length of each sentence and revise our definitions:
- p_g: (the sum of the total number of terms in sentences where g appears) divided by (the total number of terms in the document)
- n_w: the total number of terms in the sentences where w appears, including w

Study Algorithm: Algorithm improvement
The following function measures the robustness of bias values: [formula missing from the transcript]
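A minimal sketch of one robust variant, under the assumption that the function subtracts the single largest contribution from the χ² sum, so that a term co-occurring with only one frequent term does not score highly. That form is an assumption here, since the formula itself is not preserved in this transcript, and the name `chi_square_robust` is illustrative.

```python
def chi_square_robust(freq_w, p, n_w):
    """Chi-square sum minus its largest single term (assumed robust variant).

    freq_w: dict mapping g -> freq(w, g)
    p:      dict mapping g -> p_g
    n_w:    total number of terms in sentences where w appears
    """
    contributions = []
    for g, p_g in p.items():
        expected = n_w * p_g
        if expected > 0:
            observed = freq_w.get(g, 0)
            contributions.append((observed - expected) ** 2 / expected)
    # Removing the maximum discounts bias that comes from one frequent term only.
    return sum(contributions) - max(contributions, default=0.0)
```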

Study Algorithm: Algorithm improvement
To improve the quality of extracted keywords, we cluster terms. Two major approaches (Hofmann & Puzicha 1998) are:
- Similarity-based clustering: if terms w1 and w2 have a similar distribution of co-occurrence with other terms, w1 and w2 are considered to be in the same cluster.
- Pairwise clustering: if terms w1 and w2 co-occur frequently, w1 and w2 are considered to be in the same cluster.

Study Algorithm: Algorithm improvement
In the figure, similarity-based clustering centers on the red circles and pairwise clustering focuses on the yellow circles.

Study Algorithm: Algorithm improvement (similarity-based clustering)
Cluster a pair of terms whose Jensen-Shannon divergence is [threshold and formula missing from the transcript].
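For reference, a sketch of the standard Jensen-Shannon divergence between the co-occurrence distributions of two terms, using natural logarithms. The threshold used for clustering is not preserved in this transcript, so it is left to the caller. Distributions are assumed to be dictionaries mapping a term to a probability.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (dicts)."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence KL(a || m); zero-probability entries contribute 0.
        return sum(a[k] * math.log(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

The value ranges from 0 for identical distributions to log 2 for distributions with no overlap.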

Study Algorithm: Algorithm improvement (pairwise clustering)
Cluster a pair of terms whose mutual information is [threshold and formula missing from the transcript].
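A sketch of mutual information between two terms, assuming the pointwise form log(P(w1, w2) / (P(w1) P(w2))) estimated from counts. The exact definition and threshold on the slide are not preserved in this transcript, so both the form and the parameter names below are assumptions.

```python
import math

def pointwise_mi(n_joint, n_w1, n_w2, n_total):
    """Pointwise mutual information of two terms, estimated from counts.

    n_joint: number of co-occurrences of w1 and w2
    n_w1:    number of occurrences of w1
    n_w2:    number of occurrences of w2
    n_total: total number of observations
    """
    if n_joint == 0:
        return float("-inf")  # the terms never co-occur
    p_joint = n_joint / n_total
    p_w1 = n_w1 / n_total
    p_w2 = n_w2 / n_total
    return math.log(p_joint / (p_w1 * p_w2))
```

A value above 0 means the pair co-occurs more often than independence would predict.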

Study Algorithm Algorithm improvement

Algorithm Implement

Algorithm Implement, Step 1: Preprocessing
- Discard stop words
- Stem
- Extract frequency

Algorithm Implement, Step 2: Selection of frequent terms
- Count the number of terms in the document (N_total).
- Select the top frequent terms, up to 30% of the number of running terms, as a standard set of terms.
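One possible reading of this step, sketched below: terms are taken in descending order of frequency until their combined occurrences would exceed 30% of N_total. This interpretation of "up to 30% of the number of running terms" is an assumption, as are the function and parameter names.

```python
from collections import Counter

def select_frequent_terms(term_counts, ratio=0.3):
    """Pick the most frequent terms whose combined count stays within ratio * N_total.

    term_counts: Counter mapping term -> frequency in the document
    Returns (selected_terms, n_total).
    """
    n_total = sum(term_counts.values())
    budget = ratio * n_total
    selected, covered = [], 0
    for term, count in term_counts.most_common():
        if covered + count > budget:
            break  # adding this term would exceed the 30% budget
        selected.append(term)
        covered += count
    return selected, n_total
```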

Algorithm Implement, Step 3: Clustering frequent terms
- Similarity-based clustering
- Pairwise clustering

Algorithm Implement, Step 4: Calculate expected probability
Count the number of terms co-occurring with c ∈ C, denoted as n_c, to yield the expected probability p_c = n_c / N_total.
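A minimal sketch of this step, assuming n_c is the total number of terms in the sentences where any member of cluster c appears (the sentence-length-aware definition) and that it is divided by the total number of terms in the document. Representing clusters as named sets of terms is an illustrative choice.

```python
def expected_probabilities(sentences, clusters):
    """Expected probability p_c = n_c / N_total for each cluster c.

    sentences: list of sentences, each a list of terms
    clusters:  dict mapping cluster name -> set of member terms
    """
    n_total = sum(len(sentence) for sentence in sentences)
    probabilities = {}
    for name, members in clusters.items():
        # n_c: total length of the sentences containing at least one member of c
        n_c = sum(len(sentence) for sentence in sentences if members & set(sentence))
        probabilities[name] = n_c / n_total
    return probabilities
```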

Algorithm Implement, Step 5: Calculate χ'² value
The calculation for a term w uses:
- freq(w, c): the co-occurrence frequency of w with c ∈ C
- n_w: the total number of terms in the sentences including w

Algorithm Implement, Step 6: Output keywords

Evaluation

In this paper, we developed an algorithm to extract keywords from a single document. The main advantages of our method are its simplicity, since it does not require a corpus, and its high performance, which is comparable to the tf-idf algorithm. As more electronic documents become available, we believe our method will be useful in many applications, especially for domain-independent keyword extraction.

Thank you for your attention.