Single Document Key phrase Extraction Using Neighborhood Knowledge

Introduction Key phrases are words, or groups of words, that capture the key ideas of a document. They represent important information about a document and constitute an alternative, or a complement, to full-text indexing. Appropriate key phrases can serve as a highly condensed summary of a document; they can be used as a label to supplement or replace the title or summary; or they can be highlighted within the body of the document to help users browse and read quickly. Key phrase extraction has a wide range of applications: document key phrases have been successfully used in IR and NLP tasks such as document indexing, document classification, document clustering, and document summarization. Key phrases are usually assigned manually by authors, especially for journal or conference articles, but since document collections are growing rapidly, automating the identification of key phrases is necessary.

EXISTING SYSTEM Existing methods conduct the key phrase extraction task using only the information contained in the specified document, including a phrase's TFIDF, position, and other syntactic information. One major drawback of these methods is the assumption that documents are independent of each other: key phrase extraction is conducted separately for each document, with no interaction between documents. In reality, most documents are correlated; topic-related documents have mutual influences and contain useful clues that can help extract key phrases from one another. Existing systems do not take this neighborhood information into account, even though it could be of great help in key phrase extraction.

The methods for key phrase (or keyword) extraction can be roughly categorized as either unsupervised or supervised. Unsupervised methods usually assign a score to each candidate phrase by considering various features such as syntactic clues, use of italics, presence in section headers, and use of acronyms. Supervised machine learning algorithms have been proposed to classify each candidate phrase as a key phrase or not; the most important features for this classification are the frequency and location of the phrase in the document. Both kinds of methods make use of only the information contained in the specified document.

Proposed System In the proposed system, we first find the neighbor documents of the given document d0 for which key phrases are to be identified. The neighbor documents are topically close to the specified document, and they construct the neighborhood knowledge context for it: d0 is expanded to a small document set D, which provides more knowledge and clues for key phrase extraction from d0. Once the document set is constructed, the proposed system applies a graph-based algorithm that incorporates both the word relationships in d0 and the word relationships in the neighbor documents.

ALGORITHM The algorithm comprises the following two phases:
1. Neighborhood Construction: expand the specified document d0 to a small document set D = {d0, d1, d2, ..., dk} by adding k neighbor documents. The neighbors d1, d2, ..., dk can be obtained using document similarity search techniques.
2. Key phrase Extraction: given document d0 and the expanded document set D, perform the following steps to extract key phrases for d0:
a) Neighborhood-level Word Evaluation: score the candidate words across the neighborhood by building a global affinity graph.
b) Document-level Key phrase Extraction: evaluate the candidate phrases in d0 based on the scores of the words contained in the phrases.

Neighborhood Construction The neighbor documents can be obtained with document similarity search, which finds the documents most similar to a query document in a text corpus and returns them to the user as a ranked list. In the proposed algorithm, the similarity between any two documents di and dj is measured by the cosine similarity sim(di, dj) = (di · dj) / (|di| × |dj|), where di and dj are document term vectors. We compute sim(d0, dj) for the documents in the corpus and add the top k documents to d0 to form the document set D.

sim(d0, dp) serves as a confidence value associated with every document dp in the expanded document set; it reflects our belief that the document is sampled from the same underlying model as the specified document. When a document is close to the specified one, the confidence value is high; when it is farther apart, the confidence value is lower.
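As a concrete illustration, here is a minimal Python sketch of neighborhood construction. It assumes documents are represented as sparse term-weight vectors (plain dicts); the helper names cosine_sim and find_neighbors are illustrative, not from the paper.

```python
import math

def cosine_sim(di, dj):
    """sim(di, dj) = (di . dj) / (|di| * |dj|) over dict-based term vectors."""
    dot = sum(w * dj.get(t, 0.0) for t, w in di.items())
    norm_i = math.sqrt(sum(w * w for w in di.values()))
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)

def find_neighbors(d0, corpus, k):
    """Rank corpus documents by sim(d0, d) and keep the top k as neighbors.

    Returns (confidence, document) pairs; the confidence sim(d0, dp) is
    reused later when building the global affinity graph.
    """
    scored = sorted(((cosine_sim(d0, d), d) for d in corpus),
                    key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```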

Key phrase Extraction A) Neighborhood-Level Word Evaluation The graph-based ranking algorithm used here is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of "voting" or "recommendation" between the vertices: a link between two vertices is a vote cast from one vertex to the other, and the score of a vertex is determined both by the votes cast for it and by the scores of the vertices casting those votes. Given the expanded document set D, we build an undirected graph called the Global Affinity Graph. Let G = (V, E) be this graph, reflecting the relationships between words in the document set: V is the set of vertices, and each vertex is a candidate word in the document set.

Each edge eij in E is associated with an affinity weight aff(vi, vj) between words vi and vj. The weight is computed from the co-occurrence relation between the two words, controlled by the distance between word occurrences: aff(vi, vj) = Σ_{dp ∈ D} sim(d0, dp) × count_dp(vi, vj), where count_dp(vi, vj) is the count of the controlled co-occurrences between words vi and vj in document dp. We use an affinity matrix M to describe G, with each entry corresponding to the weight of an edge: M = (Mi,j)|V|×|V|, where Mi,j = aff(vi, vj) if vi links with vj and i ≠ j, and Mi,j = 0 otherwise. After normalizing M to M̃, the score of each word vi is computed with the iterative formula WordScore(vi) = μ × Σ_{j ≠ i} WordScore(vj) × M̃j,i + (1 − μ)/|V|.

Here |V| is the number of words (vertices) and μ is the damping factor, usually set to 0.85; the scores of all words are initially set to 1. Convergence of the iterative algorithm is usually reached when the difference between the scores computed at two successive iterations falls below a given small threshold for every word.
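The two formulas above can be turned into a short sketch. This assumes each document in D is given as an ordered list of its candidate words, that "controlled co-occurrence" means co-occurrence within a fixed window, and that M is row-normalized before propagation; the window size, the 1e-4 threshold, and the helper names are all illustrative assumptions.

```python
from collections import defaultdict

def build_affinity(neighborhood, window=10):
    """aff(vi, vj) = sum over dp in D of sim(d0, dp) * count_dp(vi, vj).

    neighborhood: (confidence, word list) pairs for every document in D,
    with d0 itself included at confidence 1.0. Co-occurrence is counted
    within a sliding window (an assumed reading of "controlled").
    """
    aff = defaultdict(float)
    vocab = set()
    for conf, words in neighborhood:
        vocab.update(words)
        for i, vi in enumerate(words):
            for vj in words[i + 1:i + window]:
                if vi != vj:
                    aff[(vi, vj)] += conf
                    aff[(vj, vi)] += conf
    return aff, sorted(vocab)

def word_scores(aff, vocab, mu=0.85, eps=1e-4):
    """Iterate WordScore(vi) = mu * sum_j WordScore(vj) * M~(j,i) + (1-mu)/|V|
    until no score changes by more than eps (1e-4 is an illustrative choice)."""
    out = defaultdict(float)            # row sums, used to normalize M into M~
    for (vi, _), w in aff.items():
        out[vi] += w
    score = dict.fromkeys(vocab, 1.0)   # all scores start at 1
    while True:
        new = {vi: mu * sum(score[vj] * aff[(vj, vi)] / out[vj]
                            for vj in vocab
                            if out[vj] > 0 and (vj, vi) in aff)
                   + (1 - mu) / len(vocab)
               for vi in vocab}
        if max(abs(new[v] - score[v]) for v in vocab) < eps:
            return new
        score = new
```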

B) Document-Level Key phrase Extraction After the scores of all candidate words in the document set have been computed, candidate phrases (either single-word or multi-word) are selected and evaluated for the specified document d0. The candidate words (i.e., nouns and adjectives) of d0, which form a subset of V, are marked in the text of the document, and sequences of adjacent candidate words are collapsed into multi-word phrases. The general strategy is that phrases ending with an adjective are not allowed; only phrases ending with a noun are collected as candidate phrases for the document.
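A sketch of this collapsing step follows, assuming the document has already been POS-tagged (e.g., by a tagger such as the one in JTextPro) and that tags follow the common Penn Treebank convention; both are assumptions for illustration.

```python
def candidate_phrases(tagged_tokens):
    """Collapse runs of adjacent candidate words (nouns/adjectives) in d0
    into candidate phrases, keeping only phrases that end with a noun.

    tagged_tokens: list of (word, pos) pairs in document order.
    """
    phrases, current = [], []
    for word, pos in tagged_tokens:
        if pos.startswith("NN") or pos.startswith("JJ"):  # noun or adjective
            current.append((word, pos))
        else:                                             # run ends here
            if current:
                phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    # discard phrases ending with an adjective; keep noun-final ones
    return [[w for w, _ in p] for p in phrases if p[-1][1].startswith("NN")]
```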

The score of a candidate phrase pi is computed by summing the neighborhood-level saliency scores of the words contained in the phrase: PhraseScore(pi) = Σ_{vj ∈ pi} WordScore(vj). All the candidate phrases in document d0 are ranked in decreasing order of their phrase scores, and the top m phrases are selected as the key phrases of d0.
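Putting the pieces together, the final ranking step reduces to a few lines (the function name and the default m are illustrative):

```python
def top_keyphrases(phrases, scores, m=10):
    """PhraseScore(pi) = sum of WordScore(vj) for the words vj in pi;
    return the m highest-scoring candidate phrases of d0."""
    ranked = sorted(
        ((" ".join(p), sum(scores.get(w, 0.0) for w in p)) for p in phrases),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:m]
```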

Work Division
Neighborhood Construction
Neighborhood-level Word Evaluation
Document-level Key phrase Extraction

Tools Used
Apache Lucene: Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
JTextPro: a Java-based text processing toolkit that currently includes important processing steps for natural language/text processing: sentence boundary detection, word tokenization, part-of-speech tagging, and phrase chunking.

Data Sets
News articles corpus
Document corpus for testing (with manually assigned key phrases)

Thank You