Single Document Key phrase Extraction Using Neighborhood Knowledge.


Introduction Key phrases are words, or groups of words, that capture the key ideas of a document. They represent important information about a document and offer an alternative, or a complement, to full-text indexing. Appropriate key phrases can serve as a highly condensed summary of a document. They can be used as a label that supplements or replaces the title or summary, or they can be highlighted in the body of the document to help users browse and read quickly. Key phrase extraction has a wide range of applications: document key phrases have been used successfully in IR and NLP tasks such as document indexing, document classification, document clustering and document summarization. Key phrases are usually assigned manually by authors, especially for journal or conference articles. Because document corpora are growing rapidly, the process of identifying key phrases needs to be automated.

EXISTING SYSTEM Existing methods perform key phrase extraction using only the information contained in the specified document, including each phrase's TFIDF, position and other syntactic information in the document. One major drawback of these methods is the assumption that documents are independent of each other, so the extraction task is carried out separately for each document, without any interaction between documents. In real scenarios, however, most documents are correlated: topic-related documents influence one another and contain useful clues that can help extract key phrases from each other. Existing systems do not take this neighborhood information into account, even though it would be a great help in key phrase extraction.

The methods for key phrase (or keyword) extraction can be roughly categorized as either unsupervised or supervised. Unsupervised methods usually assign a score to each candidate phrase by considering various features such as syntactic clues, use of italics, the presence of the phrase in section headers, and use of acronyms. Supervised machine learning algorithms have been proposed to classify a candidate phrase as either a key phrase or not. The most important features for classifying a candidate phrase are the frequency and location of the phrase in the document. All of the above methods use only the information contained in the specified document.

Proposed System In the proposed system, we find the neighbor documents of the given document d0, the document for which key phrases are to be identified. The neighbor documents are topically close to the specified document and together form its neighborhood knowledge context. Document d0 is thus expanded to a small document set D, which provides more knowledge and clues for key phrase extraction from d0. Once the document set is constructed, the proposed system adapts a graph-based ranking algorithm to incorporate both the word relationships in d0 and the word relationships in the neighbor documents.

ALGORITHM The algorithm comprises the following two phases:
1. Neighborhood Construction: Expand the specified document d0 to a small document set D = {d0, d1, d2, …, dk} by adding k neighbor documents. The neighbor documents d1, d2, …, dk can be obtained using document similarity search techniques.
2. Key phrase Extraction: Given document d0 and the expanded document set D, perform the following steps to extract key phrases for d0:
a) Neighborhood-level Word Evaluation: build the global affinity graph over the words of the expanded document set D and compute a score for each word.
b) Document-level Key phrase Extraction: for the specified document d0, evaluate the candidate phrases in the document based on the scores of the words contained in the phrases.

Neighborhood Construction The neighbor documents are obtained using document similarity search, which finds documents similar to a query document in a text corpus and returns a ranked list of similar documents. In the proposed algorithm, the similarity between any two documents di and dj, written sim_doc(di, dj) or sim(di, dj) for short, is the cosine similarity of their document vectors:

$$\mathrm{sim}(d_i, d_j) = \frac{\vec{d_i} \cdot \vec{d_j}}{\lVert \vec{d_i} \rVert \times \lVert \vec{d_j} \rVert}$$

Here \vec{d_i} and \vec{d_j} are the document vectors. We compute this similarity between d0 and the documents in the corpus, and the top k documents are added to form the document set D.

The similarity sim(d0, di) serves as a confidence value associated with each document di in the expanded document set. It reflects our belief that the document is sampled from the same underlying model as the specified document: when a document is close to the specified one, the confidence value is high, and when it is farther away, the confidence value is lower.
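The neighborhood construction step can be sketched as follows. This is a minimal illustration, not the presenters' implementation: it assumes documents are already tokenized and uses raw term counts as document vectors (the slides do not say how the vectors are weighted; TFIDF weights would be a natural alternative). The function names `cosine_sim` and `build_neighborhood` are hypothetical.

```python
import math
from collections import Counter


def cosine_sim(a, b):
    """Cosine similarity between two sparse term-weight vectors (dict-like)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_neighborhood(d0_tokens, corpus_tokens, k):
    """Return [(corpus_index, sim)] for the k corpus documents closest to d0.

    Assumption: raw term counts stand in for the document vectors.
    """
    v0 = Counter(d0_tokens)
    scored = [
        (i, cosine_sim(v0, Counter(tokens)))
        for i, tokens in enumerate(corpus_tokens)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned similarity values double as the confidence values described above, so they should be kept alongside the neighbor documents for the later graph construction.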

Key phrase Extraction A) Neighborhood-Level Word Evaluation The graph-based ranking algorithm implemented here is essentially a way of deciding the importance of a vertex within a graph based on global information recursively drawn from the entire graph. The basic idea is that of "voting" or "recommendation" between the vertices. A link between two vertices is considered a vote cast from one vertex to the other. The score associated with a vertex is determined by the votes that are cast for it and by the scores of the vertices casting these votes. Given the expanded document set D, we build an undirected graph called the Global Affinity Graph. Let G = (V, E) be the Global Affinity Graph that reflects the relationships between words in the document set. V is the set of vertices, and each vertex is a candidate word in the document set.

Each edge e_ij in E is associated with an affinity weight aff(vi, vj) between words vi and vj. The weight is computed from the co-occurrence relation between the two words, controlled by the distance between word occurrences:

$$\mathrm{aff}(v_i, v_j) = \sum_{d_p \in D} \mathrm{sim}(d_0, d_p) \times \mathrm{count}_{d_p}(v_i, v_j)$$

where count_dp(vi, vj) is the number of controlled co-occurrences of words vi and vj in document dp. We use an affinity matrix M to describe G, with each entry corresponding to the weight of an edge in the graph. M = (M_{i,j}) of size |V|×|V| is defined as

$$M_{i,j} = \begin{cases} \mathrm{aff}(v_i, v_j) & \text{if } v_i \text{ links with } v_j \text{ and } i \neq j \\ 0 & \text{otherwise} \end{cases}$$

After M is normalized to \tilde{M}, the score of each word vi is calculated with the iterative formula

$$\mathrm{WordScore}(v_i) = \mu \sum_{j \neq i} \mathrm{WordScore}(v_j) \times \tilde{M}_{j,i} + \frac{1 - \mu}{|V|}$$
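The affinity weights can be sketched as below. This is an illustrative sketch under stated assumptions: each document is a list of candidate words that has already been filtered, "controlled co-occurrence" is taken to mean that two distinct words fall within a sliding window of `window` tokens, and the helper names `cooccurrence_counts` and `global_affinity` are hypothetical. The slides do not give the window size, so the default of 2 (adjacent words only) is a placeholder.

```python
from collections import defaultdict


def cooccurrence_counts(tokens, window):
    """Count unordered pairs of distinct words that fall in the same window."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        # Positions i+1 .. i+window-1 share a window of `window` tokens with i.
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                counts[frozenset((word, other))] += 1
    return counts


def global_affinity(docs, sims, window=2):
    """aff(vi, vj) = sum over documents dp of sim(d0, dp) * count_dp(vi, vj).

    docs: token lists for d0 and its neighbors; sims: sim(d0, dp) for each.
    Returns a dict keyed by frozenset({vi, vj}), since the graph is undirected.
    """
    aff = defaultdict(float)
    for tokens, sim in zip(docs, sims):
        for pair, count in cooccurrence_counts(tokens, window).items():
            aff[pair] += sim * count
    return dict(aff)
```

Keying edges by `frozenset` keeps the graph undirected without storing each weight twice; pairs that never co-occur are simply absent, which corresponds to a zero entry in M.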

Here |V| is the number of words and μ is the damping factor, usually set to 0.85. Initially, the scores of all words are set to 1. The iteration is usually considered to have converged when, for every word, the difference between the scores computed at two successive iterations falls below a given threshold (for example 0.0001).
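The iterative scoring can be sketched as follows. The sketch assumes that "normalizing" means dividing each row of the affinity matrix by its row sum, which the slides do not spell out, and that edge weights arrive as a dict keyed by `frozenset({vi, vj})`. The function name `word_scores` and the `max_iter` safety cap are additions for illustration.

```python
def word_scores(aff, mu=0.85, tol=1e-4, max_iter=100):
    """Iterate WordScore(vi) = mu * sum_j WordScore(vj) * M~[j][i] + (1 - mu) / |V|."""
    # Build a symmetric adjacency map from the undirected edge weights.
    neighbors = {}
    for pair, weight in aff.items():
        if weight <= 0:
            continue  # zero-weight edges carry no votes
        u, v = tuple(pair)
        neighbors.setdefault(u, {})[v] = weight
        neighbors.setdefault(v, {})[u] = weight

    words = sorted(neighbors)
    n = len(words)
    # Assumed normalization: each row of M is divided by its sum.
    row_sum = {w: sum(neighbors[w].values()) for w in words}

    score = {w: 1.0 for w in words}  # all scores start at 1
    for _ in range(max_iter):
        new_score = {}
        for wi in words:
            votes = sum(
                score[wj] * weight / row_sum[wj]
                for wj, weight in neighbors[wi].items()
            )
            new_score[wi] = mu * votes + (1 - mu) / n
        delta = max(abs(new_score[w] - score[w]) for w in words)
        score = new_score
        if delta < tol:  # every word changed by less than the threshold
            break
    return score
```

On a three-word chain a, b, c with equal edge weights, the middle word b receives votes from both ends and ends up with the highest score, which matches the "voting" intuition described earlier.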

B) Document-Level Key phrase Extraction After the scores of all candidate words in the document set have been computed, candidate phrases (either single-word or multi-word) are selected and evaluated for the specified document d0. The candidate words of d0 (i.e. nouns and adjectives), which form a subset of V, are marked in the text of d0, and sequences of adjacent candidate words are collapsed into multi-word phrases. As a general rule, phrases ending with an adjective are not allowed, and only phrases ending with a noun are collected as candidate phrases for the document.
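Candidate phrase selection can be sketched as below. The sketch assumes the text of d0 has already been part-of-speech tagged with Penn Treebank style tags (noun tags starting with `NN`, adjective tags starting with `JJ`); the tag set and the function name `candidate_phrases` are assumptions, since the slides only say that nouns and adjectives are candidate words.

```python
def candidate_phrases(tagged):
    """Collapse runs of adjacent nouns/adjectives into phrases.

    tagged: list of (word, pos_tag) pairs for document d0.
    Returns phrases as tuples of lowercased words; runs that end with an
    adjective are discarded, so only noun-final phrases are kept.
    """
    phrases = []
    run = []
    # A sentinel with a non-candidate tag flushes the final run.
    for word, pos in list(tagged) + [("", "END")]:
        if pos.startswith("NN") or pos.startswith("JJ"):
            run.append((word.lower(), pos))
            continue
        if run and run[-1][1].startswith("NN"):
            phrases.append(tuple(w for w, _ in run))
        run = []
    return phrases
```

For example, the tagged sentence "Automatic/JJ keyphrase/NN extraction/NN is/VBZ useful/JJ" yields the single phrase `("automatic", "keyphrase", "extraction")`; the run consisting only of "useful" is dropped because it ends with an adjective.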

The score of a candidate phrase pi is computed by summing the neighborhood-level saliency scores of the words contained in the phrase:

$$\mathrm{PhraseScore}(p_i) = \sum_{v_j \in p_i} \mathrm{WordScore}(v_j)$$

All the candidate phrases in document d0 are ranked in decreasing order of phrase score, and the top m phrases are selected as the key phrases of d0.
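Phrase scoring and selection can be sketched as follows. The sketch assumes phrases are tuples of words and word scores are held in a dict; the function name `top_phrases` is hypothetical, and treating a word with no score as 0 is an assumption the slides do not address.

```python
def top_phrases(phrases, word_score, m):
    """Rank candidate phrases by the sum of their word scores; keep the top m.

    phrases: iterable of tuples of words (duplicates are scored once).
    word_score: dict mapping word -> WordScore; missing words count as 0.0.
    """
    scored = {
        phrase: sum(word_score.get(word, 0.0) for word in phrase)
        for phrase in phrases
    }
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked[:m]
```

Because the phrase score is a plain sum, longer phrases tend to rank above their sub-phrases whenever all of their words have positive scores.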

Work Division: Neighborhood Construction; Neighborhood-level Word Evaluation; Document-level Key phrase Extraction

Tools Used
Apache Lucene: Apache Lucene(TM) is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
JTextPro: A Java-based text processing toolkit that currently includes the following natural language processing steps: sentence boundary detection, word tokenization, part-of-speech tagging and phrase chunking.

Data Sets
A corpus of 100,000 news articles.
A document corpus for testing, with manually assigned key phrases.

Thank You
