1 Extracting Key Terms From Noisy and Multi-theme Documents
Maria Grineva, Maxim Grinev and Dmitry Lizorkin
Proceedings of the 18th International World Wide Web Conference (WWW 2009), ACM
Speaker: Chien-Liang Wu

2 Outline
Motivation
Framework of Key Terms Extraction
  Candidate Terms Extraction
  Word Sense Disambiguation
  Building Semantic Graph
  Discovering Community Structure of the Semantic Graph
  Selecting Valuable Communities
Experiments
Conclusions

3 Motivation
Key terms extraction is a basic step for various NLP tasks:
  Document classification
  Document clustering
  Text summarization
Challenges:
  Web pages are typically noisy (side bars/menus, comments, ...)
  Dealing with multi-theme Web pages (e.g. portal home pages)

4 Motivation (cont.)
State-of-the-art approaches to key terms extraction:
  Based on statistical learning (TFxIDF model, keyphrase frequency, ...)
  Require a training set
Approach in this paper:
  Based on analyzing syntactic or semantic term relatedness within a document
  Computes semantic relatedness between terms using Wikipedia
  Models the document as a semantic graph of terms and applies graph analysis techniques to it
  No training set required

5 Framework

6 Candidate Terms Extraction
Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning
Parse the input document and extract all possible n-grams
For each n-gram (plus its morphological variations), provide a set of Wikipedia article titles
  Example: "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
Avoid nonsense phrases appearing in the results ("using", "electric cars are", ...)
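
A minimal sketch of this step, assuming a hypothetical `title_index` dictionary that maps a normalized phrase to the Wikipedia article titles that can describe it (the authors' own extraction pipeline and morphological normalization are not shown on the slides):

```python
# Minimal sketch, not the authors' code: map document n-grams to candidate
# Wikipedia articles. `title_index` is a hypothetical dict from a normalized
# phrase to the set of Wikipedia article titles that can describe it.
import re
from typing import Dict, Set

def extract_candidates(text: str,
                       title_index: Dict[str, Set[str]],
                       max_n: int = 3) -> Dict[str, Set[str]]:
    """Return every n-gram (n <= max_n) that matches at least one Wikipedia
    title, together with its candidate article titles."""
    tokens = re.findall(r"[A-Za-z0-9']+", text.lower())
    candidates: Dict[str, Set[str]] = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            titles = title_index.get(ngram)
            if titles:                      # phrases with no matching title are dropped
                candidates[ngram] = titles  # e.g. "drink" -> {"Drink", "Drinking"}
    return candidates
```

Filtering n-grams against Wikipedia titles is also what removes nonsense phrases such as "electric cars are": they simply have no matching article.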

7 Word Sense Disambiguation
Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
Reference: "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation", SYRCoDIS, 2008

8 Word Sense Disambiguation (contd.)
Example text: "Jigsaw is W3C's open-source project that started in May 1996. It is a web server platform that provides a sample HTTP 1.1 implementation and ..."
Ambiguous term: "platform"
Four Wikipedia concepts around this word: "open-source", "web server", "HTTP", and "implementation"

9 Word Sense Disambiguation (contd.)
Neighbors of an article: all Wikipedia articles that have an incoming or outgoing link to/from the original article
After this step, each term is assigned a single Wikipedia article that describes its meaning
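
A hedged sketch of how the assignment could work, assuming the chosen sense is the candidate article most related, on average, to the articles of the surrounding unambiguous terms; `relatedness` stands for the Wikipedia link-based measure introduced on the following slides:

```python
# Sketch of the disambiguation step. Assumption: the best sense for an
# ambiguous term is the candidate article most related, on average, to the
# articles of the surrounding unambiguous terms. `relatedness` is the
# Wikipedia link-based measure from the next slides.
from typing import Callable, Dict, Set

def disambiguate(candidates: Dict[str, Set[str]],
                 relatedness: Callable[[str, str], float]) -> Dict[str, str]:
    # Terms with a single candidate article are unambiguous; use them as context.
    context = [next(iter(arts)) for arts in candidates.values() if len(arts) == 1]
    chosen: Dict[str, str] = {}
    for term, articles in candidates.items():
        if len(articles) == 1:
            chosen[term] = next(iter(articles))
            continue
        # Pick the candidate with the highest average relatedness to the context.
        chosen[term] = max(
            articles,
            key=lambda a: sum(relatedness(a, c) for c in context) / max(len(context), 1),
        )
    return chosen
```

In the Jigsaw example, the computing sense of "platform" would score highest because it is the candidate most related to "web server", "HTTP", and "open-source".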

10 Building Semantic Graph
Goal: build a document semantic graph using semantic relatedness between terms
The semantic graph is a weighted graph:
  Vertex: term
  Edge: connects two vertices whose terms are semantically related
  Edge weight: semantic relatedness measure of the two terms
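
A short sketch of the graph construction with networkx (the library choice is an assumption; the slides name none). `relatedness` is the Wikipedia-based measure detailed on the next slide, and the zero threshold for adding an edge is also an assumption:

```python
# Sketch of the semantic-graph construction with networkx. `terms` maps each
# extracted term to its disambiguated Wikipedia article; `relatedness` is the
# measure described on the next slide.
from itertools import combinations
from typing import Callable, Dict

import networkx as nx

def build_semantic_graph(terms: Dict[str, str],
                         relatedness: Callable[[str, str], float]) -> nx.Graph:
    g = nx.Graph()
    g.add_nodes_from(terms)                     # vertex = term
    for (t1, a1), (t2, a2) in combinations(terms.items(), 2):
        w = relatedness(a1, a2)                 # edge weight = semantic relatedness
        if w > 0:                               # assumed threshold: drop unrelated pairs
            g.add_edge(t1, t2, weight=w)
    return g
```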

11 Building Semantic Graph (contd.)
Uses the Dice measure for Wikipedia-based semantic relatedness (reference: SYRCoDIS, 2008), with weights for various link types:
  relatedness(A, B) = 2 * |n(A) ∩ n(B)| / (|n(A)| + |n(B)|)
where n(A) is the set of neighbors of article A
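
A minimal sketch of the Dice-based relatedness, assuming `neighbors` maps an article title to its set of linked articles; the per-link-type weights mentioned on the slide are omitted for brevity:

```python
# Dice-measure relatedness between two Wikipedia articles. `neighbors` is
# assumed to map an article title to the set of articles it links to or is
# linked from; the per-link-type weights from the slide are not modeled here.
from typing import Dict, Set

def dice_relatedness(a: str, b: str, neighbors: Dict[str, Set[str]]) -> float:
    na, nb = neighbors.get(a, set()), neighbors.get(b, set())
    if not na and not nb:
        return 0.0
    return 2.0 * len(na & nb) / (len(na) + len(nb))
```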

12 Detecting Community Structure of the Semantic Graph
Uses the Newman algorithm
Example news article: "Apple to Make iTunes More Accessible For the Blind"
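
For illustration, a sketch using networkx's Girvan-Newman implementation; interpreting the slide's "Newman algorithm" this way, and keeping the partition with the highest weighted modularity, are assumptions:

```python
# Illustration of the community-detection step with networkx. The choice of
# Girvan-Newman and of the modularity-based stopping criterion are assumptions.
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

def detect_communities(g: nx.Graph, max_levels: int = 10):
    best, best_q = [set(g.nodes)], float("-inf")
    for _, partition in zip(range(max_levels), girvan_newman(g)):
        communities = [set(c) for c in partition]
        q = modularity(g, communities, weight="weight")
        if q > best_q:                      # keep the best-scoring partition seen so far
            best, best_q = communities, q
    return best
```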

13 Selecting Valuable Communities
Goal: rank term communities so that:
  higher-ranked communities contain key terms
  lower-ranked communities contain unimportant terms and possible disambiguation mistakes
Uses the density of community C_i as one ranking factor
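
The exact density formula is in the slide image and is not preserved in this transcript; the sketch below assumes it is the sum of intra-community edge weights divided by the number of vertices:

```python
# Assumed density of a term community: total intra-community edge weight per
# vertex (the slide's exact formula is not reproduced in the transcript).
import networkx as nx

def density(g: nx.Graph, community: set) -> float:
    sub = g.subgraph(community)
    total = sum(d.get("weight", 1.0) for _, _, d in sub.edges(data=True))
    return total / max(len(community), 1)
```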

14 Selecting Valuable Communities (contd.)
Informativeness of community C_i: based on the ratio count(D_link) / count(D_term) of its terms, where:
  count(D_link) is the number of Wikipedia articles in which the term appears as a link
  count(D_term) is the total number of articles in which it appears
Gives higher values to named entities (for example, Apple Inc., Steve Jobs, Braille) than to general terms (Consumer, Agreement, Information)
Community rank = density * informativeness
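
A sketch of the ranking: the per-term ratio count(D_link) / count(D_term) follows the slide's definitions, while averaging it over the community, the density definition from the previous sketch, and the `link_count` / `term_count` lookup tables are assumptions:

```python
# Sketch of informativeness and the final community rank; aggregation choices
# and the Wikipedia count tables are assumptions.
from typing import Dict, Set

import networkx as nx

def informativeness(community: Set[str],
                    link_count: Dict[str, int],
                    term_count: Dict[str, int]) -> float:
    ratios = [link_count.get(t, 0) / term_count[t]
              for t in community if term_count.get(t)]
    return sum(ratios) / len(ratios) if ratios else 0.0

def community_rank(g: nx.Graph, community: Set[str],
                   link_count: Dict[str, int],
                   term_count: Dict[str, int]) -> float:
    # Same assumed density as in the previous sketch: intra-community weight per vertex.
    intra = sum(d.get("weight", 1.0)
                for _, _, d in g.subgraph(community).edges(data=True))
    dens = intra / max(len(community), 1)
    return dens * informativeness(community, link_count, term_count)
```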

15 Selecting Valuable Communities (contd.)
A sharp decline in community rank marks the border between important and non-important term communities
On the test collection, this decline coincides with the maximum F-measure in 73% of cases
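
The slides do not spell out how the decline is located; one plausible heuristic, shown below as an assumption, is to cut the rank-sorted list just before the largest drop between consecutive communities:

```python
# Assumed heuristic for locating the decline in the ranked community list.
from typing import List, Set, Tuple

def split_at_decline(ranked: List[Tuple[Set[str], float]]) -> List[Set[str]]:
    """`ranked`: (community, rank) pairs sorted by rank in descending order."""
    if len(ranked) < 2:
        return [c for c, _ in ranked]
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1      # keep everything before the biggest drop
    return [c for c, _ in ranked[:cut]]
```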

16 Experiment
Noise-free dataset: 252 posts from 5 technical blogs
22 annotators took part in the experiment; each document was analyzed by 5 different annotators
A key term was considered valid if at least two participants identified it
For each document, two sets of key terms were built:
  Uncontrolled key terms: each annotator freely identified 5~10 key terms
  Controlled key terms: key terms matched against Wikipedia article titles
In total, 2009 key terms were collected; 93% of them are Wikipedia article titles

17 Evaluation

18 Results
Revision of precision and recall:
  The communities-based method extracts more related terms in each thematic group than a human does, giving better term coverage
  Each participant reviewed the automatically extracted key terms and, where possible, extended his manually identified key terms
  This added 389 manually selected key terms
  Precision ↑ 46.1%, recall ↑ 67.7%

19 Evaluation on Web Pages
509 real-world web pages; key terms were manually selected from the web pages in the same manner
Noise stability

20 Evaluation on Web Pages (contd.)
Multi-theme stability: 50 web pages with diverse topics (news websites and home pages of Internet portals with lists of featured articles)
Result: (chart on slide)

21 Conclusion
Extracts key terms from a text document
No training dataset required
Wikipedia-based knowledge base
Word sense disambiguation
Semantic graph and semantic relatedness
Valuable key term communities

