Extracting Key Terms From Noisy and Multi-theme Documents
Maria Grineva, Maxim Grinev and Dmitry Lizorkin
Proceedings of the 18th International World Wide Web Conference, ACM WWW, 2009
Speaker: Chien-Liang Wu

Outline
- Motivation
- Framework of Key Terms Extraction
  - Candidate Terms Extraction
  - Word Sense Disambiguation
  - Building Semantic Graph
  - Discovering Community Structure of the Semantic Graph
  - Selecting Valuable Communities
- Experiments
- Conclusions

Motivation
- Key terms extraction is a basic step for various NLP tasks:
  - Document classification
  - Document clustering
  - Text summarization
- Challenges:
  - Web pages are typically noisy (side bars, menus, comments, ...)
  - Dealing with multi-theme Web pages (e.g., portal home pages)

Motivation (cont.)
- State-of-the-art approaches to key terms extraction:
  - Based on statistical learning (TFxIDF model, keyphrase frequency, ...)
  - Require a training set
- Approach in this paper:
  - Based on analyzing syntactic or semantic term relatedness within a document
  - Computes semantic relatedness between terms using Wikipedia
  - Models the document as a semantic graph of terms and applies graph analysis techniques to it
  - No training set required

Framework (pipeline diagram not preserved in this transcript)

Candidate Terms Extraction
- Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning
- Parse the input document and extract all possible n-grams
- For each n-gram (plus its morphological variations), provide a set of Wikipedia article titles
  - e.g., "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
- Avoid nonsense phrases appearing in the results ("using", "electric cars are", ...)
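This extraction step can be sketched as follows. The `title_index` mapping from surface forms to Wikipedia article titles is a hypothetical stand-in for a real Wikipedia title dictionary; filtering n-grams against it is what discards nonsense phrases:

```python
import re

def ngrams(text, max_n=3):
    """Extract all word n-grams (n = 1..max_n) from a document."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out

# Hypothetical title index: maps surface forms (including morphological
# variants) to the Wikipedia article titles that could describe them.
title_index = {"drink": ["Drink"], "drinks": ["Drink"],
               "drinking": ["Drink", "Drinking"]}

def candidate_terms(text, index):
    # Keep only n-grams that match some Wikipedia article title;
    # phrases like "electric cars are" simply never match and drop out.
    return {g: index[g] for g in ngrams(text) if g in index}
```

A real implementation would also normalize morphological variants before the lookup, which this sketch sidesteps by listing the variants in the index.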

Word Sense Disambiguation
- Goal: for each ambiguous term extracted in the previous step, choose the most appropriate Wikipedia article from its set of candidate articles
- Reference: "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation", SYRCoDIS, 2008

Word Sense Disambiguation (cont.)
- Example text: "Jigsaw is W3C's open-source project that started in May ... It is a web server platform that provides a sample HTTP 1.1 implementation and ..."
- Ambiguous term: "platform"
- Four Wikipedia concepts around this word: "open-source", "web server", "HTTP", and "implementation"

Word Sense Disambiguation (cont.)
- Neighbors of an article: all Wikipedia articles that have an incoming or outgoing link to or from the original article
- After disambiguation, each term is assigned a single Wikipedia article that describes its meaning

Building Semantic Graph
- Goal: build the document semantic graph using semantic relatedness between terms
- The semantic graph is a weighted graph:
  - Vertex: a term
  - Edge: connects two semantically related vertices
  - Edge weight: the semantic relatedness measure of the two terms

Building Semantic Graph (cont.)
- Uses a Dice measure for Wikipedia-based semantic relatedness (reference: SYRCoDIS, 2008), computed over article neighbor sets, where n(A) denotes the neighbors of article A
- Different Wikipedia link types receive different weights (weight table not preserved in this transcript)
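The Dice measure over neighbor sets that the slide refers to can be sketched directly; the per-link-type weighting from the lost table is not modeled here:

```python
def dice_relatedness(a, b, neighbors):
    """Dice measure over Wikipedia link-neighbor sets:
    2 * |n(A) & n(B)| / (|n(A)| + |n(B)|)."""
    na, nb = neighbors[a], neighbors[b]
    if not na and not nb:
        return 0.0
    return 2 * len(na & nb) / (len(na) + len(nb))
```

An edge (u, v) of the semantic graph would then carry weight `dice_relatedness(u, v, neighbors)`, with zero-weight pairs left unconnected.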

Detecting Community Structure of the Semantic Graph
- The semantic graph is partitioned into term communities using the Newman algorithm
- Example: a news article, "Apple to Make ITunes More Accessible For the Blind" (community figure not preserved in this transcript)
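Community detection by modularity maximization can be illustrated with a minimal greedy agglomerative pass over an unweighted graph. This is a simplification for intuition only; the paper's exact variant of Newman's algorithm, and its handling of edge weights, may differ:

```python
from itertools import combinations

def modularity(edges, parts, degree, m):
    """Newman's modularity Q for a partition of an undirected graph."""
    q = 0.0
    for c in parts:
        e_in = sum(1 for u, v in edges if u in c and v in c)
        d = sum(degree[n] for n in c)
        q += e_in / m - (d / (2 * m)) ** 2
    return q

def greedy_communities(edges):
    """Start from singleton communities; repeatedly merge the pair of
    communities that most increases Q, until no merge improves it."""
    nodes = {n for e in edges for n in e}
    degree = {n: sum(n in e for e in edges) for n in nodes}
    m = len(edges)
    parts = [frozenset([n]) for n in nodes]
    while True:
        best_q = modularity(edges, parts, degree, m)
        best = None
        for a, b in combinations(parts, 2):
            trial = [p for p in parts if p not in (a, b)] + [a | b]
            q = modularity(edges, trial, degree, m)
            if q > best_q:
                best_q, best = q, (a, b)
        if best is None:
            return parts
        a, b = best
        parts = [p for p in parts if p not in (a, b)] + [a | b]
```

On a graph made of two triangles joined by a single bridge edge, this recovers the two triangles as separate communities, mirroring how thematically related terms cluster together in the semantic graph.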

Selecting Valuable Communities
- Goal: rank term communities so that:
  - higher-ranked communities contain key terms
  - lower-ranked communities contain unimportant terms and possible disambiguation mistakes
- Uses the density of community C_i (formula image not preserved in this transcript)

Selecting Valuable Communities (cont.)
- Informativeness of community C_i is based on term keyphraseness, the ratio count(D_Link) / count(D_term), where:
  - count(D_Link) is the number of Wikipedia articles in which the term appears as a link
  - count(D_term) is the total number of articles in which it appears
- This assigns higher values to named entities (e.g., Apple Inc., Steve Jobs, Braille) than to general terms (Consumer, Agreement, Information)
- Community rank = density x informativeness
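From the quantities named on these two slides, the ranking can be written out as follows. The keyphraseness ratio is stated in the transcript; the averaging inside informativeness and the normalization of density are assumptions, since the original formula images did not survive:

```latex
\mathrm{keyphraseness}(t) = \frac{\operatorname{count}(D_{\mathrm{Link}})}{\operatorname{count}(D_{\mathrm{term}})}
\qquad
\mathrm{informativeness}(C_i) = \frac{1}{|C_i|} \sum_{t \in C_i} \mathrm{keyphraseness}(t)

\mathrm{density}(C_i) = \frac{1}{|C_i|} \sum_{(u,v) \in E(C_i)} w(u,v)
\qquad
\mathrm{rank}(C_i) = \mathrm{density}(C_i) \cdot \mathrm{informativeness}(C_i)
```

Here E(C_i) is the set of edges inside community C_i and w(u, v) the semantic-relatedness weight of an edge.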

Selecting Valuable Communities (cont.)
- The decline in community rank serves as the border between important and non-important term communities
- On the test collection, the decline coincides with the maximum F-measure in 73% of cases

Experiment
- Noise-free dataset: 252 posts from 5 technical blogs
- 22 annotators took part; each document was analyzed by 5 different annotators
- For each document, two sets of key terms were built:
  - Uncontrolled key terms: each annotator identified 5~10 key terms freely
  - Controlled key terms: key terms matched to Wikipedia article titles
- A key term was considered valid if at least two participants identified it
- In total, 2009 key terms were collected; 93% of them are Wikipedia titles

Evaluation (results table not preserved in this transcript)

Results
- Revision of precision and recall: the communities-based method extracts more related terms in each thematic group than a human, giving better term coverage
- Each participant reviewed the automatically extracted key terms and, where possible, extended their manually identified key terms
- This yielded 389 additional manually selected key terms
- Precision ↑ 46.1%, recall ↑ 67.7%

Evaluation on Web Pages
- 509 real-world web pages; key terms were manually selected from the web pages in the same manner
- Noise stability (results plot not preserved in this transcript)

Evaluation on Web Pages (cont.)
- Multi-theme stability: 50 web pages with diverse topics
  - News websites and home pages of Internet portals with lists of featured articles
- Result: (table not preserved in this transcript)

Conclusion
- Extracts key terms from a text document with no training dataset required
- Uses Wikipedia as the knowledge base throughout:
  - Word sense disambiguation
  - Building the semantic graph via semantic relatedness
  - Selecting valuable key term communities