1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.


1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar

2 Papers covered: "Wikify! Linking Documents to Encyclopedic Knowledge", R. Mihalcea and A. Csomai; "Learning to Link with Wikipedia", D. Milne and I. H. Witten.

3 What is Wikification? Automatically cross-referencing documents (unstructured text) with Wikipedia. It involves automatic keyword extraction and word sense disambiguation.

4 Wikify! - Introduction Wikify! introduces the annotation of documents by linking them to Wikipedia. Possible applications include the semantic web, educational applications, and a number of text processing problems. Previous similar systems, Microsoft Smart Tags and Google AutoLink, were based merely on word or phrase lookup (no keyword extraction or disambiguation).

5 Wikify! - Text Wikification

6 Wikify! - Keyword Extraction Recommendations from the Wikipedia style manual: link terms that provide a deeper understanding of the topic, avoid linking unrelated terms, and select a proper amount of keywords. The unsupervised algorithms involve two steps. Candidate extraction: extract all possible n-grams. Keyword ranking: assign a numeric value to each candidate. Three ranking methods were used: tf-idf, χ², and keyphraseness.
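The two steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the function names and the `stats` table are hypothetical, and keyphraseness is assumed here to be the share of documents containing a term in which that term was selected as a link.

```python
from typing import Dict, List, Tuple


def extract_candidates(text: str, max_n: int = 3) -> List[str]:
    """Step 1: return every word n-gram of length 1..max_n."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


def keyphraseness(linked_docs: int, total_docs: int) -> float:
    """Assumed definition: docs where the term is a link / docs where it appears."""
    return linked_docs / total_docs if total_docs else 0.0


def rank_keywords(
    text: str, stats: Dict[str, Tuple[int, int]], max_n: int = 3
) -> List[Tuple[str, float]]:
    """Step 2: score each known candidate and sort from best to worst.

    `stats` is a hypothetical lookup: term -> (linked_docs, total_docs).
    """
    scores: Dict[str, float] = {}
    for candidate in extract_candidates(text, max_n):
        if candidate in stats:
            linked, total = stats[candidate]
            scores[candidate] = keyphraseness(linked, total)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Candidates that never appear in the lookup table are simply skipped, which is one simple way to discard n-grams with no evidence behind them.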

7 Wikify! - Evaluation of Keyword Extraction

8 Wikify! - Word Sense Disambiguation Ambiguity is inherent to human language. Disambiguation algorithms fall into two groups. Knowledge-based: rely exclusively on knowledge derived from dictionaries. Data-driven: based on probabilities collected from sense-annotated data. Here a voting scheme is used that seeks agreement between the two. Wikify! provides highly precise annotations, even if recall is lower.
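A voting scheme of this kind can be sketched as follows. The sketch assumes each disambiguator returns a sense label or nothing; the function name and the labels are hypothetical.

```python
from typing import Optional


def vote(knowledge_sense: Optional[str], data_sense: Optional[str]) -> Optional[str]:
    """Keep an annotation only when both methods propose the same sense.

    Returning None (no annotation) on disagreement trades recall for precision.
    """
    if knowledge_sense is not None and knowledge_sense == data_sense:
        return knowledge_sense
    return None
```

Leaving a term unannotated whenever the two methods disagree is what yields high precision at the cost of lower recall.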

9 Wikify! - Disambiguation Evaluation Word sense disambiguation results: total number of attempted (A) and correct (C) word senses, together with precision (P), recall (R) and F-measure (F) evaluations.
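The measures named on this slide follow the standard definitions, sketched below. The parameter `total` (the number of ambiguous words in the gold standard) is an assumed input, and the numbers in the usage note are made up for illustration.

```python
from typing import Tuple


def evaluate(attempted: int, correct: int, total: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from attempted (A) and correct (C) counts."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, with 80 attempted, 60 correct and 100 words in total, precision is 0.75 and recall is 0.6.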

10 Wikify! - Overall Evaluation and Conclusion Wikify! lets the user upload a text file or supply the URL of a web page, processes the document, and returns its wikified version. The user can also set the density of keywords in the range 2%-10%, the default being 6%. In an evaluation by human judges (20 users evaluating 10 documents each), the human-annotated version was told apart from the automatically wikified one in only 57% of the cases (50% would be the ideal case, meaning the two are indistinguishable).
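The density setting can be read as a cap on how many ranked keywords are kept. The sketch below assumes density means the number of keywords divided by the number of words in the document, which the slide does not spell out; the function name is hypothetical.

```python
from typing import List


def select_by_density(
    ranked_keywords: List[str], n_words: int, density: float = 0.06
) -> List[str]:
    """Keep the top-ranked keywords up to `density` of the document length."""
    if not 0.02 <= density <= 0.10:
        raise ValueError("density must be between 0.02 and 0.10")
    k = int(round(n_words * density))  # assumed reading: keywords per word
    return ranked_keywords[:k]
```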

11 Learning to Link with Wikipedia A machine learning approach to identifying significant terms within unstructured text. It can provide structured knowledge about any unstructured text. It uses Wikipedia articles as training data, which improves recall and precision.

12 Snapshot of Wikified document

13 Learning to Disambiguate Links Uses disambiguation to inform detection. Features such as the commonness and relatedness of a term are used to resolve ambiguity. The commonness of a sense is defined by the number of times it is used as a link destination in Wikipedia articles. Commonness = (no. of times the term is used as a link) / (no. of times the term appears in Wikipedia articles)
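Both quantities on this slide are simple ratios over link counts. The sketch below is an illustration only: normalising destination counts over all senses of the same anchor is an assumption the slide does not state, and the function names and counts are hypothetical.

```python
from typing import Dict


def sense_commonness(destination_counts: Dict[str, int]) -> Dict[str, float]:
    """Share of an anchor's links that point to each sense (assumed normalisation)."""
    total = sum(destination_counts.values())
    if total == 0:
        return {sense: 0.0 for sense in destination_counts}
    return {sense: n / total for sense, n in destination_counts.items()}


def link_ratio(times_linked: int, times_appearing: int) -> float:
    """The ratio as written on the slide: times used as a link / times it appears."""
    return times_linked / times_appearing if times_appearing else 0.0
```

With made-up counts such as `{"Tree": 93, "Tree (graph theory)": 7}`, the first sense receives a share of 0.93.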

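14 Disambiguation (Continued) Relatedness is given by the following formula:

$$\mathrm{relatedness}(a, b) = \frac{\log(\max(|A|, |B|)) - \log(|A \cap B|)}{\log(|W|) - \log(\min(|A|, |B|))}$$

where a and b are the two articles of interest, A and B are the sets of all articles that link to a and b respectively, and W is the set of all articles in Wikipedia.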
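A link-based measure over A, B and W can be sketched as below. The sketch assumes the normalised form published by Milne and Witten, (log max(|A|,|B|) - log |A∩B|) / (log |W| - log min(|A|,|B|)); returning infinity when the two articles share no incoming links is a choice made for this example, and the page identifiers are invented.

```python
import math
from typing import Set


def relatedness(links_to_a: Set[str], links_to_b: Set[str], wikipedia_size: int) -> float:
    """Evaluate the link-based measure for two articles.

    Assumes wikipedia_size is larger than either link set, so the
    denominator is positive.
    """
    common = len(links_to_a & links_to_b)
    if common == 0:
        return math.inf  # assumption: log(0) is treated as "no shared links"
    larger = max(len(links_to_a), len(links_to_b))
    smaller = min(len(links_to_a), len(links_to_b))
    return (math.log(larger) - math.log(common)) / (
        math.log(wikipedia_size) - math.log(smaller)
    )
```

Evaluated this way, two articles with identical sets of incoming links score 0, and the value grows as the overlap shrinks.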

15 Disambiguation (Continued) Commonness and Relatedness

16 Disambiguation (Continued) Not all context terms are equally useful, so each context term is assigned a weight, which is the average of its link probability (i.e. commonness) and its relatedness. A further feature, context quality, is defined as the sum of the weights assigned to the context terms. These features are used to train the classifier. The classifier is configured with a parameter that specifies the minimum probability of a sense.
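The weighting described on this slide reduces to an average and a sum. A minimal sketch, with hypothetical function names and made-up input values:

```python
from typing import Iterable, Tuple


def context_weight(link_probability: float, relatedness: float) -> float:
    """Weight of one context term: average of its two scores."""
    return (link_probability + relatedness) / 2


def context_quality(terms: Iterable[Tuple[float, float]]) -> float:
    """Sum of the weights of all context terms.

    `terms` holds one (link_probability, relatedness) pair per context term.
    """
    return sum(context_weight(lp, rel) for lp, rel in terms)
```

For instance, two context terms scored (0.5, 0.3) and (0.1, 0.7) each have weight 0.4, giving a context quality of 0.8.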

17 Disambiguation Evaluation The disambiguation classifier was trained on 500 articles (instead of the entire Wikipedia) on a modest desktop with a 3 GHz dual-core processor and 4 GB of RAM. The classifier was configured using 100 Wikipedia articles. Training took 13 minutes and testing 4 minutes; another 3 minutes were required to load the summaries of Wikipedia's link structure and anchor statistics into memory. To evaluate the classifier, anchors were gathered from 100 random articles.

18 Disambiguation Evaluation

19 Learning to Detect Links The central difference between Wikify!'s link detection approach and this new link detector: Wikify! relies exclusively on link probability, whereas the new approach also takes into account the context surrounding the terms. This link detector discards only terms with a very low link probability, so that nonsense phrases and stop words are removed.
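The initial filtering step can be sketched as a threshold on link probability. The function name, the lookup table and the threshold value used in the usage note are hypothetical; the slide only says that terms with a very low link probability are dropped.

```python
from typing import Dict, List


def filter_candidates(link_probabilities: Dict[str, float], min_probability: float) -> List[str]:
    """Drop candidate terms whose link probability falls below the threshold.

    Everything that survives is passed on for context-aware classification.
    """
    return [
        term
        for term, probability in link_probabilities.items()
        if probability >= min_probability
    ]
```

With a made-up table `{"the": 0.0001, "machine learning": 0.4}` and a threshold of 0.01, only "machine learning" survives.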

20 Features used for Link Detection Link probability: the average link probability is considered. Relatedness: semantic relatedness, the average relatedness between each topic and all other candidates. Disambiguation confidence. Generality. Location and spread.

21 Link Detector

22 Link Detector Performance The same dataset as for the disambiguation classifier was used for training, configuration and evaluation. The minimum link probability was set to 6.5%, since recall and precision balance at that point. The link detector was trained on unambiguous terms.

23 Link Detector Performance (Continued)

24 Wikification in the Wild The system was tested on news articles instead of Wikipedia articles, and it achieved 76.4% accuracy in link detection.

25 Conclusions This system resolves ambiguity as well as polysemy. A common hurdle in all such applications: they must somehow move from unstructured text to a collection of relevant Wikipedia articles. This paper has contributed a proven method for extracting key concepts from plain text. Ultimately, these are attempts to explain and organize the sum total of human knowledge.

26 Application on itself

27 Questions ?