Disambiguation of References to Individuals Levon Lloyd (State University of New York) Varun Bhagwan, Daniel Gruhl (IBM Research Center)

Presentation transcript:

Disambiguation of References to Individuals Levon Lloyd (State University of New York) Varun Bhagwan, Daniel Gruhl (IBM Research Center) Andrew Tomkins (Yahoo Research)

Introduction The problem of name ambiguity is widespread. The authors study the problem of disambiguating textual references to individuals. Their goal is to develop algorithms that cluster the references to a given name so that the resulting clusters correspond as closely as possible to the distinct individuals who bear it. They use linguistically derived features and bottom-up clustering to measure how well this disambiguation can be performed.

Applications –Dossier Creation –Relationship Detection –Person Search –Expertise Location –Authorship –Homepage Location Across these applications, maintaining sufficiently high precision is paramount; recall is then improved as far as possible.

Related Work –Word sense disambiguation –Name co-reference –Place disambiguation –Author disambiguation in citations –Template-based extraction

Data – Sets of Names Two distinct data sets –Household names (famous actors/actresses or famous computer scientists/mathematicians) –General names (1000 names from analysis of web data) From the general names, they kept those that occurred at least 500 times within the 2.1B web pages. They then used information from the 1990 US Census to estimate the probability that a person drawn uniformly at random from that census would match both the first and last name (< 5 × 10^-8).
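
The census-based estimate can be sketched as follows. This is only an illustration: it assumes that first-name and last-name frequencies are independent, and it reads the 5 × 10^-8 figure as an upper bound on the full-name probability. The function names and the example frequencies are hypothetical and do not come from the slides.

```python
def full_name_probability(p_first, p_last):
    """Estimate P(first and last) as a product, assuming independence (an assumption)."""
    return p_first * p_last


def is_rare_name(p_first, p_last, cutoff=5e-8):
    """True when the estimated full-name probability falls below the cutoff."""
    return full_name_probability(p_first, p_last) < cutoff


# Hypothetical frequencies, chosen only to show both outcomes.
print(is_rare_name(1e-4, 1e-4))   # product 1e-8, below the cutoff
print(is_rare_name(0.03, 0.01))   # product 3e-4, above the cutoff
```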

Data Gathering They used the full 2.8B pages of IBM's WebFountain system to gather data and run experiments. For each result, they extracted a region of 100 words centered on the name and replaced each occurrence of the first and last name with FIRST and LAST respectively. The algorithm is then asked to cluster these references.
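
A minimal sketch of the snippet extraction described on this slide, assuming whitespace tokenization and an exact "first last" match. The function name `extract_snippets` and the punctuation handling are illustrative choices, not the authors' implementation.

```python
import re


def extract_snippets(text, first, last, window=100):
    """Return windows of about `window` tokens centered on each 'first last' mention,
    with the name masked as FIRST and LAST."""
    tokens = text.split()
    pad = (window - 2) // 2  # tokens kept on each side of the two name tokens
    punct = ".,;:!?\"'()"
    snippets = []
    for i in range(len(tokens) - 1):
        a = tokens[i].strip(punct).lower()
        b = tokens[i + 1].strip(punct).lower()
        if a == first.lower() and b == last.lower():
            start = max(0, i - pad)
            end = min(len(tokens), i + 2 + pad)
            snippet = " ".join(tokens[start:end])
            # Mask every occurrence of the first and last name inside the window.
            snippet = re.sub(r"\b" + re.escape(first) + r"\b", "FIRST", snippet, flags=re.IGNORECASE)
            snippet = re.sub(r"\b" + re.escape(last) + r"\b", "LAST", snippet, flags=re.IGNORECASE)
            snippets.append(snippet)
    return snippets


text = "w " * 60 + "John Smith said John is here " + "z " * 60
print(extract_snippets(text, "John", "Smith", window=10))
```

Masking the name keeps the clustering from grouping snippets simply because they share the name itself.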

Feature Extraction Keywords –tfidf-scored tokenized keywords from the text snippets. Entities –Any person's name that occurs anywhere on the page, and any entity that exists in the Stanford TAP knowledge base. Descriptions –Appositives and noun phrase modifiers that modify the name reference in the snippet. Phrases –Heads of all noun phrases in the snippet.
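
The keyword features can be illustrated with a small tf-idf computation. The slides do not say which tf-idf variant was used, so the sketch below assumes one common form (relative term frequency times log inverse document frequency) and plain whitespace tokenization; `tfidf_vectors` is a hypothetical helper name.

```python
import math
from collections import Counter


def tfidf_vectors(snippets):
    """Return one sparse {token: tf-idf weight} dict per snippet."""
    docs = [s.lower().split() for s in snippets]
    n = len(docs)
    # Document frequency: number of snippets containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


vecs = tfidf_vectors(["FIRST LAST actor film", "FIRST LAST professor algorithm"])
print(vecs[0])
```

Tokens that appear in every snippet, such as the masked FIRST and LAST, receive weight zero under this variant, so only the distinguishing keywords contribute.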

Example of Description

Clustering K-means Clustering –Any cluster that fell below a membership threshold (5) had its centroid reseeded to the center of the largest cluster plus a small offset. Incremental Clustering –Seed generation –Classification –Merging
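
The reseeding rule for the K-means baseline can be sketched as below. This is an illustration under stated assumptions: dense feature vectors, squared Euclidean distance, a fixed number of iterations, and a uniform random offset. Only the membership threshold of 5 comes from the slide; the function and parameter names are hypothetical.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(c) / len(points) for c in zip(*points)]


def kmeans_reseed(points, k, min_size=5, iters=20, offset=1e-3, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        assign = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        members = [[p for p, a in zip(points, assign) if a == j] for j in range(k)]
        largest = max(range(k), key=lambda j: len(members[j]))
        center = mean(members[largest])
        for j in range(k):
            if j == largest:
                centroids[j] = center
            elif len(members[j]) < max(min_size, 1):
                # Undersized cluster: reseed near the center of the largest cluster.
                centroids[j] = [c + rng.uniform(-offset, offset) for c in center]
            else:
                centroids[j] = mean(members[j])
    return assign, centroids


points = [[0.0, 0.0]] * 10 + [[10.0, 10.0]] * 10
labels, cents = kmeans_reseed(points, k=2)
print(labels)
```

Reseeding next to the largest cluster gives an undersized cluster a chance to split off part of a cluster that may be covering more than one individual.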

Seed Generation The goal of the seed generation step is to form a set of highly precise seed clusters that need not cover the entire set of documents. Each feature is evaluated in turn in tfidf order, and one of three actions is performed: –If this feature has not appeared in any page in the seed clusters and occurs in more than a threshold number of pages, then … –If this feature has appeared in another seed cluster and the ratio is greater than a threshold, then … –Otherwise, skip the feature.

Classification This step classifies each page that was not assigned to a seed cluster. For each page, they find the cluster that is closest to the page in the feature space. If the distance is below a threshold, the page is added to that cluster. Otherwise, they find the cluster that is closest to the page in the entity co-occurrence space. If that distance is below a threshold, the page is added to that cluster. If the page is not close enough to any existing cluster, a singleton cluster is created containing just this page.
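
The two-stage decision on this slide can be sketched as follows. The slides do not name the distance measure or the threshold values, so this sketch assumes cosine distance to a cluster centroid and placeholder thresholds of 0.5. Pages are represented as dicts with hypothetical "features" and "entities" sparse vectors.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse {key: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 if nu == 0 or nv == 0 else 1.0 - dot / (nu * nv)


def centroid(pages, space):
    """Average the sparse vectors of a cluster in the given space."""
    total = {}
    for page in pages:
        for t, w in page[space].items():
            total[t] = total.get(t, 0.0) + w / len(pages)
    return total


def classify(pages, clusters, feat_thresh=0.5, ent_thresh=0.5):
    """Assign each page to the nearest cluster, trying the feature space first,
    then the entity space, and otherwise creating a singleton cluster."""
    for page in pages:
        for space, thresh in (("features", feat_thresh), ("entities", ent_thresh)):
            dists = [cosine_distance(page[space], centroid(c, space)) for c in clusters]
            if dists and min(dists) < thresh:
                clusters[dists.index(min(dists))].append(page)
                break
        else:
            clusters.append([page])
    return clusters


seed = [[{"features": {"actor": 1.0}, "entities": {"Oscar": 1.0}}]]
pages = [
    {"features": {"actor": 1.0, "film": 0.2}, "entities": {}},
    {"features": {"math": 1.0}, "entities": {"Oscar": 1.0}},
    {"features": {"chef": 1.0}, "entities": {"Paris": 1.0}},
]
result = classify(pages, seed)
print([len(c) for c in result])
```

In this example the first page joins the seed cluster through its features, the second through its entities, and the third becomes a singleton.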

Cluster Merging The first two steps often create too many clusters, so a final step merges them. Each cluster is repeatedly merged with its nearest neighbor in the feature space until no remaining cluster is close enough to it.
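
A simplified sketch of the merging step. It merges the globally closest pair of clusters until no pair is within the threshold, which is a greedy variant of the per-cluster rule on the slide. Dense vectors, Euclidean distance between centroids, and the threshold value are assumptions made for illustration.

```python
import math


def centroid(cluster):
    """Component-wise mean of the vectors in a cluster."""
    return [sum(c) / len(cluster) for c in zip(*cluster)]


def merge_clusters(clusters, threshold):
    """Repeatedly merge the two closest clusters while they are closer than threshold."""
    clusters = [list(c) for c in clusters]  # copy so the input is not modified
    while len(clusters) > 1:
        cents = [centroid(c) for c in clusters]
        d, i, j = min(
            (math.dist(cents[i], cents[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d >= threshold:
            break
        clusters[i].extend(clusters.pop(j))
    return clusters


clusters = [[[0, 0], [0, 1]], [[0.5, 0.5]], [[10, 10]]]
print(merge_clusters(clusters, threshold=2.0))
```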

Evaluation Metric B-CUBED metric
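
For reference, B-CUBED precision and recall can be computed per item and then averaged. The sketch below follows the standard definition of the metric: an item's precision is the fraction of its predicted cluster that shares its true label, and its recall is the fraction of its true class that lands in its predicted cluster. The function name and input format are illustrative.

```python
from collections import Counter


def b_cubed(predicted, gold):
    """predicted, gold: equal-length lists of cluster labels, one per item.
    Returns (precision, recall) averaged over all items."""
    n = len(predicted)
    pred_sizes = Counter(predicted)
    gold_sizes = Counter(gold)
    overlap = Counter(zip(predicted, gold))  # items sharing both labels
    precision = sum(overlap[(p, g)] / pred_sizes[p] for p, g in zip(predicted, gold)) / n
    recall = sum(overlap[(p, g)] / gold_sizes[g] for p, g in zip(predicted, gold)) / n
    return precision, recall


p, r = b_cubed([0, 0, 0, 1], ["a", "a", "b", "b"])
print(round(p, 3), round(r, 3))  # 0.667 0.75
```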

Evaluating Features

Focus on Precision They compute the cohesion of each cluster and find that many smaller clusters have high precision. They give the algorithm the ability to endorse certain clusters as appearing to be of high quality. At approximately 10% of the data, the algorithm is able to select clusters of near-perfect precision. They consider the case of 10 distinct names with 200 snippets per name, resulting in 2000 data points to be clustered.

Incremental Clustering They consider terminating the algorithm after each phase.

Conclusion They have presented a technique for disambiguating occurrences of an ambiguous name in snippets of web text referring to individuals. They show that, over typical web references, linguistically enhanced feature vectors combined with an incremental classifier can return results for 25% of the data with precision in excess of 0.95, outperforming the non-enhanced approach by a factor of around 2.5.