Intelius-NYU Cold Start System Ang Sun, Xin Wang, Sen Xu, Yigit Kiran, Shakthi Poornima, Andrew Borthwick (Intelius Inc.) Ralph Grishman (New York University)

Presentation transcript:

Outline
– Cold Start Slot Filling System
– Entity Linking for Person and Organization
– Entity Linking for Geo-Political Entity (GPE)
– Experiments

Cold Start Slot Filling System
The NYU 2011 Regular Slot Filling System

Cold Start Slot Filling System
Adapt the NYU system to Cold Start
1. Within-document coreference
   – extract entities for a single document
   – extract the longest name mention as the canonical mention
     – canonical mention: Maurice Sercarz
     – mention: Sercarz
2. Slot filling for GPEs
   – infer slot fills from the extractions of person and organization entities
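The canonical-mention rule in step 1 can be sketched as follows. This is a minimal illustration, not the NYU system's code: the input format (one list of name mentions per entity) and the function name `canonical_mentions` are assumptions, and "longest" is taken to mean most tokens, with character length as a tie-breaker.

```python
def canonical_mentions(coref_chains):
    """Map every name mention to the longest mention in its coreference chain.

    coref_chains: list of lists of name-mention strings, one inner list per
    entity found in a single document (hypothetical input format).
    """
    mapping = {}
    for chain in coref_chains:
        # Assumption: "longest" means most tokens, then most characters.
        canonical = max(chain, key=lambda m: (len(m.split()), len(m)))
        for mention in chain:
            mapping[mention] = canonical
    return mapping


# Example taken from the slide: "Sercarz" resolves to "Maurice Sercarz".
chains = [["Sercarz", "Maurice Sercarz"]]
result = canonical_mentions(chains)
```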

Cold Start Slot Filling System
Adapt the NYU system to Cold Start
3. Contextual information extraction

Outline Cold Start Slot Filling System Entity Linking for Person and Organization Entity Linking for Geo-Political Entity (GPE) Experiments

Intelius Entity Linking Pipeline
Pipeline stages (from the slide diagram):
– Blocking: Top Level Blocking, Sub-blocking
– Clustering: Transitive Closure, Graph Partition
– Machine Learning based Link Scoring
– Coalesce Records
– Person Profiles
Goal: Conflate billions of entities
– MapReduce based
– Sequential file access
– Optimized for batch processing billions of records sequentially
– Optimization and compromises crucial to success
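The transitive-closure step named in the clustering stage can be illustrated with a small union-find sketch. It assumes the input is a list of record-ID pairs that a link scorer has already accepted; the function name and the record IDs are hypothetical, and the slides do not say how the production MapReduce version is implemented.

```python
def transitive_closure(pairs):
    """Group record IDs into clusters: if a~b and b~c, then a, b, c share a cluster."""
    parent = {}

    def find(x):
        # Find the cluster representative, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical accepted links between records.
linked = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = transitive_closure(linked)
```

Pure transitive closure can chain unrelated records together through one bad link, which may be why the slide lists graph partitioning alongside it.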

Blocking
Bring together records likely to belong to the same entity
Blocking Keys
– Hash functions
– Hand crafted and domain specific
  Equivalence classes of names and titles
  Contextual PER, ORG and GPE
  Keywords (TFIDF)
– Dynamically selected
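A blocking key built from a name equivalence class might look like the sketch below. The nickname table, the record fields, and the key format are all illustrative assumptions; the slides do not give the actual keys or hash functions.

```python
from collections import defaultdict

# Hypothetical equivalence classes of first names (illustrative only).
NAME_CLASSES = {"bill": "william", "will": "william", "bob": "robert"}


def blocking_key(record):
    """Build a key so that records likely to be the same person collide."""
    first = record["first"].lower()
    last = record["last"].lower()
    first = NAME_CLASSES.get(first, first)  # map nickname to its class
    return f"{first}|{last}"


def build_blocks(records):
    """Group record IDs by blocking key; only records in a block get compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    return dict(blocks)


records = [
    {"id": 1, "first": "Bill", "last": "Smith"},
    {"id": 2, "first": "William", "last": "Smith"},
    {"id": 3, "first": "Robert", "last": "Jones"},
]
blocks = build_blocks(records)
```

Pairwise link scoring then runs only within each block, which avoids comparing every record against every other record.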

Link Scoring
ADTree-based supervised model
Training examples:
– Sample selection: random and selective (through active learning)
– Labeling process: three phases
  – Amazon Mechanical Turk labeling
  – Internal data rater inspection
  – Researchers
  Multiple rounds of relabeling and inspection are needed if the quality of labels from Turkers is low
– Size: 50,000 pairs for PER and 4,000 pairs for ORG

Features
PER Feature Types (116 features):
– General Demographic: Name frequency Birthday Location Population Combinations
– Comparing KBP specific slots: Jobs Educations
– TFIDF and N-gram: for contextual text information
ORG Feature Types (60 features):
– Location based
– Comparing KBP specific slots
– TFIDF and N-gram: for contextual text information
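As one illustration of a TFIDF-based contextual feature, the sketch below scores two context strings by cosine similarity over TF-IDF vectors. The whitespace tokenizer, the smoothed IDF formula, and the sample contexts are assumptions made for the example; the slides do not specify how the system computes these features.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption): log((1 + n) / (1 + df)) + 1.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical contextual text attached to three person records.
contexts = [
    "software engineer seattle",
    "software engineer seattle washington",
    "nurse boston hospital",
]
vecs = tfidf_vectors(contexts)
sim_close = cosine(vecs[0], vecs[1])
sim_far = cosine(vecs[0], vecs[2])
```

A similarity like `sim_close` would be one input among many to the link-scoring model, not a decision on its own.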

ORG ADTree Model (Partial)

Outline Cold Start Slot Filling System Entity Linking for Person and Organization Entity Linking for Geo-Political Entity (GPE) Experiments

GPE Disambiguation
GPE (Toponyms) can be ambiguous
– China: Country or Town in Maine, US
– Georgia: Country or State in the US
– Springfield: exists in more than 10 US States
– Berlin: Capital of Germany, State in Germany, also common city name in the US
– Over 5,000 ambiguous toponyms from geonames.org
Use contextual GPE to disambiguate
– Candidates with least cumulative spatial distance (Buscaldi and Rosso, 2008)
– Voting schema with a hierarchical gazetteer

Hierarchical Gazetteer
Country > State/Province > City/Town

Gazetteer Sample
Key      Value
China    Country_POP_1,330,044,000; City_InState_Maine_InCountry_US
Seattle  City_InState_Washington_InCountry_US
Georgia  Country_POP_4,630,000; State_POP_8,975,842_InCountry_US
…        …
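The sample values appear to pack each candidate sense into an underscore-delimited string, with senses separated by semicolons. A minimal parser under that reading is sketched below; the field names in the output (`level`, `population`, `state`, `country`) are assumptions, and place names that themselves contain underscores are not handled.

```python
def parse_entry(value):
    """Parse a gazetteer value into a list of candidate senses.

    Assumed layout per sense: Level[_POP_<number>][_InState_<name>][_InCountry_<name>]
    """
    candidates = []
    for sense in value.split(";"):
        parts = sense.strip().split("_")
        cand = {"level": parts[0]}
        i = 1
        while i + 1 < len(parts):
            tag, val = parts[i], parts[i + 1]
            if tag == "POP":
                cand["population"] = int(val.replace(",", ""))
            elif tag.startswith("In"):
                # "InState" -> "state", "InCountry" -> "country"
                cand[tag[2:].lower()] = val
            i += 2
        candidates.append(cand)
    return candidates


# Entries copied from the gazetteer sample on the slide.
GAZETTEER = {
    "China": "Country_POP_1,330,044,000; City_InState_Maine_InCountry_US",
    "Seattle": "City_InState_Washington_InCountry_US",
    "Georgia": "Country_POP_4,630,000; State_POP_8,975,842_InCountry_US",
}
china = parse_entry(GAZETTEER["China"])
georgia = parse_entry(GAZETTEER["Georgia"])
```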

Voting Schema
Topo j's vote for candidate Topo i:
  +3: if Topo i and Topo j are sibling cities, e.g.: Austin, TX and Houston, TX
  +5: if Topo i and Topo j are sibling States, e.g.: Georgia and Alabama
  +10: if Topo i is offspring of Topo j, e.g.: Austin, TX and Texas
  +5: if Topo i is parent of Topo j, e.g.: Washington and Seattle, WA
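The vote weights above can be turned into a small scoring function. The sketch below is an illustration under stated assumptions: toponyms are dicts with hypothetical `name`, `level`, `state`, and `country` fields, and the context toponyms are treated as already resolved, which the slides do not confirm.

```python
def is_offspring(child, parent):
    """True if `child` sits directly or indirectly under `parent` in the hierarchy."""
    if parent["level"] == "State":
        return (
            child["level"] == "City"
            and child.get("state") == parent["name"]
            and child.get("country") == parent.get("country")
        )
    if parent["level"] == "Country":
        return child.get("country") == parent["name"]
    return False  # cities have no offspring in this three-level hierarchy


def vote(cand, ctx):
    """Vote of context toponym `ctx` (Topo j) for candidate `cand` (Topo i)."""
    same_country = (
        cand.get("country") is not None and cand.get("country") == ctx.get("country")
    )
    if cand["level"] == "City" and ctx["level"] == "City":
        same_state = (
            cand.get("state") is not None and cand.get("state") == ctx.get("state")
        )
        return 3 if same_state and same_country else 0  # sibling cities
    if cand["level"] == "State" and ctx["level"] == "State":
        return 5 if same_country else 0  # sibling states
    if is_offspring(cand, ctx):
        return 10
    if is_offspring(ctx, cand):
        return 5
    return 0


def disambiguate(candidates, context):
    """Pick the candidate sense with the highest total vote from the context."""
    return max(candidates, key=lambda c: sum(vote(c, ctx) for ctx in context))


georgia_senses = [
    {"name": "Georgia", "level": "Country"},
    {"name": "Georgia", "level": "State", "country": "US"},
]
alabama = {"name": "Alabama", "level": "State", "country": "US"}
best = disambiguate(georgia_senses, [alabama])

austin = {"name": "Austin", "level": "City", "state": "Texas", "country": "US"}
houston = {"name": "Houston", "level": "City", "state": "Texas", "country": "US"}
texas = {"name": "Texas", "level": "State", "country": "US"}
```

With "Alabama" in the context, the US-state sense of "Georgia" collects the sibling-state vote and wins over the country sense.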

Outline Cold Start Slot Filling System Entity Linking for Person and Organization Entity Linking for Geo-Political Entity (GPE) Experiments

– 671 million Intelius People Profiles
– 74+ million Topix News/blog articles
– 167+ million People Entities
– 26.5 million Conflated
– Link News Profiles to Intelius Profiles
– Turker/Data Rater Evaluate: 8.06% were incorrectly conflated
Pipeline stages (from the slide diagram): Blocking (Top Level Blocking, Sub-blocking); Clustering (Transitive Closure, Graph Partition); Machine Learning based Link Scoring; Coalesce Records; Person Profiles

Thanks!

?