Entity-Centric Document Filtering: Boosting Feature Mapping through Meta-Features
Mianwei Zhou, Kevin Chen-Chuan Chang
University of Illinois at Urbana-Champaign
Much of the Information Sought on the Web Nowadays is about Entities.
The Web: a huge entity database.
Example posts: "We love George!!"; "OMG! iPad Air is coming out~~"
Example needs: How to improve our products' quality? TREC-KBA task: how to help Wikipedia editors enrich Wikipedia?
Proposal: Entity-Centric Document Filtering System
Entity-Centric Document Filtering System: Automatically Identify Relevant Documents for Entities
Input: billions of news articles, blogs, forums, tweets..., together with the entities of interest.
Output: for each entity, the relevant documents separated from the irrelevant ones.
INPUT: The Entity Name Alone is Usually Insufficient.
INPUT: Use an Identification Page to Characterize the Target Entity.
Entity identification pages:
1. Resolve the ambiguity problem.
2. Provide more information about the entity.
OUTPUT: Relevant/Irrelevant Documents for Target Entities.
Bill Gates
  Relevant: "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday..."
  Irrelevant: "Steve Jobs' story is completely different from Bill Gates..."
Michael Jordan (NBA Player)
  Relevant: "Michael Jordan is considered by many the best basketball player in NBA history."
  Irrelevant: "Michael Jordan is a leading researcher in machine learning and AI."
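Problem: Entity-Centric Learning to Filter
Training phase: entities, each with a Wiki page and documents labeled relevant or irrelevant, are used to learn an entity-centric document filter.
Testing phase: given a new entity's Wiki page, the filter must predict the labels of that entity's unlabeled documents.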
How to Predict Document Relevance for an Entity Characterized by an Identification Page?
Traditional IR models such as BM25 and language models do not work:
1. They are designed for short queries.
2. Entity pages contain many noisy keywords.
Our Idea: Check whether the document mentions the most basic information about the entity.
Example keywords (Bill Gates): Microsoft, Windows, Seattle, Philanthropist.
For an Entity with Labeled Documents, Learning its Important Keywords is Simple.
Relevant document: "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday..."
Irrelevant document: "Steve Jobs' story is completely different from Bill Gates..."
Learned quantity: relevance of document d for entity e.
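The slide's relevance formula is not preserved in the transcript, so the following is only a minimal sketch of what learning keyword importance for a single entity could look like. It assumes a logistic model over binary keyword-presence features; the keyword list, the substring matching, and the function names are illustrative assumptions, not the authors' implementation.

```python
import math

def keyword_features(doc, keywords):
    """Binary indicator per keyword: 1.0 if the keyword occurs in the document."""
    text = doc.lower()
    return [1.0 if kw.lower() in text else 0.0 for kw in keywords]

def train_keyword_weights(docs, labels, keywords, epochs=200, lr=0.5):
    """Fit one logistic-regression weight per keyword (plus a bias) by SGD."""
    weights = [0.0] * len(keywords)
    bias = 0.0
    for _ in range(epochs):
        for doc, y in zip(docs, labels):
            x = keyword_features(doc, keywords)
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of relevance
            err = y - p                      # gradient of the log-likelihood
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias

# Hypothetical keywords taken from the entity's ID page (assumption).
keywords = ["Microsoft", "Steve Jobs"]
docs = [
    "Bill Gates, speaking as co-founder of Microsoft, will give a talk next Tuesday",
    "Steve Jobs' story is completely different from Bill Gates",
]
labels = [1, 0]  # 1 = relevant, 0 = irrelevant

weights, bias = train_keyword_weights(docs, labels, keywords)
importance = dict(zip(keywords, weights))
```

The learned weights are tied to the literal keywords of this one entity, which is exactly why they cannot be reused for an entity whose page contains different words.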
However, Such Keyword Importance is Not Adaptable to Other Entities.
Training entities (with labeled documents), example keywords: Microsoft, Windows, Seattle, Philanthropist.
New entities (without labeled documents), example keywords: NBA, Chicago Bulls, MVP, UNC.
Challenge: transferring keyword importance from training entities to new entities.
Keyword: Microsoft and Keyword: Chicago Bulls
Both of them...
1. are mentioned a lot in their Wiki pages.
2. are organizations.
3. appear in the info-box.
...
Therefore they should have similar importance.
Meta-Feature -- Features of Features: Properties that are related to keyword importance
General meta-features: IDF, IsNoun, InEntity, ...
ID-page-related meta-features:
  Wiki page: InInfobox, InOpenPara, ...
  Amazon page: InSpec, InReview, ...
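To make the idea of "features of features" concrete, here is a small sketch that describes a keyword by three of the meta-features named on the slide (IDF, InInfobox, InOpenPara). The ID-page representation (a dict of token sets), the smoothed IDF formula, and the toy data are assumptions chosen for illustration only.

```python
import math

def idf(keyword, corpus):
    """Smoothed inverse document frequency over a list of token sets."""
    df = sum(1 for doc in corpus if keyword in doc)
    return math.log((1 + len(corpus)) / (1 + df))

def meta_features(keyword, id_page, corpus):
    """Describe a keyword by its properties rather than by its identity."""
    return {
        "IDF": idf(keyword, corpus),
        "InInfobox": 1.0 if keyword in id_page["infobox"] else 0.0,
        "InOpenPara": 1.0 if keyword in id_page["opening_paragraph"] else 0.0,
    }

# Toy background corpus and two hypothetical ID pages (assumptions).
corpus = [
    {"the", "microsoft", "talk"},
    {"the", "nba", "bulls"},
    {"the", "weather"},
]
gates_page = {
    "infobox": {"microsoft", "seattle"},
    "opening_paragraph": {"the", "microsoft", "philanthropist"},
}
jordan_page = {
    "infobox": {"bulls", "nba"},
    "opening_paragraph": {"the", "bulls", "nba"},
}

microsoft = meta_features("microsoft", gates_page, corpus)
bulls = meta_features("bulls", jordan_page, corpus)
stopword = meta_features("the", gates_page, corpus)
```

In this toy setup "microsoft" and "bulls" receive identical meta-feature values even though they belong to different entities, while "the" is separated from both by its low IDF.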
Clustering-based Keyword Mapping
Training phase: keywords are grouped into clusters, e.g. {Microsoft, Harvard, Cascade, Hollywood, NKU, CFR, ...} and {the, is, this, a, here, as, ...}.
Testing phase: keywords of new entities (NBA, UNC, Bobcats, Wiki, the, must, there, ...) are mapped to these clusters.
Document Relevance based on Keyword Clusters
(Figure: keyword clusters and the keyword importance attached to each cluster.)
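The figure itself is not recoverable, so the sketch below only illustrates one plausible reading: each keyword of a new entity is mapped to the nearest cluster in meta-feature space, and a document is scored by summing the importance of the clusters whose keywords it mentions. The centroids, the importance values, and the additive scoring rule are all assumptions.

```python
def nearest_cluster(vec, centroids):
    """Index of the centroid closest to vec (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
    return dists.index(min(dists))

def relevance(doc_tokens, keyword_meta, centroids, cluster_importance):
    """Sum the importance of the cluster of every keyword the document mentions."""
    score = 0.0
    for kw, vec in keyword_meta.items():
        if kw in doc_tokens:
            score += cluster_importance[nearest_cluster(vec, centroids)]
    return score

# Hypothetical clusters learned from training entities: (IDF, InInfobox).
centroids = [(3.0, 1.0), (0.1, 0.0)]
cluster_importance = [1.0, 0.0]  # informative cluster vs. stopword-like cluster

# Meta-feature vectors for keywords of a new entity (Michael Jordan, NBA player).
keyword_meta = {"nba": (2.8, 1.0), "bulls": (3.2, 1.0), "the": (0.0, 0.0)}

doc_nba = {"michael", "jordan", "nba", "bulls", "the"}
doc_ml = {"michael", "jordan", "the", "machine", "learning"}

score_nba = relevance(doc_nba, keyword_meta, centroids, cluster_importance)
score_ml = relevance(doc_ml, keyword_meta, centroids, cluster_importance)
```

No labeled documents for the new entity are needed here: the importance values come from the clusters, and only the mapping from keywords to clusters depends on the new entity's page.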
Traditional Clustering Algorithm Might Fail
Example keywords to cluster: the, WA, for, October, programmer, consistently, MS, Oscar, actor, is, Occupation, Hollywood, screenwriter, ...
1. Irrelevant meta-features might lead to useless clusters.
2. There are different possible ways of clustering. Which one is better?
BoostMapping: Boosting Effective Clusters
Example clusters: {Microsoft, Harvard, Cascade, Hollywood, NKU, CFR, ...} and {the, is, this, a, here, as, ...}.
Clustering is guided by the document labels.
Objective of clustering: boosting the prediction accuracy of relevance.
Only useful clusters are generated.
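The transcript does not give the BoostMapping algorithm itself. As a rough illustration of the stated objective, the sketch below performs a single label-guided selection step: among candidate clusters, each defined here by a threshold on one meta-feature, it keeps the one whose keyword mentions best predict the document labels. A boosting procedure would repeat such a step on reweighted examples; that loop, the candidate definition, and all names are assumptions and are not taken from the authors' method.

```python
def cluster_hits(doc_tokens, keyword_meta, feature, threshold):
    """Count mentioned keywords whose meta-feature value is >= threshold."""
    return sum(
        1 for kw, meta in keyword_meta.items()
        if meta[feature] >= threshold and kw in doc_tokens
    )

def best_cluster(examples, candidates):
    """Pick the (feature, threshold) cluster whose hits best predict the labels.

    examples: list of (doc_tokens, keyword_meta, label), label in {0, 1}.
    A document is predicted relevant when it mentions at least one keyword
    belonging to the candidate cluster.
    """
    def accuracy(candidate):
        feature, threshold = candidate
        correct = sum(
            1 for doc, meta, label in examples
            if (cluster_hits(doc, meta, feature, threshold) > 0) == (label == 1)
        )
        return correct / len(examples)
    return max(candidates, key=accuracy)

# Hypothetical meta-features for keywords on one training entity's ID page.
gates_meta = {
    "microsoft": {"IDF": 3.0, "InInfobox": 1.0},
    "october": {"IDF": 3.1, "InInfobox": 0.0},
    "the": {"IDF": 0.0, "InInfobox": 0.0},
}
examples = [
    ({"gates", "microsoft", "talk"}, gates_meta, 1),
    ({"gates", "jobs", "october", "the"}, gates_meta, 0),
    ({"gates", "the", "microsoft"}, gates_meta, 1),
]
candidates = [("IDF", 1.0), ("InInfobox", 1.0), ("IDF", 0.0)]

chosen = best_cluster(examples, candidates)
```

In this toy data a cluster based on IDF alone groups "october" with "microsoft" and mispredicts the irrelevant document, whereas the InInfobox cluster separates the labels, so the labels decide which grouping is kept.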
Three Datasets
1. TREC-KBA: 29 person entities, 52,238 documents; Wikipedia pages as ID pages.
2. Product: 39 product entities, 2,398 documents; Amazon pages as ID pages.
3. MilQuery (from the Million Query Track): 143 general entities, 8,208 documents; Wikipedia pages as ID pages.
Example entities: Hostage Rescue, Kodak, Dinosaur.
Performance Comparison with Baselines
QueryByName: use entity names as queries.
QBD-TFIDF: use TF-IDF to select important keywords as queries.
VectorSim: measure relevance based on query-document similarity.
LinearMapping: keyword mapping based on a linear function.
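For readers unfamiliar with the baselines, the following sketch shows the general shape of two of them: selecting the top TF-IDF keywords of an ID page as a query (in the spirit of QBD-TFIDF) and scoring a document by cosine similarity (in the spirit of VectorSim). The smoothed IDF weighting, the toy document frequencies, and the function names are assumptions; the slides do not specify the exact variants used in the experiments.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Term frequency times smoothed IDF for every token in a text."""
    tf = Counter(tokens)
    return {
        t: c * math.log((1 + n_docs) / (1 + doc_freq.get(t, 0)))
        for t, c in tf.items()
    }

def top_keywords(vec, k):
    """The k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(vec.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy collection statistics (assumption).
n_docs = 3
doc_freq = {"the": 3, "microsoft": 1, "gates": 2, "seattle": 1}

page = tfidf_vector(
    ["the", "the", "microsoft", "microsoft", "gates", "seattle"], doc_freq, n_docs
)
query = top_keywords(page, 2)

doc_a = tfidf_vector(["gates", "microsoft", "talk"], doc_freq, n_docs)
doc_b = tfidf_vector(["the", "weather"], doc_freq, n_docs)
sim_a = cosine(page, doc_a)
sim_b = cosine(page, doc_b)
```

Both baselines weight keywords using corpus statistics only, without any labeled documents, which is the contrast the comparison on this slide draws with the learned mappings.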