Attribute Extraction and Scoring: A Probabilistic Approach
Taesung Lee, Zhongyuan Wang, Haixun Wang, Seung-won Hwang
Microsoft Research Asia
Speaker: Bo Xu

Background: Goal of Creating a Knowledge Base
A knowledge base provides background knowledge to machines, enabling them to perform certain types of inference as humans do.
Three backbone relationships of the knowledge base:
isA: a relationship between a sub-concept and a concept, e.g., an IT company is a company
isInstanceOf: a relationship between an entity and a concept, e.g., Microsoft is a company
isPropertyOf: a relationship between an attribute and a concept, e.g., color is a property/attribute of wine

Attributes (isPropertyOf) play a critical role in knowledge inference.
Knowledge inference: given some attributes, guess which concept they describe.
E.g., given "capital city, population," people may infer that the concept is country; given "color, body, smell," people may think of wine.
However, in many cases the association may not be straightforward, and may require some guesswork even for humans.

However, in order to perform inference, knowing that a concept can be described by a certain set of attributes is not enough. Instead, we must know how important, or how typical, each attribute is for the concept.
Typicality:
P(c|a) denotes how typical concept c is, given attribute a.
P(a|c) denotes how typical attribute a is, given concept c.

Problem Definition: Goal

Framework

Attribute extraction from the web corpus
Concept-based extraction (CB), e.g., "... the acidity of a wine is an essential component of the wine...", which produces (c, a, n(c, a)) tuples.
Instance-based extraction (IB), e.g., "the taste of Bordeaux is...", which produces (i, a, n(i, a)) tuples.
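A minimal, hypothetical sketch of the CB-style pattern extraction (the actual extractor works over a parsed web corpus with richer syntactic patterns; the regex and function name here are illustrative assumptions):

```python
import re
from collections import Counter

# Hypothetical sketch: harvest (concept, attribute) pairs with the lexical
# pattern "the <a> of a/an/the <c>" and count co-occurrences to produce
# (c, a, n(c, a)) tuples for the CB list.
PATTERN = re.compile(r"\bthe (\w+) of (?:a|an|the) (\w+)", re.IGNORECASE)

def extract_cb_counts(sentences):
    counts = Counter()
    for sentence in sentences:
        for attribute, concept in PATTERN.findall(sentence):
            counts[(concept.lower(), attribute.lower())] += 1
    return [(c, a, n) for (c, a), n in counts.items()]

print(extract_cb_counts(
    ["The acidity of a wine is an essential component of the wine."]))
# -> [('wine', 'acidity', 1)]
```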

Attribute extraction from an external knowledge base
Extract instance-based attributes from DBpedia entities (KB).
Use n(i, a) = 1 to produce the (i, a, n(i, a)) tuples.

Attribute extraction from query logs
Extract instance-based attributes from query logs (QB). We cannot expect concept-based attributes from query logs, since people usually search for a specific instance rather than a general concept.
Two-pass approach:
First pass: extract candidate instance-attribute pairs (i, a) from the pattern "the a of (the/a/an) i", and extend the candidate list IU to include the attributes obtained from IB and KB.
Second pass: count the co-occurrence n(i, a) of i and a in the query log for each (i, a) ∈ IU, producing the (i, a, n(i, a)) list.
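A hedged sketch of the two-pass extraction over query logs; the pattern, helper names, and substring co-occurrence check below are illustrative assumptions, not the paper's implementation:

```python
import re
from collections import Counter

# Pass 1: harvest candidate (i, a) pairs from "the <a> of (the/a/an) <i>" and
#         merge in the candidates already found by IB and KB.
# Pass 2: count how often each candidate pair co-occurs within a query.
PATTERN = re.compile(r"\bthe (\w+) of (?:the |a |an )?(.+)$")

def first_pass(queries, ib_kb_pairs):
    candidates = set(ib_kb_pairs)                 # (instance, attribute) pairs
    for q in queries:
        m = PATTERN.search(q.lower())
        if m:
            attribute, instance = m.group(1), m.group(2).strip()
            candidates.add((instance, attribute))
    return candidates                             # the extended list IU

def second_pass(queries, candidates):
    counts = Counter()
    for q in queries:
        text = q.lower()
        for instance, attribute in candidates:    # crude substring co-occurrence
            if instance in text and attribute in text:
                counts[(instance, attribute)] += 1
    return [(i, a, n) for (i, a), n in counts.items()]   # (i, a, n(i, a)) list

queries = ["the taste of bordeaux", "bordeaux taste review"]
iu = first_pass(queries, ib_kb_pairs={("bordeaux", "price")})
print(second_pass(queries, iu))   # -> [('bordeaux', 'taste', 2)]
```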

Attribute Distribution
E.g., the attribute name is frequently observed in the CB pattern, such as "the name of a state is...", but people will not mention it with a specific state, as in "the name of Washington is...".

Attribute Scoring
Computing typicality from a CB list
Computing typicality from an IB list
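A hedged sketch of the typicality estimates these lists suggest (the paper's exact estimators and smoothing may differ), using the counts n(c, a) and n(i, a) defined above:

$P(a \mid c) \approx \dfrac{n(c, a)}{\sum_{a'} n(c, a')}$  (from the CB list)

$P(a \mid c) \approx \sum_{i} P(a \mid i, c)\, P(i \mid c)$  (from an IB list, aggregating over instances i of concept c)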

Computing naive P(a|i, c) and P(i|c) using Probase
A simplifying assumption: one instance belongs to one concept, i.e., P(c|i) = 1 if the concept-instance pair is observed in Probase, and 0 otherwise.

However, in reality, the same instance name appears with many concepts:
[C1] an ambiguous instance related to dissimilar concepts, e.g., 'Washington' can refer to a president and a state
[C2] an unambiguous instance related to similar concepts, e.g., 'George Washington' is a president, a patriot, and a historical figure

Unbiasing P(a|i, c)
JC(a, c) (join count): the number of instances in c that have attribute a.
JR(a, c) (join ratio): represents how likely a is associated with c.
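One plausible reading of the join ratio, assuming it simply normalizes the join count by the number of instances observed under c (the paper's exact definition may differ):

$JR(a, c) = \dfrac{JC(a, c)}{\lvert \{\, i : i \in c \,\} \rvert}$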

Unbiasing P(c|i)
Probase frequency counts n_p(c, i).
However, using these counts alone would not distinguish C2 from C1.
sim(c, c'): compares the instance sets of the two concepts using Jaccard similarity.
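A minimal sketch of this similarity, assuming sim(c, c') is the Jaccard coefficient of the two concepts' instance sets (the instance data shown is made up):

```python
# sim(c, c') as the Jaccard coefficient of the two concepts' instance sets;
# a high value signals the C2 case (similar concepts sharing instances),
# which the raw Probase counts alone cannot separate from C1.
def jaccard_sim(instances_c, instances_c_prime):
    a, b = set(instances_c), set(instances_c_prime)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

print(jaccard_sim({"george washington", "abraham lincoln"},
                  {"george washington", "thomas jefferson"}))  # ~0.33
```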

Typicality Score Aggregation
P(a|c) is estimated from one CB list (from web documents) and three IB lists (from web documents, the query log, and the knowledge base).
No single source can be most reliable for all concepts.
P(a|c) is therefore formulated as a combination of the sources, and the goal is to learn the weights for the given concept.
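A hedged sketch of the aggregation, assuming a per-concept linear combination of the source-specific typicality estimates $P_k(a \mid c)$ with learned weights $w_k^{(c)}$:

$P(a \mid c) \approx \sum_{k} w_k^{(c)}\, P_k(a \mid c)$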

Learning the weights
Ranking SVM approach: pairwise learning vs. regression.
Pairwise learning is chosen because the training data does not have to quantify an absolute typicality score.

Pairwise Learning
Given x1 > x2 > x3, reduce ranking to binary classification:
Positive examples: x1 - x2, x1 - x3, x2 - x3
Negative examples: x2 - x1, x3 - x1, x3 - x2
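A minimal Ranking SVM sketch of this pairwise reduction (an assumed setup with made-up feature vectors, not the paper's exact feature set):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Turn a ranked list x1 > x2 > x3 into pairwise difference vectors, train a
# binary classifier on them, and read off the learned weight vector.
def pairwise_transform(X_ranked):
    X_pairs, y = [], []
    for i in range(len(X_ranked)):
        for j in range(i + 1, len(X_ranked)):
            X_pairs.append(X_ranked[i] - X_ranked[j]); y.append(+1)  # x_i > x_j
            X_pairs.append(X_ranked[j] - X_ranked[i]); y.append(-1)
    return np.array(X_pairs), np.array(y)

# each row holds the per-source typicality features of one attribute,
# already ordered from most to least typical
X_ranked = np.array([[0.9, 0.8], [0.5, 0.6], [0.1, 0.2]])
X_pairs, y = pairwise_transform(X_ranked)
model = LinearSVC().fit(X_pairs, y)
print(model.coef_)  # learned weights over the source features
```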

Features
A linear combination of the five features.

Attribute Synonym Set
Leverage evidence from Wikipedia: Wikipedia redirects and Wikipedia internal links.
We take each connected component as a synonymous attribute cluster.
Within each cluster, we set the most frequent attribute as the representative of the synonymous attributes.
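A hedged sketch of this clustering step, assuming redirects and internal links are treated as undirected edges between attribute names and components are found with a small union-find (the edges and frequencies below are made up):

```python
from collections import defaultdict

# Cluster attribute synonyms as connected components over redirect/link edges;
# within each component, pick the most frequent attribute as representative.
def synonym_representatives(edges, freq):
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x
    def union(x, y):
        parent[find(x)] = find(y)
    for a, b in edges:                      # e.g., ("cost", "price") from a redirect
        union(a, b)
    clusters = defaultdict(list)
    for a in parent:
        clusters[find(a)].append(a)
    # map each attribute to the most frequent member of its cluster
    return {a: max(members, key=lambda m: freq.get(m, 0))
            for members in clusters.values() for a in members}

edges = [("cost", "price"), ("price", "pricing")]
print(synonym_representatives(edges, freq={"price": 120, "cost": 80, "pricing": 10}))
# -> {'cost': 'price', 'price': 'price', 'pricing': 'price'}
```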

Experiment: Experimental Settings

Experiment: Evaluation Measures and Baseline
The number of labeled attributes for evaluation is 4,846, from 12 concepts.
Evaluation measures: relevance scores of 1, 2/3, 1/3, and 0 are assigned to the four label groups, respectively.
Since the absolute typicality value is hard to acquire, it is approximated by treating the set of labeled attributes as the universe.
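A hedged sketch of how graded precision and recall at a cutoff k might be computed from these relevance scores, with the labeled attribute set as the universe (the paper's exact measures may differ):

```python
# Relevance values come from the four label groups: 1, 2/3, 1/3, 0.
def precision_at_k(ranked_relevance, k):
    top = ranked_relevance[:k]
    return sum(top) / k if k else 0.0

def recall_at_k(ranked_relevance, all_relevance, k):
    total = sum(all_relevance)              # labeled attributes as the universe
    return sum(ranked_relevance[:k]) / total if total else 0.0

ranked = [1.0, 2/3, 0.0, 1/3]               # relevance of the returned attributes, in rank order
universe = [1.0, 1.0, 2/3, 1/3, 0.0]        # relevance of all labeled attributes
print(precision_at_k(ranked, 3), recall_at_k(ranked, universe, 3))
```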

Precision

Recall

Thank You!