Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.

Introduction
- The recognition of names and their associated categories has traditionally relied on semantic lexicons and gazetteers.
- The paper presents a lightly supervised method for acquiring named entities in arbitrary categories.
- The method applies lightweight lexico-syntactic extraction patterns to the unstructured text of Web documents.
- Four differences from traditional approaches:
  - It does not require any start-up seed names or training data
  - It does not encode any domain knowledge in its extraction patterns
  - It is only lightly supervised and data-driven
  - It does not impose any a priori restriction on the categories of extracted names

Problems
- On the input side, users submit queries that usually contain only a few words.
- On the output side, there are quantitative limits, since most users inspect only the first few documents in the search result set.
- Beyond measurements such as term frequency or term proximity, the occurrence of named entities signals prominent pieces of information.
- Users might search for lists of names, popular names, or unfamiliar names.
- Names and their categories are hidden within the unstructured text of Web documents.

Extracting Categorized Names
- Goal: identify names and their categories within unstructured text.
- Lightweight extraction method
  - Document pre-processing
  - Extraction of categorical facts
  - Derivation of categories
- Extensions
  - Extraction of similar names
  - Automatically derived patterns

Lightweight Extraction Method - Document Pre-processing
- After HTML tags are filtered out, the input documents are tokenized, split into sentences, and part-of-speech tagged with the TnT tagger.
- Each sequence of capitalized terms in a sentence is marked as a potential instance name.
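The candidate-marking step can be illustrated with a minimal sketch. It assumes a simple regex tokenizer in place of the TnT tagger pipeline, and the function name `candidate_names` is hypothetical, not taken from the paper.

```python
import re

def candidate_names(sentence):
    """Return maximal runs of capitalized tokens as potential instance names.

    Assumption: a regex tokenizer stands in for the real pre-processing
    (HTML filtering, sentence splitting, TnT tagging), which is omitted here.
    """
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)
    names, run = [], []
    for tok in tokens:
        if tok[0].isupper():
            run.append(tok)          # extend the current capitalized run
        else:
            if run:
                names.append(" ".join(run))
            run = []                 # any other token ends the run
    if run:
        names.append(" ".join(run))
    return names

print(candidate_names(
    "That is because software firewalls, including Zone Alarm, "
    "offer some semblance of this feature."
))
# -> ['That', 'Zone Alarm']
```

The sentence-initial "That" is also marked at this stage. In the method described here, such candidates are dropped later because they are not associated with a categorical fact.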

Lightweight Extraction Method - Extraction of Categorical Facts
- A categorical fact is a sentence nugget that is likely to state explicitly the category of the associated instance name.
- The fact and the associated instance name are captured with a set of patterns that can be summarized as:
  [StartOfSent] X [such as | including] N [and|,|.]
  - N is the potential instance name
  - X is the categorical fact
- Matching the patterns against sentences yields pairs (X,N) of categorical facts and instance names.
- Example: “That is because software firewalls, including Zone Alarm, offer some semblance of this feature.”
- All potential instance names that are not associated with a categorical fact are discarded.
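The summarized pattern above can be approximated with a regular expression. This is a sketch under assumptions: the real system works on tagged tokens, whereas this version works on raw text, and the names `PATTERN` and `extract_pair` are hypothetical.

```python
import re

# Rough regex rendering of: [StartOfSent] X [such as | including] N [and|,|.]
PATTERN = re.compile(
    r"^(?P<fact>.+?),?\s+(?:such as|including)\s+"   # X, then the trigger phrase
    r"(?P<name>[A-Z]\w*(?:\s+[A-Z]\w*)*)"            # N: a run of capitalized terms
    r"(?=\s+and\b|\s*[,.])"                          # N must be followed by and , or .
)

def extract_pair(sentence):
    """Return (categorical_fact, instance_name), or None if no pattern matches."""
    m = PATTERN.search(sentence)
    return (m.group("fact"), m.group("name")) if m else None

print(extract_pair(
    "That is because software firewalls, including Zone Alarm, "
    "offer some semblance of this feature."
))
# -> ('That is because software firewalls', 'Zone Alarm')
```

Sentences that contain neither trigger phrase return `None`, which corresponds to discarding instance names that have no categorical fact.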

Lightweight Extraction Method - Derivation of Categories
- The categorical fact X of each pair (X,N) is searched for the noun phrase that encodes the category of the associated name.
- The phrase is approximated by the rightmost non-recursive noun phrase whose last component is a plural-form noun.
- Three cases:
  (1) If no plural-form noun phrase exists near the end of the categorical fact, the pair (X,N) is discarded.
  (2) If a plural-form noun phrase exists near the end of the categorical fact but is immediately (e.g., within 5 tokens) preceded by another plural-form noun phrase, the pair is discarded.
  (3) Otherwise, the noun phrase is retained as the lexicalized category of the instance name N.
- Non-informative modifiers (the 20 most frequent, computed statistically in a post-processing phase over the entire set of categories) are dropped, giving "programming languages" rather than "other programming languages".
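The three cases can be sketched over pre-tagged tokens. Several details are assumptions rather than the paper's specification: Penn Treebank tags, the `tail` parameter that interprets "near the end", the tag set treated as noun-phrase modifiers, and the small `STOP_MODIFIERS` set standing in for the 20 statistically derived modifiers.

```python
PLURAL = {"NNS", "NNPS"}                        # plural-form noun tags (assumed tag set)
MODIFIER = {"JJ", "JJR", "JJS", "NN", "NNP"}    # tags allowed inside the noun phrase
STOP_MODIFIERS = {"other", "many", "several"}   # hypothetical stand-in list

def derive_category(tagged_fact, tail=1, window=5):
    """Return the lexicalized category for a tagged categorical fact, or None."""
    # Find the rightmost plural-form noun.
    head = next((i for i in range(len(tagged_fact) - 1, -1, -1)
                 if tagged_fact[i][1] in PLURAL), None)
    if head is None or len(tagged_fact) - 1 - head > tail:
        return None                              # case (1): nothing near the end
    # Extend leftwards over modifiers to get a non-recursive noun phrase.
    start = head
    while start > 0 and tagged_fact[start - 1][1] in MODIFIER:
        start -= 1
    preceding = tagged_fact[max(0, start - window):start]
    if any(tag in PLURAL for _, tag in preceding):
        return None                              # case (2): another plural phrase just before
    words = [w.lower() for w, _ in tagged_fact[start:head + 1]]
    while len(words) > 1 and words[0] in STOP_MODIFIERS:
        words.pop(0)                             # drop non-informative modifiers
    return " ".join(words)                       # case (3)

fact = [("That", "DT"), ("is", "VBZ"), ("because", "IN"),
        ("software", "NN"), ("firewalls", "NNS")]
print(derive_category(fact))   # -> software firewalls
```

With the fact `[("other", "JJ"), ("programming", "NN"), ("languages", "NNS")]`, the sketch returns "programming languages", matching the example on the slide.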

Lightweight Extraction Method - Derivation of categories

Extensions - Extraction of Similar Names
- To support multiple-name extraction, the patterns are slightly modified to match enumerations (N1[, N2, … and Nm]) in addition to single names (N).
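Once an enumeration has been matched, it has to be split into individual names. A minimal sketch of that step follows, assuming the enumeration is already isolated as a string; `split_enumeration` is a hypothetical helper.

```python
import re

def split_enumeration(enumeration):
    """Split 'N1, N2, ... and Nm' into a list of individual names."""
    # Separators: a comma (optionally followed by 'and'), or a bare 'and'.
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", enumeration)
    return [p for p in parts if p]

print(split_enumeration("Google, Yahoo and AltaVista"))
# -> ['Google', 'Yahoo', 'AltaVista']
```

A known limitation of this simplification is that names which themselves contain "and" are split incorrectly.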

Extensions - Automatically-Derived Patterns
- Goal: identify additional patterns to increase coverage.
- Pairs (C,N) of categories and names, extracted in the previous iteration, are matched back into text sentences.
- The form of a potential pattern:
- Duplicates of potential patterns that occur for the same category C and name N are discarded.
- The process then continues with the next iteration.

Categorized Names as Lexical Knowledge
- Each categorized name corresponds to an InstanceOf assertion between the name and the category.
- Find related categories
- Expand existing knowledge resources

Category Relatedness as Set Overlap
- An IsA semantic relation between two categories will be reflected in their sets of instance names.
- In practice, the instance name sets are partial rather than complete.
- Overlap between two sets indicates a strong relation between the corresponding categories.
- Empty or very small overlaps correspond to categories that are not directly related (search engines vs. languages).
- Medium or large overlaps indicate related categories (search engines vs. internet portals).
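The slide does not say which overlap measure is used, so the sketch below assumes the overlap coefficient (intersection size divided by the size of the smaller set), one reasonable choice when the name sets are partial. The function name and the example name sets are made up for illustration.

```python
def overlap_coefficient(names_a, names_b):
    """Share of the smaller name set that also appears in the other set."""
    a, b = set(names_a), set(names_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

# Illustrative, hand-written name sets (not data from the paper).
search_engines = {"Google", "Yahoo", "AltaVista"}
internet_portals = {"Yahoo", "AltaVista", "AOL", "MSN"}
languages = {"Java", "Python"}

print(overlap_coefficient(search_engines, internet_portals))  # 2 of 3 names shared
print(overlap_coefficient(search_engines, languages))         # -> 0.0
```

Normalizing by the smaller set means a category whose names are all contained in a larger one scores 1.0, which fits the IsA reading above.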

Category Relatedness as Set Overlap

Integration into Existing Knowledge Resources
- An immediate extension to any knowledge resource that organizes English concepts hierarchically is to map the extracted names into new InstanceOf assertions.
- Each new assertion corresponds to a new instance node linked to an existing node at the bottom of the hierarchy.
- The process is straightforward only if there is exactly one possible insertion point.
- In the general case, the name belongs to multiple categories, each of which matches zero, one, or multiple WordNet concepts.

Insertion Algorithm
- Map the extracted names into new InstanceOf assertions.
- The insertion algorithm links a name to at most one WordNet concept.
- The algorithm matches each category of the input name against WordNet concepts.
- If a category does not match any concept, its modifiers are discarded until one or more matches are found, e.g. "High-level programming languages" -> "programming language".
- Each pair of related categories is represented as an additional RelatedTo link in the existing hierarchical structure.
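The modifier-dropping step can be sketched as follows. This is not the paper's implementation: a plain set of concept strings stands in for WordNet, the singular form is obtained by naively stripping a final "s", and `match_concept` is a hypothetical name. Choosing among multiple matching concepts is not covered.

```python
def match_concept(category, concepts):
    """Drop leading modifiers from a category until it matches a known concept."""
    words = category.lower().split()
    if words and words[-1].endswith("s"):
        words[-1] = words[-1][:-1]       # naive singular form of the head noun
    while words:
        candidate = " ".join(words)
        if candidate in concepts:
            return candidate
        words = words[1:]                # discard the leftmost modifier
    return None

# Hypothetical concept inventory standing in for WordNet.
concepts = {"programming language", "language", "firewall"}
print(match_concept("High-level programming languages", concepts))
# -> programming language
```

Because modifiers are dropped one at a time from the left, the most specific matching concept is returned first.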

Applications to Web Search
- Processing list-type queries
  - When the input query matches a known category, the search results also include the top names as representative elements of that category
- Retrieval of siblings
  - Users may not know the name they are looking for but know a similar one
  - Retrieval of similar names, or sibling names, anchors the name in a set of possibly known names, thus guiding users in their search
- Query refinement suggestions
  - A set of related queries is generated and offered to the user as suggestions for refinement
  - Each related query consists of the name and one of its categories
  - If the input query is a known name
  - If the input query is a known category
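Query refinement can be sketched over a category-to-names mapping. The slide does not spell out the two cases, so the behaviour below is an assumption: a known name is paired with each of its categories, and a known category is paired with its top names. The function name, the `limit` parameter, and the sample data are hypothetical.

```python
def refinements(query, names_by_category, limit=3):
    """Suggest related queries, each made of a name and one of its categories."""
    q = query.lower()
    # Invert the mapping: lowercased name -> categories it was extracted for.
    categories_by_name = {}
    for category, names in names_by_category.items():
        for name in names:
            categories_by_name.setdefault(name.lower(), []).append(category)
    if q in categories_by_name:                  # the query is a known name
        return [f"{query} {c}" for c in categories_by_name[q]][:limit]
    if q in names_by_category:                   # the query is a known category
        return [f"{n} {q}" for n in names_by_category[q][:limit]]
    return []

# Illustrative data; keys are assumed to be lowercased category strings.
data = {"search engines": ["Google", "Yahoo"], "internet portals": ["Yahoo"]}
print(refinements("Yahoo", data))
# -> ['Yahoo search engines', 'Yahoo internet portals']
```

The category branch also covers list-type queries in a simple way, since it surfaces the top names of the matched category.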

Evaluation Results (1/2)
- Data: two document collections, Web news articles (NewsData, 12 million) and Web documents (WebData, 500 million)
- Results

Evaluation Results (2/2)
- Due to the diversity of the acquired categories, the authors do not currently have a complete qualitative evaluation in terms of precision and recall of all acquired names.
- The average precision is 88%, computed over the top 50 instance names of 20 randomly selected compound-noun categories.

Conclusion (1/2)
- This paper presented a lightly supervised method for accessing, decoding and exploiting a very small part of the information.
- The collected categories of names effectively fuse and summarize semantic relations detected within initially isolated documents.
- In addition to enhancing existing knowledge resources, the acquired categorized names enable novel Web search applications.
- In addition to increasing precision and recall, future work will explore other clues for finding candidate names.

Conclusion (2/2)
- Word capitalization is the only clue used to detect possible names in text.
  - Cannot cover names containing numbers
  - Does not generalize to other languages
- The extraction patterns used in the paper focus on categorical facts.
- Descriptive facts are a source of definitions when inserting names into existing knowledge resources like WordNet.