Automatic Set Instance Extraction using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh,

Presentation transcript:

Automatic Set Instance Extraction using the Web Richard C. Wang and William W. Cohen Language Technologies Institute Carnegie Mellon University Pittsburgh, PA USA

Language Technologies Institute, Carnegie Mellon University Richard C. Wang 2 / 21
Challenge
Discovering set instances, or hyponyms, of any given semantic class name
- x is a hyponym of y if x is a (kind of) y
“Failed Banks”, “Bags”, “Hair Styles”: these are real examples from our system described in this paper

Outline
- Background – SEAL
- Proposed Approach – ASIA
- Evaluation
- Conclusion

Background – SEAL
Set Expander for Any Language
- Wang & Cohen, ICDM 2007
An example of set expansion
- Given an input query (seeds): { survivor, amazing race }
- The output answer is: { american idol, big brother, ... }
A well-known set expansion (SE) system is Google Sets™

Background – SEAL
Features
- Independent of human language and markup language
  - Supports seeds in English, Chinese, Japanese, Korean, ...
  - Accepts documents in HTML, XML, SGML, TeX, WikiML, …
- Does not require pre-annotated training data
  - Utilizes a readily available corpus: the World Wide Web
Research contributions
- Automatically construct wrappers for extracting candidate items
- Rank candidates using random walk

SEAL’s Pipeline
- Fetcher: Download web pages containing all seeds
- Extractor: Construct wrappers for extracting candidate items
- Ranker: Rank candidate items using Random Walk
Example items: Canon, Nikon, Olympus, Pentax, Sony, Kodak, Minolta, Panasonic, Casio, Leica, Fuji, Samsung, …

Can you find common contexts that bracket every seed instance? I guess not! Let’s try our Extractor …
Our Extractor finds maximally long contexts that bracket at least one instance of every seed
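The idea of a wrapper as a pair of maximally long left and right contexts can be illustrated with a small Python sketch. This is not SEAL's actual implementation: the function names (`build_wrapper`, `extract`), the brute-force search over seed occurrences, and the toy page are all assumptions made for illustration.

```python
import itertools
import os
import re


def common_suffix(strings):
    """Longest string that every input ends with."""
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]


def build_wrapper(doc, seeds):
    """Return the longest (left, right) context pair that brackets at least
    one occurrence of every seed. Hypothetical brute-force search."""
    positions = [[m.start() for m in re.finditer(re.escape(s), doc)] for s in seeds]
    best = ("", "")
    # Try every way of choosing one occurrence per seed.
    for combo in itertools.product(*positions):
        left = common_suffix([doc[:p] for p in combo])
        right = os.path.commonprefix([doc[p + len(s):] for p, s in zip(combo, seeds)])
        if left and right and len(left) + len(right) > len(best[0]) + len(best[1]):
            best = (left, right)
    return best


def extract(doc, wrapper):
    """Apply a wrapper: return every string bracketed by its two contexts."""
    left, right = wrapper
    if not left or not right:
        return []
    # The right context is a lookahead so adjacent items may share characters.
    pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + ")"
    return re.findall(pattern, doc)


page = "<ul>\n<li>Canon</li>\n<li>Nikon</li>\n<li>Sony</li>\n<li>Kodak</li>\n</ul>"
wrapper = build_wrapper(page, ["Canon", "Kodak"])
print(wrapper)
print(extract(page, wrapper))
```

On this toy page the two seeds share the contexts `>\n<li>` and `</li>\n<`, and applying that wrapper also picks up the unseeded items Nikon and Sony.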


Proposed Approach – ASIA
Automatic Set Instance Acquirer (ASIA)
Components: Noisy Instance Provider, Noisy Instance Expander, Bootstrapper
Diagram labels: Semantic Class Name, Noisy Instances, Some Instances, More Instances

Noisy Instance Provider (NIP)
Manually constructed hyponym patterns based on Marti Hearst’s work in 1992
Query search engines for each hyponym pattern + a class name
- e.g. “car makers such as”
Extract all candidates I from returned web snippets
- A snippet often contains multiple excerpts
Rank each candidate i in I based on
- # of patterns, snippets, and excerpts containing i (more = better)
- # of characters between i and C in every excerpt (fewer = better)
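A rough Python sketch of the provider step follows. The three patterns, the separator regex, and the scoring rule (support count first, then mean character distance from the pattern) are assumptions chosen for illustration; the slides do not give the exact pattern list or scoring formula, and the snippets here are hard-coded rather than fetched from a search engine.

```python
import re
from collections import defaultdict

# Illustrative Hearst-style patterns; "{c}" is replaced by the class name.
PATTERNS = ["{c} such as", "{c} including", "{c} like"]
SEPARATOR = re.compile(r"\s*(?:,|\band\b|\bor\b)\s*")


def build_queries(class_name):
    """One quoted search query per hyponym pattern."""
    return ['"' + p.format(c=class_name) + '"' for p in PATTERNS]


def extract_candidates(snippet, phrase):
    """Return (candidate, offset) pairs for the list that follows `phrase`.
    The offset is the number of characters between the phrase and the item."""
    found = []
    for match in re.finditer(re.escape(phrase), snippet, flags=re.IGNORECASE):
        # The excerpt runs from the phrase to the end of the sentence.
        excerpt = re.split(r"[.;!?]", snippet[match.end():], maxsplit=1)[0]
        pos = 0
        for sep in list(SEPARATOR.finditer(excerpt)) + [None]:
            end = sep.start() if sep else len(excerpt)
            name = excerpt[pos:end].strip()
            if name:
                found.append((name, pos))
            pos = sep.end() if sep else end
    return found


def rank(class_name, snippets_by_query):
    """More supporting excerpts is better; a smaller mean offset breaks ties."""
    support = defaultdict(int)
    offsets = defaultdict(list)
    for pattern in PATTERNS:
        phrase = pattern.format(c=class_name)
        for snippet in snippets_by_query.get('"' + phrase + '"', []):
            for name, offset in extract_candidates(snippet, phrase):
                key = name.lower()
                support[key] += 1
                offsets[key].append(offset)
    return sorted(
        support,
        key=lambda k: (-support[k], sum(offsets[k]) / len(offsets[k]), k),
    )


snippets = {
    '"car makers such as"': [
        "Car makers such as Toyota, Honda and Ford.",
        "Big car makers such as Ford and Toyota.",
    ],
    '"car makers including"': ["Many car makers including Toyota, Kia."],
}
print(build_queries("car makers"))
print(rank("car makers", snippets))
```

Real snippets are much noisier than these, which is why the output of this stage is treated as a set of noisy instances to be cleaned up by the later stages.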

Noisy Instance Expander (NIE)
The Extractor in NIE is a variation of that used in SEAL
Performs set expansion on web pages queried by a class name + some list words
- List words are words that often appear on list-containing pages
- Example query: “car makers” (list OR names OR famous OR common)
SEAL’s Extractor vs. NIE’s Extractor
- SEAL’s Extractor: Requires the longest common contexts to bracket at least one instance of every seed per web page
- NIE’s Extractor: Requires the common contexts that bracket the most unique seeds to be as long as possible per web page
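The relaxed requirement matters because the seeds coming from the provider are noisy, so some of them will not appear in the list on a given page. The Python sketch below shows one possible reading of the NIE criterion: prefer the wrapper that brackets the most unique seeds, and among those the longest one. Generating candidate wrappers from pairs of seed occurrences is an assumption made to keep the example short, as are all function names.

```python
import os
import re
from itertools import combinations


def build_query(class_name, list_words):
    """Quoted class name followed by the OR-ed list words."""
    return '"{}" ({})'.format(class_name, " OR ".join(list_words))


def _common_suffix(a, b):
    return os.path.commonprefix([a[::-1], b[::-1]])[::-1]


def candidate_wrappers(doc, seeds):
    """Context pairs shared by occurrences of two different seeds."""
    occurrences = [
        (m.start(), s) for s in seeds for m in re.finditer(re.escape(s), doc)
    ]
    wrappers = set()
    for (p1, s1), (p2, s2) in combinations(occurrences, 2):
        if s1 == s2:
            continue
        left = _common_suffix(doc[:p1], doc[:p2])
        right = os.path.commonprefix([doc[p1 + len(s1):], doc[p2 + len(s2):]])
        if left and right:
            wrappers.add((left, right))
    return wrappers


def bracketed_seeds(doc, wrapper, seeds):
    """Unique seeds that occur between the wrapper's two contexts."""
    left, right = wrapper
    return {s for s in seeds if left + s + right in doc}


def best_wrapper(doc, seeds):
    """Most unique seeds bracketed first, then longest total context."""
    def score(w):
        return (len(bracketed_seeds(doc, w, seeds)), len(w[0]) + len(w[1]))
    return max(sorted(candidate_wrappers(doc, seeds)), key=score, default=None)


page = "<p>Banana split</p><ul><li>Toyota</li><li>Honda</li><li>Ford</li></ul>"
noisy_seeds = ["Toyota", "Honda", "Banana"]
print(build_query("car makers", ["list", "names", "famous", "common"]))
print(best_wrapper(page, noisy_seeds))
```

Here "Banana" is a deliberately wrong seed. Requiring every seed to be bracketed would leave no usable wrapper on this page, while the relaxed criterion still recovers the list contexts shared by Toyota and Honda.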

Bootstrapper
An iterative version of SEAL (iSEAL)
- Wang & Cohen, ICDM 2008
iSEAL makes several calls to SEAL. In each call, iSEAL…
- expands a few seeds, and
- aggregates statistics
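The loop structure described here can be sketched in Python as follows. The seed-selection rule (initial seeds first, then the top-ranked unused candidates), the additive score aggregation, and the `toy_expand` stand-in for a SEAL call are all assumptions for illustration; the slides do not specify iSEAL's seeding strategies or its ranker.

```python
from collections import Counter


def bootstrap(initial_seeds, expand, iterations=3, seeds_per_call=2):
    """Call `expand` several times, each time on a few unused seeds,
    and accumulate the scores it returns.

    `expand` is a hypothetical stand-in for one SEAL call: it takes a list
    of seeds and returns a mapping from candidate item to score.
    """
    stats = Counter()   # aggregated statistics across calls
    used = []           # seeds already expanded
    for _ in range(iterations):
        ranked = [item for item, _ in stats.most_common()]
        # Prefer the initial seeds, then the best candidates found so far.
        fresh = [s for s in list(initial_seeds) + ranked if s not in used]
        seeds = list(dict.fromkeys(fresh))[:seeds_per_call]
        if not seeds:
            break
        used.extend(seeds)
        for item, score in expand(seeds).items():
            stats[item] += score
    return [item for item, _ in stats.most_common()], used


# Toy co-occurrence table standing in for web pages.
TOY_LISTS = {
    "survivor": {"amazing race": 1, "big brother": 1},
    "amazing race": {"survivor": 1, "american idol": 1},
    "big brother": {"american idol": 1, "survivor": 1},
}


def toy_expand(seeds):
    scores = Counter()
    for seed in seeds:
        scores.update(TOY_LISTS.get(seed, {}))
    return scores


ranking, used_seeds = bootstrap(["survivor"], toy_expand, seeds_per_call=1)
print(used_seeds)
print(ranking)
```

Items that keep reappearing across calls accumulate higher scores, which is the intuition behind aggregating statistics rather than trusting any single expansion.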

Bootstrapper
Figure labels: Initial Seeds, Used Seeds


Evaluation Datasets
36 datasets; the class name of each dataset is used as input to ASIA

Evaluation Results

Comparison to: Kozareva, Riloff, and Hovy, ACL 2008
Input to Kozareva: a class name + a seed

Comparison to: Snow et al., ACL 2006
Definition:
- Original WN – WordNet 2.1
- Extended WN – Snow’s (+30K) extension of WN 2.1
Selecting semantic classes for evaluation:
- In Extended WN hierarchy, focus on leaf semantic classes extended by Snow that have ≥ 3 hyponyms
- Filter out those classes if the hyponyms from ASIA do not overlap with more than half of the hyponyms in the Original WN
- Randomly select a dozen remaining classes

Comparison to: Snow et al., ACL 2006

Conclusion
ASIA is nearly language-independent
- Can be easily extended to support other languages by adding a few hyponym patterns
ASIA outperforms other English systems
- Even though some of those use more input than just a semantic class name
ASIA is quite efficient
- Requiring only a few seconds per problem on a single-CPU machine

The End – Thank You!
Try out Boo!Wa! at
Send any feedback to:

Evaluation Method
Evaluation metric: Mean Average Precision
- Contains recall and precision-oriented aspects
- Sensitive to the entire ranking
Evaluation procedure:
- Input a semantic class name to ASIA
- Compute MAP for the output list

Comparison to Pasca, CIKM 2007

Evaluation Method
Mean Average Precision
- Commonly used for evaluating ranked lists in IR
- Contains recall and precision-oriented aspects
- Sensitive to the entire ranking
- Mean of average precisions for each ranked list
[Average precision formula] where L = ranked list of extracted items, r = rank
- If a list contains multiple synonyms of an entity e, then we only evaluate e once.
- A binary function that returns 1 iff (a) and (b) are true:
  (a) Synonym at r is correct
  (b) It’s the highest-ranked synonym of its entity in the list
Evaluation Procedure (per combination of iterative process, seeding strategy, and ranker – 20 in total)
1. Perform 10 iterative expansions on each of the 36 datasets 3 times
2. At each iteration, compute MAP for the 108 (3 x 36) ranked lists
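The synonym rule can be made concrete with a short Python sketch of average precision. Two details are assumptions, since the slide's formula did not survive: precision at rank r is taken as the number of distinct correct entities in the top r divided by r, and the sum is normalised by the total number of true entities, as in the standard definition of average precision. The gold mapping and function names are hypothetical.

```python
def average_precision(ranked, synonym_to_entity, num_true_entities=None):
    """Average precision where each entity is credited only once.

    `synonym_to_entity` maps every correct surface form to its entity id.
    An item counts only if it is correct and is the highest-ranked synonym
    of its entity (conditions (a) and (b) above).
    """
    seen = set()
    total = 0.0
    for r, item in enumerate(ranked, start=1):
        entity = synonym_to_entity.get(item)
        if entity is not None and entity not in seen:
            seen.add(entity)
            total += len(seen) / r   # assumed definition of precision at rank r
    n = num_true_entities or len(set(synonym_to_entity.values()))
    return total / n if n else 0.0


def mean_average_precision(ranked_lists, synonym_to_entity):
    """Mean of the average precisions of several ranked lists."""
    if not ranked_lists:
        return 0.0
    aps = [average_precision(lst, synonym_to_entity) for lst in ranked_lists]
    return sum(aps) / len(aps)


gold = {"usa": "US", "united states": "US", "canada": "CA", "mexico": "MX"}
run = ["usa", "united states", "france", "canada"]
print(average_precision(run, gold))
print(mean_average_precision([run, ["mexico"]], gold))
```

In the example run, "united states" at rank 2 earns no credit because "usa" already covered the same entity at rank 1, and "france" is simply incorrect.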