1 Corroborate and Learn Facts from the Web Advisor: Dr. Koh Jia-Ling Speaker: Tu Yi-Lang Date: 2008.03.06 Shubin Zhao, Jonathan Betz (KDD '07)

2 Introduction  The web contains lots of interesting factual information about entities, such as celebrities, movies or products.  If we can collect these facts and provide a way to search them, it would be very helpful for answering questions or for improving web search in general.

3 Introduction  For example:

4 Introduction  Facts are represented in the form of attribute-value pairs.  For instance, “Birthday: June 4, 1975” and “Birth Name: Angelina Jolie Voight” are two facts for the entity “Angelina Jolie”.

5 MapReduce  The system described in this paper is called GRAZER, and MapReduce is the computing model the GRAZER system is based on.  MapReduce (OSDI '04) is a programming model for processing large data sets in parallel.  A MapReduce job is composed of mappers and reducers.

6 MapReduce  Mapper :  Mappers take the input data as key-value pairs and emit intermediate key-value pairs.  The intermediate data is sorted by key and then shuffled to reducers.  Reducer :  Each reducer processes the key-value pairs again and outputs new values.  Intermediate values with the same key are always processed in one reduce step by a single reducer.
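
As a concrete illustration of this contract, here is a minimal in-memory sketch (word count as the usual toy example; this is illustrative only, not the GRAZER code):

from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    # Map phase: every input record may emit zero or more (key, value) pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            intermediate[out_key].append(out_value)
    # The shuffle/sort step is modeled by the grouping above; reduce runs per key.
    return {key: reducer(key, values) for key, values in intermediate.items()}

def word_count_mapper(doc_id, text):
    for word in text.split():
        yield word, 1

def word_count_reducer(word, counts):
    return sum(counts)

docs = [("d1", "facts on the web"), ("d2", "learn facts from the web")]
print(run_mapreduce(docs, word_count_mapper, word_count_reducer))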

7 The Grazer System  Definition :  Fact: an attribute-value pair with a list of sources (URLs) where the fact is mentioned.  Entity: formed by a list of facts.  Relevant page: a page that is relevant to an entity.  Pattern: any contiguous HTML tag sequence that repeats at least twice in a page.  Pattern instance: each repetition of a pattern.

8 The Grazer System  System Overview :

9 The Grazer System  Generate Initial Facts :  In this paper, initial facts are generated by scraping en.wikipedia.org using manually-generated scripts.  Wikipedia facts are a good source because Wikipedia covers many knowledge domains, and the algorithm will corroborate and learn facts for each domain.

10 Retrieve Relevant Pages  The solution in GRAZER is to match the anchor text of a page with entity names.  To improve the precision of the result, other heuristics can also be used.  One heuristic is to require the name to appear in the page title.  Another is to require the name to appear in a salient position in a page, such as in a heading.

11 Retrieve Relevant Pages  The page retrieval algorithm is implemented as a MapReduce.  If the name can be found in the page, the mapper outputs the entity name as the key and the content of the page as the value.  The reducer simply combines the pages for the same key (entity name) into a list.

12 Retrieve Relevant Pages  The MapReduce algorithm is sketched below.  Sometimes different entities share the same name, e.g. “Independence Day” can be a movie or a holiday.  For these entities, all the relevant pages are mixed together, as they are indexed by the same entity name.
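
A sketch of the page-retrieval step as mapper/reducer functions. The in-memory ENTITY_NAMES set and the title-only matching heuristic are simplifying assumptions; the real system matches anchor text and can combine several heuristics.

import re

ENTITY_NAMES = {"angelina jolie", "independence day"}  # assumed seed entity names

def page_retrieval_mapper(url, page_html):
    # Simplified heuristic: require the entity name to appear in the page title.
    match = re.search(r"<title>(.*?)</title>", page_html, re.I | re.S)
    title = match.group(1).lower() if match else ""
    for name in ENTITY_NAMES:
        if name in title:
            yield name, (url, page_html)

def page_retrieval_reducer(entity_name, pages):
    # Combine all relevant pages for the same entity into one list.
    return list(pages)

These two functions can be plugged into the run_mapreduce helper from the earlier sketch.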

13 Corroborate Known Facts  Corroboration can be wrong for common facts such as gender, which has only two values, “male” and “female”.  To avoid this kind of error, the system computes the probability p that all the fact values would appear randomly in a page, given their attribute names.
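
One illustrative way to realize such a check (not necessarily the paper's exact formula): estimate how often each matched value occurs in a random page and accept the corroboration only when the joint chance probability p is small. The value_doc_freq statistics and the threshold are assumptions.

def chance_probability(matched_values, value_doc_freq, total_docs):
    # value_doc_freq maps a value string to the number of pages containing it.
    p = 1.0
    for value in matched_values:
        p *= value_doc_freq.get(value, 0) / total_docs
    return p

def is_reliable_corroboration(matched_values, value_doc_freq, total_docs, threshold=0.01):
    # "male" appears in almost every biography page, so p stays high and the match
    # alone is weak evidence; a rare value such as "June 4, 1975" drives p down.
    return chance_probability(matched_values, value_doc_freq, total_docs) < threshold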

14 Corroborate Known Facts  Corroboration Strategies :  (1) Values of a fact tend to vary, e.g. “June 4, 1975”, “4-June-1975” and “06/04/1975”.  (2) Synonyms of attribute names can be used, e.g. “Date of Birth” and “Birthdate” for the attribute “Birthday”.  (3) For some facts, the attribute name does not appear explicitly in the text.
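
A sketch of strategies (1) and (2): normalize date-like values so that variants such as "June 4, 1975", "4-June-1975" and "06/04/1975" compare equal, and map attribute-name synonyms onto one canonical attribute. The date formats and synonym table here are illustrative assumptions.

from datetime import datetime

DATE_FORMATS = ["%B %d, %Y", "%d-%B-%Y", "%m/%d/%Y"]
ATTRIBUTE_SYNONYMS = {"date of birth": "birthday", "birthdate": "birthday"}

def normalize_value(value):
    # Try each known date format; fall back to case-folded text for other values.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value.strip().lower()

def normalize_attribute(name):
    name = name.strip().lower()
    return ATTRIBUTE_SYNONYMS.get(name, name)

assert normalize_value("June 4, 1975") == normalize_value("06/04/1975")
assert normalize_attribute("Date of Birth") == normalize_attribute("Birthday")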

15 Corroborate Known Facts  Parallelization :  The corroboration algorithm needs to corroborate each fact of the entity in the relevant pages.  This is also implemented as a MapReduce: the mapper corroborates an entity's facts against each relevant page, and the reducer just passes its input key-value pairs through to the output.

16 Extract New Facts  Pattern discovery :  Pattern discovery is applied to the enclosing node of examples in structured text to find repeated HTML patterns.  It tries to find clusters of nodes under the same parent that have similar HTML format (or tags).
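
A sketch of this pattern-discovery idea: group the children of a parent node by their HTML tag signature; any signature that repeats at least twice is a pattern, and each repetition is a pattern instance. BeautifulSoup is used here only for convenience, and the choice of tag signature is an assumption.

from collections import defaultdict
from bs4 import BeautifulSoup, Tag

def tag_signature(node):
    # The contiguous tag sequence of the subtree, e.g. ("tr", "td", "td").
    return tuple(t.name for t in [node] + node.find_all())

def discover_patterns(parent):
    clusters = defaultdict(list)
    for child in parent.children:
        if isinstance(child, Tag):
            clusters[tag_signature(child)].append(child)
    # Keep only signatures that repeat, i.e. patterns with at least two instances.
    return {sig: nodes for sig, nodes in clusters.items() if len(nodes) >= 2}

html = ("<table><tr><td>Birthday</td><td>June 4, 1975</td></tr>"
        "<tr><td>Birth Name</td><td>Angelina Jolie Voight</td></tr></table>")
table = BeautifulSoup(html, "html.parser").table
patterns = discover_patterns(table)  # one pattern ("tr", "td", "td") with 2 instances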

17 Extract New Facts  If multiple patterns are found under the same parent, the pattern with the largest span is preferred: when the same HTML text matches both a longer repeated tag sequence and a shorter sub-sequence of it, the longer one is chosen as the pattern.  If a pattern can be matched and one of its instances contains an example fact, the extraction process will start to extract facts from it.

18 Extract New Facts  Fact extraction :  If we can find an HTML pattern in which one pattern instance contains an example fact, it is likely that other pattern instances also contain facts about the same entity.  The pattern instance that contains the corroborated fact is referred to as PIe.  If the number of textblocks in PIe is at least two: One of them contains the example attribute. Another one contains the example value.

19 Extract New Facts  The positions of the two corresponding textblocks in PIe are recorded, e.g. the attribute name appears in the first textblock (attribute index = 1) and the value in the second (value index = 2).  When extracting from other pattern instances, we require that they have the same number of textblocks as PIe.  New facts are then created from the extracted attribute-value pairs.
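
A sketch of this extraction step, building on the pattern instances from the pattern-discovery sketch above (the helper names are assumptions): find PIe, record the attribute and value positions, then read the same positions out of every other instance with the same number of textblocks.

def textblocks(instance_node):
    # The visible text chunks of one pattern instance, in document order.
    return [s.strip() for s in instance_node.stripped_strings]

def extract_new_facts(instances, known_attribute, known_value):
    pie, attr_idx, value_idx = None, None, None
    for inst in instances:  # locate PIe and the two textblock positions
        blocks = textblocks(inst)
        if known_attribute in blocks and known_value in blocks:
            pie = inst
            attr_idx = blocks.index(known_attribute)
            value_idx = blocks.index(known_value)
            break
    if pie is None:
        return []
    new_facts = []
    for inst in instances:
        blocks = textblocks(inst)
        # Other instances must have the same number of textblocks as PIe.
        if inst is not pie and len(blocks) == len(textblocks(pie)):
            new_facts.append((blocks[attr_idx], blocks[value_idx]))
    return new_facts

With the table from the earlier sketch, extract_new_facts(patterns[("tr", "td", "td")], "Birthday", "June 4, 1975") yields [("Birth Name", "Angelina Jolie Voight")].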

20 Extract New Facts  Bootstrapping :  New facts extracted from a page are added to the seed entity.  Both the new facts and the original facts will be taken as seeds for corroboration in the following pages.

21 Extract New Facts  The bootstrapping process described here converges very well.  This is because incorrect facts extracted from one page are unlikely to be corroborated in other pages.  Therefore the chance of error propagation is small.  In practice, we can stop the algorithm when no more new facts can be extracted from any page.
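
A sketch of the per-entity bootstrapping loop, assuming hypothetical helpers corroborate_facts(page, facts) and extract_facts(page, corroborated) that wrap the corroboration and pattern-based extraction steps described above.

def bootstrap_entity(seed_facts, relevant_pages, corroborate_facts, extract_facts):
    facts = set(seed_facts)  # (attribute, value) pairs
    while True:
        new_facts = set()
        for page in relevant_pages:
            corroborated = corroborate_facts(page, facts)
            for fact in extract_facts(page, corroborated):
                if fact not in facts:
                    new_facts.add(fact)
        if not new_facts:  # stop when no page yields anything new
            return facts
        facts |= new_facts  # new facts become seeds for the next round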

22 Extract New Facts  Fact selection :  The result of GRAZER is an enlarged set of known facts.  All corroborated facts have at least two sources, while uncorroborated facts have only one source.  In general, facts with more sources are more reliable.  To determine the quality of facts, other signals can also be used, such as the reliability and diversity of the sources.
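
A sketch of fact selection by source count alone; fact_sources maps an (attribute, value) pair to the set of URLs where it was seen, and the min_sources threshold is an assumption.

def select_facts(fact_sources, min_sources=2):
    # Rank facts by how many distinct sources mention them; keep corroborated ones.
    ranked = sorted(fact_sources.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [(fact, len(urls)) for fact, urls in ranked if len(urls) >= min_sources]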

23 Experiment  Experiment on Wikipedia Facts :  Seed facts : Entity names are extracted from the first sentence of each page. This generated 1.75 million entities and 12.6 million facts.

24 Experiment  Relevant Pages : The x-axis represents entities sorted by the number of relevant pages. The y-axis is the number of relevant pages on a logarithmic scale.

25 Experiment  Learning results : For entities with lots of relevant pages, bootstrapping with more rounds would generate more facts. The web contains much redundant information about films, famous people and books.

26 Experiment  It is difficult to evaluate all the learning results because they involve millions of webpages and facts.  We think most of the facts do not get corroborated for two reasons :  (1) There is less redundancy on the web about less popular entities.  (2) Many facts are specific to Wikipedia, and we cannot find them in other sites by shallow string matching.

27 Conclusion  The paper presents an approach to find relevant pages about entities and extract new facts from them by corroborating existing facts.  Fact corroboration and extraction of each entity is a bootstrapping process which terminates well within a few learning rounds.

28  The algorithm is based on string matching and HTML pattern discovery, so it is language-independent.  The corroboration could instead be based on the semantic values of facts.  This should be a better way to handle value variations, but it is language-dependent.