Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.

Presentation transcript:

Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore

Basic premise: The Web is a vast repository of human knowledge

Diverse information spanning multiple verticals: Wikipedia, Product, Business, People, …

Grand challenge: Mine the Web to build knowledge bases (KBs) of people, places, things, events, …
Example table (Name, Address, Phone): Chinese Mirch; 120 Lexington Ave (between 28th St & 29th St) New York, NY; (212)
Example table (Camera, Aspect Ratio, Megapixels): Canon Powershot 600; 4:3; 0.5. Olympus D-300L; 4:3; 0.8
Example table (Product Name, List Price, Sale Price): Apple iPod nano 8 GB Black (5th Generation); $145.00; $
Example table (Name, Affiliation, # connections): Rajeev Rastogi; Yahoo! Labs Bangalore; 142

What did search look like in the past?

Search results of the future: Structured abstracts yelp.com babycenter epicurious answers.com LinkedIn webmd New York Times Gawker

Rank by price Comparison shopping

Product near me

Topic entity pages: Celebrity, Music, Videos, Related Topics. A topic-based page automatically generated in real time. Relevant multimedia content including music, videos, information from Wikipedia, etc. Up to the minute: latest info using news feeds, blogs, Twitter, Flickr to stay up to date on Madonna

Building KBs from the Web is a hard problem: billions of pages with diverse structure, conflicting information, noise. Examples: yelp.com, superpages.com

Page content/structure changes constantly (old vs. new page): ~2% of sites change each day

KB creation pipeline. Content acquisition: acquire content from the Web. Information extraction: extract structured data for entities from Web pages. Disambiguation & integration: identify and integrate data for each entity. Figure labels: Roma Bistro Paris

IE example. Fields on the page: Name, Address, Cuisine, Phone, Price, Reviews. Extracted record (Name, Address, Phone): Chinese Mirch; 120 Lexington Ave New York, NY; (212)

Template-based Web pages: from head/torso sites. Pages have similar structure. ~30% of crawled Web pages. Information rich: 31% of search results

Hand-crafted pages: mainly from tail sites. Pages have diverse structures

Browse pages: similar-structured records

Unstructured text

Web extraction landscape (diagram). Signals: structure (site structure, page structure) and content (content redundancy, content features, context). Techniques: pattern-based, wrapper, record identification, content matching, machine learning models. Page types: unstructured text, template-based pages, hand-crafted and browse pages. Systems cited: Snowball [AG 00], HCRF [ZNWZM 06], MLN [YCWZZM 09], RoadRunner [CMM 01], DEPTA [ZL 05], [KWD 97], [MMK 99], [GRST 10]

Web extraction landscape (diagram, repeated). Signals: structure (site structure, page structure) and content (content redundancy, content features, context). Techniques: pattern-based, wrapper, record identification, content matching, ML models. Page types: unstructured text, template-based pages, hand-crafted and browse pages

Wrapper induction: technique for extraction from template-based pages. Learn: cluster Website pages, annotate sample pages, learn rules (XPath rules) from the annotations. Extract: apply rules to Website pages to produce records. Monitor: monitor rules to detect site change

Clustering pages Group structurally similar pages using shingle signatures

Page shingle signature. Tags: html div /div /textarea … br/ /body /html. Steps shown: tags, windows, hash, min. Page signature: vector of shingles
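The slide only names the steps (tags, windows, hash, min), so the following is a minimal sketch of one plausible reading: a min-hash over sliding windows of tag names. The window size, the number of hash functions, the use of SHA-1, and all function names are assumptions for illustration, not details from the deck.

```python
import hashlib
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the sequence of opening and closing tag names in a page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags


def shingle_signature(html, window=4, num_hashes=8):
    """Return a vector of shingles: one min-hash per (assumed) hash function."""
    tags = tag_sequence(html)
    # All contiguous windows of `window` tags; a very short page yields one window.
    windows = [" ".join(tags[i:i + window])
               for i in range(max(1, len(tags) - window + 1))]
    signature = []
    for seed in range(num_hashes):
        # Seeding the hash input simulates a family of independent hash functions.
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{w}".encode()).digest()[:8], "big")
            for w in windows))
    return tuple(signature)


def signature_similarity(a, b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Because only tag names are hashed, two pages generated from the same template but carrying different text receive the same signature, which is what makes the signature usable for grouping structurally similar pages.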

Rule learning XPath Generalization

Learning robust XPaths. Goal: the most general XPath that matches all the annotated values and none of the unannotated values. Start from the most general XPath (//*) and SPECIALIZE (//h1, //span, …). Use Apriori to generate candidate XPaths
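As a rough illustration of the general-to-specific search, the sketch below scans a short, fixed list of candidates and returns the first one that selects exactly the annotated node on every sample page. It is a simplification: the deck generates candidates with Apriori, whereas `candidate_xpaths` and `learn_xpath` here are hypothetical helpers that only try the wildcard, the tag name, and single-attribute predicates, using the limited XPath subset in Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def candidate_xpaths(node):
    """Yield candidates for one annotated node, most general first."""
    yield ".//*"
    yield f".//{node.tag}"
    for name, value in sorted(node.attrib.items()):
        # Assumes attribute values contain no single quotes.
        yield f".//{node.tag}[@{name}='{value}']"


def learn_xpath(pages):
    """pages: list of (root, annotated_node) pairs from one page cluster.

    Returns the first (most general) candidate that matches the annotated
    node and nothing else on every page, or None if no candidate works.
    """
    _, first_node = pages[0]
    for xpath in candidate_xpaths(first_node):
        if all(root.findall(xpath) == [node] for root, node in pages):
            return xpath
    return None
```

Calling `learn_xpath` with pages where the address sits in `<span class='addr'>` next to other spans would skip `.//*` and `.//span` (both match unannotated nodes) and settle on `.//span[@class='addr']`.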

Detecting site changes. During Learn: for each cluster, store the page signature and extracted record for a small number of pages. Monitoring: crawl the pages daily and compare page signatures and extracted records. Timeline: Day 0 through Day n, signature and record match; Day m, signature/record mismatch
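The monitoring step amounts to comparing today's crawl against the stored baseline. A minimal sketch follows, assuming the baseline is a dictionary from URL to a (signature, record) pair; the `min_fraction` threshold for declaring a site change is a hypothetical parameter, since the deck does not say how many mismatching pages trigger relearning.

```python
def changed_pages(baseline, today):
    """Return URLs whose signature or extracted record differs from the baseline.

    baseline, today: dict mapping url -> (page_signature, extracted_record).
    """
    changed = []
    for url, (old_sig, old_record) in baseline.items():
        if url not in today:
            continue  # page was not re-crawled today; no evidence either way
        new_sig, new_record = today[url]
        if new_sig != old_sig or new_record != old_record:
            changed.append(url)
    return changed


def site_changed(baseline, today, min_fraction=0.5):
    """Flag a site change when enough monitored pages mismatch (assumed rule)."""
    compared = [url for url in baseline if url in today]
    if not compared:
        return False
    return len(changed_pages(baseline, today)) / len(compared) >= min_fraction
```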

Wrapper system deployed in Yahoo! 250M extractions from 200 sites (product, business) Avg num of clusters per site: 24 Avg num of pages annotated per cluster: 1.6

Limitations of wrappers. Won't work across Web sites due to different page layouts. Scaling to thousands of sites can be a challenge: need to learn a separate wrapper for each site; annotating example pages from thousands of sites can be time-consuming & expensive

Holy grail of IE research. Unsupervised IE: extract attribute values from pages of a new Web site without annotating a single page from the site. OK to annotate pages from a few sites initially to create training data

Web extraction landscape (diagram, repeated). Signals: structure (site structure, page structure) and content (content redundancy, content features, context). Techniques: pattern-based, wrapper, record identification, content matching, ML models. Page types: unstructured text, template-based pages, hand-crafted and browse pages

Key observation: Web sites contain redundant content (that is, pages for the same entity). Examples: yelp.com, superpages.com

Content matching approach. Step 1: Populate seed database from a few initial sites (using wrappers). Seed DB (Name, Address): Chinese Mirrch; 120 Lexington Ave, New York, NY. Tiffin Wallah; 127 E 28th St New York, NY

Content matching approach. Step 2: Match values in a Web page of the new site with seed record values. Seed DB (Name, Address): Chinese Mirrch; 120 Lexington Ave, New York, NY. Tiffin Wallah; 127 E 28th St New York, NY

Content matching approach. Step 3: Use matched values to extract records from the new site's Web pages (wrappers), expand seed database. Seed DB (Name, Address): Chinese Mirrch; 120 Lexington Ave, New York, NY. Tiffin Wallah; 127 E 28th St New York, NY. New record: Club; 21 W 52nd St New York, NY

Key challenge 1: Diverse attribute value representations (impacts recall), for example a spelling error or a variant. Seed DB (Name, Address): Chinese Mirrch; 120 Lexington Ave, New York, NY. Tiffin Wallah; 127 E 28th St New York, NY

Key challenge 2: Noisy attribute value matches (impacts precision). Seed DB (Name, Address): Chinese Mirrch; 120 Lexington Ave, New York, NY. Tiffin Wallah; 127 E 28th St New York, NY

Baseline similarity measure. Use q-grams to handle spelling errors. Weak similarity (WS) = cosine similarity between IDF-weighted q-grams. Weight of a q-gram (attribute-specific) = sum of the IDFs of the words it appears in. Example 3-grams: "chinese mirch" gives {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch}; "chinese mirrch" gives {chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch}
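The weak similarity can be sketched as below. Several details are assumptions not stated on the slide: the smoothed IDF formula, crediting a q-gram that spans a word boundary to every word it overlaps, and accumulating weight when a q-gram occurs more than once. The names `make_idf`, `weighted_qgrams`, and `weak_similarity` are hypothetical.

```python
import math
from collections import Counter


def qgrams(text, q=3):
    """q-grams of a string, with '#' standing in for word boundaries."""
    s = "#".join(text.lower().split())
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def make_idf(values):
    """Build an attribute-specific IDF function from a list of attribute values."""
    n = len(values)
    df = Counter(w for v in values for w in set(v.lower().split()))
    # Smoothed IDF (an assumption); unseen words get the largest weight.
    return lambda word: math.log((1 + n) / (1 + df[word])) + 1.0


def weighted_qgrams(text, idf, q=3):
    """Map each q-gram to the summed IDF of the words it overlaps."""
    words = text.lower().split()
    s = "#".join(words)
    owner = []  # owner[p] = index of the word covering character p, None for '#'
    for i, w in enumerate(words):
        if i:
            owner.append(None)
        owner.extend([i] * len(w))
    weights = {}
    for start in range(len(s) - q + 1):
        gram = s[start:start + q]
        touched = {owner[p] for p in range(start, start + q)} - {None}
        weights[gram] = weights.get(gram, 0.0) + sum(idf(words[i]) for i in touched)
    return weights


def weak_similarity(a, b, idf, q=3):
    """Cosine similarity between the IDF-weighted q-gram vectors of a and b."""
    wa, wb = weighted_qgrams(a, idf, q), weighted_qgrams(b, idf, q)
    dot = sum(wa[g] * wb[g] for g in wa.keys() & wb.keys())
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

With an IDF built from the seed names, `weak_similarity("chinese mirch", "chinese mirrch", idf)` stays high because the two strings share most of their 3-grams, while an unrelated name such as "tiffin wallah" shares none and scores 0.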

Strong similarity, motivated by templatized content. Example pairs (Address in Seed DB vs. Address on Web site, with WS score): 120 Lexington Avenue New York, NY vs. Lexington Ave (between 28th and 29th St) New York, NY; W 34th Street New York, NY vs. W 34th St (between 8th and 9th Ave) New York, NY. Strong similarity is defined between two sets of strings: 1. Calculate the matching pattern between weakly similar pairs in the two sets. 2. Pick matching patterns with sufficient support. 3. Use only portions selected by the matching pattern in the final similarity calculation

Computing matching pattern. Example: 120 Lexington Avenue New York NY vs. Lexington Ave (Between 28th And 29th St) New York NY, each split into segments s1, s2, s3. 1. Perform max-weight bipartite matching to find matching words; edge weight = Jaccard similarity over 3-grams. 2. Form segments by grouping contiguous matching words. 3. Assign each segment s_i a label: 0 if non-matching, j if it matches segment s_j. Matching pattern: 103
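The three steps can be sketched as follows. Two substitutions are worth flagging: a greedy one-to-one matching stands in for true max-weight bipartite matching, and the `min_weight` cutoff below which words count as non-matching is an assumed parameter. The house number "120" on the Web-site string is taken from the address shown earlier in the deck; all function names are hypothetical.

```python
import re
from itertools import groupby


def trigram_set(word, q=3):
    word = word.lower()
    if len(word) < q:
        return {word}  # short words act as their own single gram
    return {word[i:i + q] for i in range(len(word) - q + 1)}


def jaccard(a, b):
    ga, gb = trigram_set(a), trigram_set(b)
    return len(ga & gb) / len(ga | gb)


def match_words(seed_words, site_words, min_weight=0.3):
    """Greedy one-to-one word matching; returns {site_index: seed_index}."""
    edges = sorted(
        ((jaccard(s, t), i, j)
         for i, s in enumerate(seed_words)
         for j, t in enumerate(site_words)),
        key=lambda e: (-e[0], e[1], e[2]))
    used_seed, site_to_seed = set(), {}
    for weight, i, j in edges:
        if weight < min_weight:
            break  # remaining edges are too weak to count as matches
        if i not in used_seed and j not in site_to_seed:
            used_seed.add(i)
            site_to_seed[j] = i
    return site_to_seed


def runs(flags):
    """Split positions into maximal runs with the same matched/unmatched flag."""
    return [(flag, [i for i, _ in group])
            for flag, group in groupby(enumerate(flags), key=lambda p: p[1])]


def matching_pattern(seed, site, min_weight=0.3):
    """Return (pattern, site_segments); label 0 marks a non-matching segment."""
    seed_words = re.findall(r"\w+", seed.lower())
    site_words = re.findall(r"\w+", site.lower())
    site_to_seed = match_words(seed_words, site_words, min_weight)
    matched_seed = set(site_to_seed.values())
    seed_runs = runs([i in matched_seed for i in range(len(seed_words))])
    segment_of = {i: k for k, (_, idxs) in enumerate(seed_runs, start=1) for i in idxs}
    pattern, segments = [], []
    for flag, idxs in runs([j in site_to_seed for j in range(len(site_words))]):
        pattern.append(segment_of[site_to_seed[idxs[0]]] if flag else 0)
        segments.append(" ".join(site_words[j] for j in idxs))
    return pattern, segments
```

On the slide's example, "Avenue" and "Ave" have a 3-gram Jaccard of 0.25 and fall below the assumed cutoff, so the seed string splits into three segments and the Web-site string receives the labels 1, 0, 3, which agrees with the pattern 103 shown on the slide.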

Strong similarity score computation. Columns: Address (Seed), Address (Web site), Matching pattern, SS, Matching segments. Row 1: 120 Lexington Avenue New York, NY; Lexington Ave (between 28th and 29th St) New York, NY; matching segments: Lexington New York, NY. Row 2: W 34th Street New York, NY; W 34th St (between 8th and 9th Ave) New York, NY; matching segments: W 34th New York, NY. Strong similarity: similarity between matching segments of values. Support of matching pattern: # distinct matching segments. Support( ) = 2. Strong similarity only computed for patterns with support
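A minimal sketch of the support check and the resulting score, assuming the matching pattern and matching segments for each pair have already been computed. The support threshold is cut off on the slide, so `min_support` is a hypothetical parameter, and a plain 3-gram Jaccard stands in for the IDF-weighted cosine used for weak similarity.

```python
from collections import defaultdict


def qgram_jaccard(a, b, q=3):
    """Stand-in similarity: Jaccard over character q-grams."""
    def grams(text):
        s = "#".join(text.lower().split())
        return {s[i:i + q] for i in range(len(s) - q + 1)} or {s}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)


def pattern_support(pairs):
    """Support of a pattern = number of distinct matching segments seen with it."""
    distinct = defaultdict(set)
    for pair in pairs:
        distinct[pair["pattern"]].add(pair["site_match"])
    return {pattern: len(segments) for pattern, segments in distinct.items()}


def strong_scores(pairs, min_support=2, sim=qgram_jaccard):
    """Score each pair on its matching segments if its pattern has enough
    support; otherwise fall back to similarity over the full values."""
    support = pattern_support(pairs)
    scores = []
    for pair in pairs:
        if support[pair["pattern"]] >= min_support:
            scores.append(sim(pair["seed_match"], pair["site_match"]))
        else:
            scores.append(sim(pair["seed"], pair["site"]))
    return scores
```

Each pair is a dictionary with the keys `seed`, `site`, `pattern` (a tuple), `seed_match`, and `site_match`. Counting distinct matching segments is what separates a genuine template (different addresses sharing one pattern) from pairs that merely share "New York, NY".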

Need for support of a matching pattern. Columns: Address (Seed), Address (Web site), Matching pattern, SS, Matching segments. Row 1: 120 Lexington Avenue New York, NY; Fifth Ave New York, NY; matching segments: New York, NY. Row 2: 312 W 34th Street New York, NY; Madison Ave New York, NY; matching segments: New York, NY. Support( ) = 1, hence Strong Similarity = Weak Similarity

Pruning noisy matches. Match combinations of values in page. Prune combinations that don't match attribute values in any seed record. Seed DB (Name, Address): Chinese Mirrch; 120 Lexington Ave, New York, NY. Tiffin Wallah; 127 E 28th St New York, NY

Apriori-style enumeration over page positions X1, X2, X3. Round 1: (sup=2). Round 2: (sup=2), (sup=0). Prune attribute position combinations with low support, where support = # pages in which values at positions match attribute values in a seed record
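The enumeration can be sketched level by level, Apriori style. The sketch assumes each page is a dictionary from position (X1, X2, …) to the value found there, uses exact string equality where the deck would use its similarity measures, and treats `min_support` as an assumed threshold. Function names are hypothetical.

```python
from itertools import combinations


def support(candidate, pages, seed_records):
    """Number of pages where some seed record matches every (position, attribute)."""
    count = 0
    for page in pages:
        if any(all(pos in page and attr in rec and page[pos] == rec[attr]
                   for pos, attr in candidate)
               for rec in seed_records):
            count += 1
    return count


def frequent_combinations(pages, seed_records, min_support=2):
    """Return all (position, attribute) combinations with enough support."""
    positions = sorted({pos for page in pages for pos in page})
    attributes = sorted({attr for rec in seed_records for attr in rec})
    # Round 1: single position/attribute assignments.
    level = [frozenset([(p, a)]) for p in positions for a in attributes]
    level = [c for c in level if support(c, pages, seed_records) >= min_support]
    frequent = list(level)
    while level:
        candidates = set()
        for a, b in combinations(level, 2):
            merged = a | b
            if len(merged) != len(a) + 1:
                continue  # only join sets that differ in exactly one element
            if len({p for p, _ in merged}) < len(merged):
                continue  # a position cannot hold two attributes
            if len({attr for _, attr in merged}) < len(merged):
                continue  # an attribute cannot sit at two positions
            candidates.add(merged)
        # Prune low-support combinations before building the next round.
        level = [c for c in candidates if support(c, pages, seed_records) >= min_support]
        frequent.extend(level)
    return frequent
```

If X3 holds the name of a different restaurant (say, a "related listings" slot), the single assignment (X3, name) still has support in round 1, but its combination with (X2, address) matches no seed record and is pruned in round 2.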

Experimental results. Datasets and attributes. Restaurant: Name (core), Address (core), Phone, Payment, Cuisine. Bibliography: Title (core), Author (core), Source

Strong vs. weak similarity. Extraction precision of WS and SS is comparable, and precision increases with threshold. Coverage of SS is steady with respect to threshold, while coverage of WS drops at high thresholds

Strong similarity scores. SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs. Columns: String 1, String 2, WS, SS. Row 1: 980 n michigan ave 14th floor chicago il; 980 n michigan ave chicago il. Row 2: e north ave west chicago il; w north ave west chicago il

Extraction Precision

Coverage Seed data size (Restaurant)

Summary. The Web is a vast repository of human knowledge. Building a (structured) knowledge base can improve search and help users find relevant information. Key challenge: unsupervised information extraction from Web pages. Content redundancy on the Web can be used for unsupervised extraction with high precision. Future work: handling numeric attributes and browse pages; detecting and integrating records for the same entity