Building Knowledge Bases from the Web Rajeev Rastogi Yahoo! Labs Bangalore.

Basic premise: The Web is a vast repository of human knowledge

Diverse information spanning multiple verticals: Wikipedia, Product, Business, People, …

Grand challenge: Mine the Web to build knowledge bases (KBs) of people, places, things, events, …

Name | Address | Phone
Chinese Mirch | 120 Lexington Ave (between 28th St & 29th St) New York, NY 10016 | (212) 532-3663

Camera | Aspect Ratio | Megapixels
Canon Powershot 600 | 4:3 | 0.5
Olympus D-300L | 4:3 | 0.8

Product Name | List Price | Sale Price
Apple iPod nano 8 GB Black (5th Generation) | $145.00 | $139.99

Name | Affiliation | # connections
Rajeev Rastogi | Yahoo! Labs Bangalore | 142

What did search look like in the past?

Search results of the future: Structured abstracts (examples from yelp.com, babycenter, epicurious, answers.com, LinkedIn, webmd, New York Times, Gawker)

Comparison shopping: Rank by price

Product near me

Topic entity pages
– A topic-based page automatically generated in real time
– Relevant multimedia content including music, videos, information from Wikipedia, etc.
– Up to the minute: Latest info using news feeds, blogs, Twitter, Flickr to stay up to date on Madonna
– Page sections in the example: Celebrity, Music, Videos, Related Topics

Building KBs from the Web is a hard problem: Billions of pages with diverse structure, conflicting information, noise (examples: yelp.com, superpages.com)

Page content/structure changes constantly: ~2% of sites change each day (figure: old vs. new version of a page)

KB creation pipeline
1. Content acquisition: Acquire content from the Web
2. Information extraction: Extract structured data for entities from Web pages
3. Disambiguation & Integration: Identify and integrate data for each entity
Example label in the figure: Roma Bistro Paris

IE example
Fields marked on the page: Name, Address, Cuisine, Phone, Price, Reviews

Name | Address | Phone
Chinese Mirch | 120 Lexington Ave New York, NY 10016 | (212) 532-3663

Template-based Web pages
– From head/torso sites
– Pages have similar structure
– ~30% of crawled Web pages
– Information rich: 31% of search results

Hand-crafted pages
– Mainly from tail sites
– Pages have diverse structures

Browse pages: Similar-structured records

Unstructured text

Web extraction landscape (figure)
– Signals: site structure, page structure, content redundancy, content features, context
– Techniques: pattern-based, wrapper, record identification, content matching, machine learning models
– Page types: unstructured text, template-based pages, hand-crafted and browse pages
– Systems cited: Snowball [AG 00], HCRF [ZNWZM 06], MLN [YCWZZM 09], RoadRunner [CMM 01], DEPTA [ZL 05], [KWD 97], [MMK 99], [GRST 10]

Wrapper induction
Technique for extraction from template-based pages
– Learn: cluster Website pages, annotate sample pages, learn XPath rules from the annotations
– Extract: apply the rules to Website pages to produce records
– Monitor: watch for site changes that affect the rules

Clustering pages: Group structurally similar pages using shingle signatures

Page shingle signature
– Tags: html body @id textarea @id div /div /textarea … br/ /body /html
– Pipeline in the figure: Tags, Windows, Hash, Min, Shingle
– Page signature: Vector of shingles
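The pipeline on this slide can be sketched roughly as follows. This is a minimal illustration, not the deployed system: the window size, the number of hash functions, the use of SHA-256, and the helper names (`TagCollector`, `page_signature`, `signature_similarity`) are all assumptions made for the example.

```python
import hashlib
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects a page's tag sequence: tag, @attribute names, /tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        self.tags.extend("@" + name for name, _ in attrs)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def window_hash(seed, window):
    # One "hash function" per seed; SHA-256 is an arbitrary choice here.
    data = (str(seed) + "|" + " ".join(window)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")


def page_signature(html, window=4, num_hashes=8):
    """One shingle per hash function: the minimum hash over all tag windows."""
    tags = tag_sequence(html)
    windows = [tuple(tags[i:i + window])
               for i in range(max(1, len(tags) - window + 1))]
    return [min(window_hash(seed, w) for w in windows)
            for seed in range(num_hashes)]


def signature_similarity(a, b):
    """Fraction of shingle positions on which two signatures agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    page_1 = "<div id='a'><span class='tel'>(212) 532-3663</span><br/></div>"
    page_2 = "<div id='b'><span class='tel'>(212) 555-0100</span><br/></div>"
    print(signature_similarity(page_signature(page_1), page_signature(page_2)))
```

Because only tags and attribute names enter the signature, two pages generated from the same template get the same signature even when their text differs, which is what makes the signature usable for grouping structurally similar pages.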

Rule learning: XPath generalization
/html/body/div/div/div/div/div/div/span[@class=tel] generalizes to //span[@class=tel]
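A small sketch of why the generalized rule is preferable, using Python's standard `xml.etree.ElementTree`. It supports only a subset of XPath and expects paths relative to the root element, so the rules are written with a leading `.`. The two toy pages and the `extract` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The same record before and after a layout change that adds one wrapper <div>.
OLD_PAGE = ("<html><body><div><div>"
            "<span class='tel'>(212) 532-3663</span>"
            "</div></div></body></html>")
NEW_PAGE = ("<html><body><div><div><div>"
            "<span class='tel'>(212) 532-3663</span>"
            "</div></div></div></body></html>")

# Full path from the root (the root <html> element is the context node).
SPECIFIC_RULE = "./body/div/div/span[@class='tel']"
# Generalized rule: any <span class="tel"> below the root.
GENERAL_RULE = ".//span[@class='tel']"


def extract(page, rule):
    """Return the text of every element the rule selects."""
    root = ET.fromstring(page)
    return [element.text for element in root.findall(rule)]


if __name__ == "__main__":
    for name, page in (("old", OLD_PAGE), ("new", NEW_PAGE)):
        print(name, extract(page, SPECIFIC_RULE), extract(page, GENERAL_RULE))
```

The specific rule stops matching as soon as the nesting depth changes, while the generalized rule keeps extracting the phone number from both versions of the page.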

Learning robust XPaths
– Goal: the most general XPath that matches all the annotated values and none of the unannotated values
– Start from the most general XPath and SPECIALIZE
– XPaths shown in the figure: //*, //h1, //span, //*[@class=tel], //span[@class=tel]
– Use Apriori to generate candidate XPaths
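The search for the most general matching XPath can be illustrated with a deliberately simplified sketch. Instead of an Apriori-style lattice, it tries a short, fixed list of candidates from general to specific and returns the first one that selects exactly the annotated nodes. The candidate order, the toy page, and the `learn_xpath` helper are assumptions for this example, and `xml.etree.ElementTree` again stands in for a full XPath engine.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<h1>Chinese Mirch</h1>
<div><span class="addr">120 Lexington Ave</span>
<span class="tel">(212) 532-3663</span></div>
</body></html>"""


def candidates(node):
    """Candidate rules for one annotated node, most general first."""
    yield ".//*"
    yield ".//" + node.tag
    for name, value in sorted(node.attrib.items()):
        yield ".//*[@%s='%s']" % (name, value)
    for name, value in sorted(node.attrib.items()):
        yield ".//%s[@%s='%s']" % (node.tag, name, value)


def learn_xpath(root, annotated):
    """First candidate that matches all annotated nodes and nothing else."""
    wanted = {id(node) for node in annotated}
    for rule in candidates(annotated[0]):
        if {id(node) for node in root.findall(rule)} == wanted:
            return rule
    return None  # no candidate in this small family separates the annotations


if __name__ == "__main__":
    root = ET.fromstring(PAGE)
    phone = root.find(".//span[@class='tel']")
    print(learn_xpath(root, [phone]))
```

Here `//span` is rejected because it also matches the unannotated address, so the search moves on to a rule with a class predicate.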

Detecting site changes
– During Learn: For each cluster, store the page signature and extracted record for a small number of pages
– Monitoring: Crawl the pages daily and compare page signatures and extracted records
– Timeline in the figure: Day 0 (signature & record stored), Day n (match), Day m (signature/record mismatch)
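A minimal sketch of the daily comparison, assuming a signature is a list of shingles and a record is a dict of attribute values. The `Snapshot` type, the 0.8 overlap threshold, and the rule that any difference between records counts as a mismatch are assumptions for illustration; the slides do not specify them.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    signature: list  # page shingle signature
    record: dict     # attribute values extracted by the wrapper


def signature_overlap(a, b):
    """Fraction of shingle positions on which two signatures agree."""
    if not a or not b:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def check_page(stored, current, min_overlap=0.8):
    """Compare today's crawl of a monitored page with the stored snapshot."""
    if signature_overlap(stored.signature, current.signature) < min_overlap:
        return "signature mismatch"
    if stored.record != current.record:
        return "record mismatch"
    return "match"


if __name__ == "__main__":
    day_0 = Snapshot([11, 42, 7, 93], {"name": "Chinese Mirch", "phone": "(212) 532-3663"})
    day_m = Snapshot([11, 42, 7, 93], {"name": "Chinese Mirch", "phone": ""})
    print(check_page(day_0, day_m))
```

A mismatch on either check would flag the cluster so that its rules can be relearned.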

Wrapper system deployed in Yahoo!
– 250M extractions from 200 sites (product, business)
– Avg num of clusters per site: 24
– Avg num of pages annotated per cluster: 1.6

Limitations of wrappers
– Won't work across Web sites due to different page layouts
– Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites can be time-consuming & expensive

Holy grail of IE research
– Unsupervised IE: Extract attribute values from pages of a new Web site without annotating a single page from the site
– OK to annotate pages from a few sites initially to create training data

Key observation: Web sites contain redundant content (that is, pages for the same entity), e.g. yelp.com and superpages.com

Content matching approach
– Step 1: Populate seed database from few initial sites (using wrappers)
– Step 2: Match values in pages of a new site with seed record values
– Step 3: Use matched values to extract records (wrappers for the new site), expand seed database

Seed DB
Name | Address
Chinese Mirrch | 120 Lexington Ave, New York, NY 10016
Tiffin Wallah | 127 E 28th St New York, NY 10079
21 Club | 21 W 52nd St New York, NY 10019 (new record added in Step 3)
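The three steps can be sketched in a few lines under strong simplifying assumptions: a page is a dict from a position label to its text, matching is exact after normalization (the slides use similarity measures instead), and the most frequently matching position per attribute plays the role of the learned wrapper. All names and the position labels are hypothetical.

```python
from collections import Counter


def normalize(text):
    return " ".join(text.lower().replace(",", " ").split())


def record_key(record):
    return tuple(sorted((attr, normalize(value)) for attr, value in record.items()))


def learn_positions(pages, seed):
    """Step 2: for each attribute, find the position whose values match seed values most often."""
    votes = {}
    for page in pages:
        for position, text in page.items():
            for record in seed:
                for attribute, value in record.items():
                    if normalize(text) == normalize(value):
                        votes.setdefault(attribute, Counter())[position] += 1
    return {attr: counts.most_common(1)[0][0] for attr, counts in votes.items()}


def expand_seed(pages, seed):
    """Step 3: apply the learned positions to every page; return records not yet in the seed DB."""
    positions = learn_positions(pages, seed)
    if not positions:
        return []
    known = {record_key(record) for record in seed}
    new_records = []
    for page in pages:
        if not all(position in page for position in positions.values()):
            continue
        record = {attr: page[position] for attr, position in positions.items()}
        if record_key(record) not in known:
            known.add(record_key(record))
            new_records.append(record)
    return new_records


if __name__ == "__main__":
    seed_db = [  # Step 1: seed records obtained from a few initial sites
        {"name": "Chinese Mirch", "address": "120 Lexington Ave, New York, NY 10016"},
        {"name": "Tiffin Wallah", "address": "127 E 28th St, New York, NY 10079"},
    ]
    new_site = [
        {"h1": "Chinese Mirch", "p.addr": "120 Lexington Ave, New York, NY 10016"},
        {"h1": "Tiffin Wallah", "p.addr": "127 E 28th St, New York, NY 10079"},
        {"h1": "21 Club", "p.addr": "21 W 52nd St, New York, NY 10019"},
    ]
    print(expand_seed(new_site, seed_db))
```

No page of the new site is annotated by hand: the overlap with the seed database supplies the training signal, and the page for 21 Club yields a new record.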

Key challenge 1: Diverse attribute value representations (impacts recall). Values in the seed DB records above can differ from the values on a page through spelling errors (e.g. Chinese Mirrch) and variants.

Key challenge 2: Noisy attribute value matches (impacts precision). Some matches between page values and the seed DB records above are noisy.

Baseline similarity measure
– Use q-grams to handle spelling errors
– Weak Similarity (WS) = Cosine similarity between IDF-weighted q-grams
– Weight of a q-gram (attribute-specific) = Sum of the IDFs of the words it appears in

String | 3-grams
chinese mirch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch }
chinese mirrch | { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch }
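A minimal sketch of weak similarity as described on this slide. Several details are assumptions: the IDF is smoothed as log(1 + N/df), words unseen in the corpus get the maximum IDF, a q-gram that spans a word boundary receives the IDFs of both words it overlaps, and repeated q-grams accumulate weight. The function names are hypothetical.

```python
import math
from collections import Counter


def qgrams(value, q=3):
    """Lower-cased q-grams with '#' standing for the space between words."""
    text = "#".join(value.lower().split())
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def build_idf(values):
    """Attribute-specific IDF per word, plus a default for unseen words."""
    n = len(values)
    df = Counter(word for value in values for word in set(value.lower().split()))
    return {word: math.log(1 + n / count) for word, count in df.items()}, math.log(1 + n)


def weighted_qgrams(value, idf, default_idf, q=3):
    words = value.lower().split()
    if not words:
        return {}
    text = "#".join(words)
    owner = []  # owner[i] = index of the word that character i belongs to
    for index, word in enumerate(words):
        owner.extend([index] * len(word))
        owner.append(None)  # the '#' separator belongs to no word
    owner.pop()
    vector = {}
    for i in range(len(text) - q + 1):
        touched = {o for o in owner[i:i + q] if o is not None}
        weight = sum(idf.get(words[o], default_idf) for o in touched)
        vector[text[i:i + q]] = vector.get(text[i:i + q], 0.0) + weight
    return vector


def cosine(u, v):
    dot = sum(weight * v[gram] for gram, weight in u.items() if gram in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def weak_similarity(a, b, idf, default_idf):
    return cosine(weighted_qgrams(a, idf, default_idf), weighted_qgrams(b, idf, default_idf))


if __name__ == "__main__":
    idf, default_idf = build_idf(["chinese mirch", "tiffin wallah", "21 club"])
    print(weak_similarity("chinese mirch", "chinese mirrch", idf, default_idf))
```

The misspelled name still shares most of its 3-grams with the correct one, so its weak similarity stays well above that of unrelated names, which share none.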

Strong similarity
Templatized content (e.g. the "between … and …" part of the Web site addresses) lowers weak similarity (WS):

Address (Seed DB) | Address (Web site) | WS
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 0.53
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 0.49

Strong similarity is defined between two sets of strings
1. Calculate the matching pattern between weakly similar pairs in the two sets
2. Pick matching patterns with sufficient support
3. Use only portions selected by the matching pattern in the final similarity calculation

Computing matching pattern
Seed: 120 Lexington Avenue New York NY 10016
Web site: 120 Lexington Ave (Between 28th And 29th St) New York NY 10016
1. Perform max-weight bipartite matching to find matching words (Edge weight = Jaccard similarity over 3-grams; the six matched word pairs in the example have weight 1)
2. Form segments by grouping contiguous matching words
3. Assign each segment s_i a label: 0 if non-matching, j if matching segment s_j
Matching pattern: 103 103 (segments s1 s2 s3 in each string)
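The three steps can be sketched as below. Note one substitution: instead of max-weight bipartite matching, this example pairs words greedily in order of decreasing Jaccard weight, which is simpler and gives the same result on the slide's example but is not optimal in general. The 0.5 weight threshold, the tokenization, and the rule that words shorter than three characters form a single gram are further assumptions.

```python
import re


def tokens(value):
    return re.findall(r"[a-z0-9]+", value.lower())


def grams(word, q=3):
    return {word[i:i + q] for i in range(len(word) - q + 1)} or {word}


def jaccard(x, y):
    gx, gy = grams(x), grams(y)
    return len(gx & gy) / len(gx | gy)


def greedy_match(a, b, threshold=0.5):
    """Greedy stand-in for max-weight bipartite matching between word lists."""
    pairs = sorted(((jaccard(x, y), i, j) for i, x in enumerate(a) for j, y in enumerate(b)),
                   key=lambda t: (-t[0], t[1], t[2]))
    partner_a, partner_b = [None] * len(a), [None] * len(b)
    for weight, i, j in pairs:
        if weight < threshold:
            break
        if partner_a[i] is None and partner_b[j] is None:
            partner_a[i], partner_b[j] = j, i
    return partner_a, partner_b


def segments(partner):
    """Maximal runs of unmatched words, or of matched words with consecutive partners."""
    runs = []
    for i, p in enumerate(partner):
        if runs:
            prev = partner[runs[-1][-1]]
            if (p is None and prev is None) or (p is not None and prev is not None and p == prev + 1):
                runs[-1].append(i)
                continue
        runs.append([i])
    return runs


def labels(partner_x, partner_y):
    """Label each segment of x: 0 if non-matching, else the number j of its matching segment in y."""
    segment_of_y = {w: k for k, run in enumerate(segments(partner_y), 1) for w in run}
    return [0 if partner_x[run[0]] is None else segment_of_y[partner_x[run[0]]]
            for run in segments(partner_x)]


def matching_pattern(seed_value, site_value, threshold=0.5):
    partner_a, partner_b = greedy_match(tokens(seed_value), tokens(site_value), threshold)
    return labels(partner_a, partner_b), labels(partner_b, partner_a)


if __name__ == "__main__":
    print(matching_pattern("120 Lexington Avenue New York NY 10016",
                           "120 Lexington Ave (Between 28th And 29th St) New York NY 10016"))
```

On the slide's example both strings split into three segments and both receive the labels 1, 0, 3, matching the pattern 103 103.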

Strong similarity score computation

Address (Seed) | Address (Web site) | Matching pattern | SS | Matching segments
120 Lexington Avenue New York, NY 10016 | 120 Lexington Ave (between 28th and 29th St) New York, NY 10016 | 103 | 1 | 120 Lexington New York, NY 10016
312 W 34th Street New York, NY 10001 | 312 W 34th St (between 8th and 9th Ave) New York, NY 10001 | 103 | 1 | 312 W 34th New York, NY 10001

– Strong similarity: similarity between matching segments of values
– Support of matching pattern: # distinct matching segments
– Support(103 103) = 2
– Strong similarity only computed for patterns with support
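A small sketch of how support gates the strong similarity (SS) score, assuming the matching pattern, matching segments, and weak similarity (WS) of each pair are already known. The minimum support of 2, the token-level Jaccard used to score the segments, and the field names are assumptions for illustration; the rows are taken from the slides.

```python
from collections import defaultdict


def token_jaccard(a, b):
    """Stand-in similarity between the matching segments of two values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def strong_similarity(pairs, min_support=2):
    """SS per pair: segment similarity if the pattern has enough support, else WS."""
    distinct_segments = defaultdict(set)
    for pair in pairs:
        distinct_segments[pair["pattern"]].add(pair["seed_segments"])
    scores = []
    for pair in pairs:
        if len(distinct_segments[pair["pattern"]]) >= min_support:
            scores.append(token_jaccard(pair["seed_segments"], pair["site_segments"]))
        else:
            scores.append(pair["ws"])
    return scores


if __name__ == "__main__":
    pairs = [
        {"pattern": "103 103", "ws": 0.53,
         "seed_segments": "120 Lexington New York, NY 10016",
         "site_segments": "120 Lexington New York, NY 10016"},
        {"pattern": "103 103", "ws": 0.49,
         "seed_segments": "312 W 34th New York, NY 10001",
         "site_segments": "312 W 34th New York, NY 10001"},
        {"pattern": "010 010", "ws": 0.35,
         "seed_segments": "New York, NY", "site_segments": "New York, NY"},
        {"pattern": "010 010", "ws": 0.32,
         "seed_segments": "New York, NY", "site_segments": "New York, NY"},
    ]
    print(strong_similarity(pairs))
```

The pattern 103 103 is backed by two distinct matching segments, so its pairs are scored on those segments alone. The pattern 010 010 always matches the same "New York, NY" segment, so its support is 1 and the score falls back to WS.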

Need for support of a matching pattern

Address (Seed) | Address (Web site) | Matching pattern | SS | Matching segments
120 Lexington Avenue New York, NY 10016 | 1075 Fifth Ave New York, NY 10128 | 010 | 0.35 | New York, NY
312 W 34th Street New York, NY 10001 | 1167 Madison Ave New York, NY 10128 | 010 | 0.32 | New York, NY

Support(010 010) = 1, hence Strong Similarity = Weak Similarity

Pruning noisy matches
– Match combinations of values in page against the seed DB records above
– Prune combinations that don't match attribute values in any seed record

Apriori-style enumeration (over page positions X1, X2, X3)
– Round 1: (sup=2)
– Round 2: (sup=2), (sup=0)
– Prune attribute position combinations with low support
– support = # pages in which values at positions match attribute values in a seed record
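The enumeration can be sketched as follows, assuming a page is a dict from position (X1, X2, X3) to its value, a candidate is a tuple of (attribute, position) pairs, and matching is exact equality. The toy pages, in which X3 holds the name of a different nearby restaurant, and the minimum support of 2 are assumptions chosen to reproduce the supports shown on the slide.

```python
from itertools import combinations


def support(candidate, pages, seed):
    """# pages whose values at the candidate's positions all match one seed record."""
    return sum(
        any(all(pos in page and attr in record and page[pos] == record[attr]
                for attr, pos in candidate)
            for record in seed)
        for page in pages)


def frequent_combinations(pages, seed, min_support=2):
    attributes = sorted({attr for record in seed for attr in record})
    positions = sorted({pos for page in pages for pos in page})
    level = [((attr, pos),) for attr in attributes for pos in positions]
    result = {}
    while level:
        kept = []
        for candidate in level:
            sup = support(candidate, pages, seed)
            if sup >= min_support:  # prune low-support combinations
                result[candidate] = sup
                kept.append(candidate)
        merged = set()
        for c1, c2 in combinations(kept, 2):  # join surviving candidates
            union = tuple(sorted(set(c1) | set(c2)))
            distinct_attrs = len({attr for attr, _ in union}) == len(union)
            distinct_positions = len({pos for _, pos in union}) == len(union)
            if len(union) == len(c1) + 1 and distinct_attrs and distinct_positions:
                merged.add(union)
        level = sorted(merged)
    return result


if __name__ == "__main__":
    seed_db = [
        {"name": "Chinese Mirch", "address": "120 Lexington Ave"},
        {"name": "Tiffin Wallah", "address": "127 E 28th St"},
    ]
    pages = [
        {"X1": "Chinese Mirch", "X2": "120 Lexington Ave", "X3": "Tiffin Wallah"},
        {"X1": "Tiffin Wallah", "X2": "127 E 28th St", "X3": "Chinese Mirch"},
    ]
    for candidate, sup in frequent_combinations(pages, seed_db).items():
        print(candidate, sup)
```

In round 1 the noisy position X3 looks as good as X1 for the name attribute. In round 2 the combination of X3 with the address position has support 0 and is pruned, because no single seed record contains both values.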

Experimental results: Datasets and attributes

Restaurant | Bibliography
Name (core) | Title (core)
Address (core) | Author (core)
Phone | Source
Payment |
Cuisine |

Strong vs Weak similarity
– Extraction precision of WS and SS is comparable; precision increases with threshold
– Coverage of SS is steady with respect to threshold; coverage of WS drops at high thresholds

Strong similarity scores
SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs

String 1 | String 2 | WS | SS
980 n michigan ave 14th floor chicago il | 980 n michigan ave chicago il 60611 | 0.57 | 1
1100 e north ave west chicago il 60185 | 300 w north ave west chicago il 60185 | 0.74 |

Extraction Precision

Coverage vs. seed data size (Restaurant)

Summary
– The Web is a vast repository of human knowledge
– Building a (structured) knowledge base can improve search and help users find relevant information
– Key challenge: Unsupervised information extraction from Web pages
– Content redundancy on the Web can be used for unsupervised extraction with high precision
– Future work
  – Handling numeric attributes, browse pages
  – Detecting and integrating records for the same entity
