

Presentation transcript: "Querying for relations from the semi-structured Web", Sunita Sarawagi, IIT Bombay

1 Querying for relations from the semi-structured Web
Sunita Sarawagi, IIT Bombay (http://www.cse.iitb.ac.in/~sunita)
Contributors: Rahul Gupta, Girija Limaye, Prashant Borole, Rakesh Pimplikar, Aditya Somani

2 Web Search
Mainstream web search: the user issues keyword queries and the search engine returns a ranked list of documents. Fifteen glorious years of serving all of users' search needs through this least common denominator.
Structured web search: the user issues natural-language or structured queries and the search engine returns point answers or record sets. There are many challenges in understanding both the query and the content; fifteen years of slow but steady progress.

3 The Quest for Structure
Vertical structured search engines: domain-specific structure and schema.
- Shopping: Shopbot (Etzioni et al., 1997): product name, manufacturer, price
- Publications: CiteSeer (Lawrence, Giles et al., 1998): paper title, author name, email, conference, year
- Jobs: Flipdog (WhizBang! Labs, Mitchell et al., 2000): company name, job title, location, requirements
- People: DBLife (Doan, 2007): name, affiliations, committees served, talks delivered
This triggered much research on extraction and IR-style search of structured data (BANKS, 2002).

4 Horizontal Structured Search
Domain-independent structure: a small, generic set of structured primitives over entities, types, relationships, and properties.
Examples: IsA (Mysore is a city), Has (average rainfall), born-in, CEO-of.

5 Types of Structured Search (Web + People)
- Structured databases (ontologies): created manually (Psyche) or semi-automatically (Yago). Examples: True Knowledge (2009), Wolfram Alpha (2009).
- Web annotated with structured elements: queries are keywords plus structured annotations (example: +cosmos). Open-domain structure extraction and annotation of web documents (2005).

6 Users, Ontologies, and the Web
Users are from Venus: bi-syllabic, impatient, believe in mind-reading.
Ontologies are from Mars: one structure to fit all.
Web content creators are from some other galaxy.
Let search engines bring the users.

7 What is missed in Ontologies
The trivial, the transient, and the textual:
- Procedural knowledge: what do I do on an error?
- A huge body of invaluable text of various types: reviews, literature, commentaries, videos.
- Context: by stripping knowledge to its skeletal form, the context that is so valuable for search is lost.
As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.

8 Structured annotations in HTML
- IsA annotations: KnowItAll (2004)
- Open-domain relationships: TextRunner (Banko, 2007)
- Ontological annotations: SemTag and Seeker (2003)
- Wikipedia annotations: Wikify! (2007), CSAW (2009)
All view documents as a sequence of tokens; it is challenging to ensure high accuracy.

9 WWT: Table queries over the semi-structured web

10 Queries in WWT
Query by content (example query rows):
  Alan Turing | Turing Machine
  E. F. Codd | Relational Databases
and
  Desh | Late night
  Bhairavi | Morning
  Patdeep | Afternoon
Query by description (column headers only):
  Inventor | Computer science concept | Year
  Indian states | Airport | City

11 Answer: Table with ranked rows
  Person | Concept/Invention
  Alan Turing | Turing Machine
  Seymour Cray | Supercomputer
  E. F. Codd | Relational Databases
  Tim Berners-Lee | WWW
  Charles Babbage | Babbage Engine

12 Keyword search to find structured records (query: computer science concept inventor year)
- Verbose articles are returned, not structured tables.
- The only document with an unstructured list contains just some of the desired records.
- The desired records are spread across many documents.
The correct answer is not one click away.

13 The only list in one of the retrieved pages

14 A highly relevant Wikipedia table is not retrieved in the top-k. The ideal answer should be integrated from these incomplete sources.

15 Attempt 2: Include samples in the query (known examples: alan turing machine codd relational database)
- The retrieved documents are relevant only to the keywords.
- The ideal answer is still spread across many documents.

16 WWT Architecture (diagram)
Offline: extract record sources from the Web, annotate them against an ontology, and build a content+context index plus a statistics store.
Online: the query builder turns the user's query table into a keyword query against the index; the matching sources L1, ..., Lk are passed to the extractor (a record labeler with CRF models, aided by type inference and a resolver built over a type-system hierarchy); the consolidator merges the extracted tables T1, ..., Tk into a consolidated table using cell and row resolvers; the ranker combines row and cell scores to produce the final consolidated table for the user.

17 Offline: Extract record sources from the Web
Types of sources:
- Tables: many are presentation tables; only 9% are real tables.
- Lists: many verbose lists.
- Other regular sources: record boundaries are not obvious because of arbitrary layout tags; we need to exploit visual regularity.
Extract the relevant data plus its context.

18 Offline: Annotating to an Ontology
Annotate table cells with entity nodes and table columns with type nodes.
(Diagram: an ontology fragment with type nodes such as movies, Indian_films, English_films, 2008_films, Terrorism_films, People, Entertainers, Indian_directors, and entity nodes such as A_Wednesday, Black&White, Coffee_house (film), Coffee_house (Loc), annotating a table of movie names.)

19 Challenges
- Ambiguity of entity names: Coffee house is both a movie name and a place name.
- Noisy mentions of entity names: Black&White versus Black and White.
- Multiple labels: the Yago ontology has on average 2.2 types per entity.
- Missing type links in the ontology, so the least common ancestor cannot be used. Missing link: Black&White to 2008_films. Not a missing link: 1920 to Terrorism_films.
- Scale: Yago has 1.9 million entities and 200,000 types.

20 A unified approach
A graphical model jointly labels cells and columns to maximize a sum of scores, where
  y_cj = entity label of cell c of column j
  y_j = type label of column j
- Score(y_cj): string similarity between cell c and y_cj.
- Score(y_j): string similarity between the header of column j and y_j.
- Score(y_j, y_cj): for a subsumed entity, inversely proportional to the distance between them; for an outside entity, the fraction of overlapping entities between y_j and the immediate parent of y_cj.
This handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero.
(Diagram: a column type y_j under movies, with subsumed entities y_1j, y_2j and an outside entity y_3j.)
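The objective on this slide can be written out as a small scoring function. The sketch below is a minimal illustration only, not the system's actual implementation: `str_sim` and `type_cell_score` are hypothetical helpers standing in for the string-similarity and subsumed/outside-entity scores, and the maximization simply enumerates candidate labelings, which is feasible only for a handful of candidates (the system uses graphical-model inference instead).

```python
from itertools import product

def joint_label_score(col_type, cell_labels, header, cells,
                      str_sim, type_cell_score):
    """Score a column's type label y_j together with entity labels y_cj for its cells."""
    score = str_sim(header, col_type)                  # Score(y_j): header vs. type
    for cell, y_cj in zip(cells, cell_labels):
        score += str_sim(cell, y_cj)                   # Score(y_cj): cell vs. entity
        score += type_cell_score(col_type, y_cj)       # Score(y_j, y_cj): type vs. entity
    return score

def best_joint_labeling(header, cells, candidate_types, candidate_entities,
                        str_sim, type_cell_score):
    """Brute-force maximization of the summed score (illustration only)."""
    best = (None, None, float("-inf"))
    for col_type in candidate_types:
        for cell_labels in product(*(candidate_entities[c] for c in cells)):
            s = joint_label_score(col_type, cell_labels, header, cells,
                                  str_sim, type_cell_score)
            if s > best[2]:
                best = (col_type, cell_labels, s)
    return best
```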

21 WWT Architecture (the diagram from slide 16, shown again to situate the query-time extraction steps that follow)

22 When a query Q arrives
1. Index probe. Content query: convert Q to a bag of words and search the content index. Description query: search the context index. This yields matching sources L1, L2, ..., Lk.
2. Match the sources to Q's schema and extract tables T1, T2, ..., Tk from them.
3. Soft-union/join the tables from the sources.
4. Rank the result rows (approximate selection).

23 Extraction: Content queries
Extracting the query's columns from list records. Lists are often human generated.
Query Q (example rows):
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York
A source Li (list records):
  New York University (NYU), New York City, founded in 1831.
  Columbia University, founded in 1754 as Kings College.
  Binghamton University, Binghamton, established in 1946.
  State University of New York, Stony Brook, New York, founded in 1957.
  Syracuse University, Syracuse, New York, established in 1870.
  State University of New York, Buffalo, established in 1846.
  Rensselaer Polytechnic Institute (RPI) at Troy.

24 Extraction
A rule-based extractor is insufficient. A statistical extractor needs training data, and generating that is also not easy!
Extracted table columns (from the source list on the previous slide):
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York

25 Extraction: Labeled data generation
Lists are unlabeled, but labeled records are needed to train a CRF. A fast but naive approach for generating labeled records:
Query about colleges in NY (example rows):
  New York University | New York
  Monroe College | Brighton
  State University of New York | Stony Brook
Fragment of a relevant list source:
  New York Univ. in NYC
  Columbia University in NYC
  Monroe Community College in Brighton
  State University of New York in Stony Brook, New York.

26 Extraction: Labeled data generation (continued)
In the list, look for matches of every query cell. Note that there is another match for "New York University" and another match for "New York" (query and list fragment as on the previous slide).

27 Extraction: Labeled data generation (continued)
In the list, look for matches of every query cell, then greedily map each query row to the best match in the list.

28 Extraction: Labeled data generation (continued)
Problems with the naive approach:
- Hard matching criteria give significantly lower recall: segments are missed, and natural clues such as Univ = University are not used. Unmapped segments hurt recall and are assumed to be "Other".
- Greedy matching can lead to really bad (wrongly mapped) mappings.

29 Generating labeled data: Soft approach
(Same query rows and list fragment as on slide 25.)

30 Generating labeled data: Soft approach
Compute a match score for each (query row, source row) pair: the score of the best segmentation of the source row into the query columns.
The score of a segment s for column c is the probability that cell c of the query row is the same as segment s, computed by the Resolver module based on the type of the column.
(Example: match scores such as 0.9, 0.3, 1.8 between a query row and the source rows.)

31 Generating labeled data: Soft approach (continued)
(The same computation for another query row, with example match scores 2.0, 0.7, 0.3.)
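The "score of the best segmentation" described on slides 30-31 can be computed with a small dynamic program. A minimal sketch, assuming a hypothetical `segment_score(column_index, segment_text)` helper playing the role of the Resolver; it splits a source row's token sequence into one contiguous segment per query column (the real system also allows "Other" spans between columns, which this simplification omits).

```python
from functools import lru_cache

def best_segmentation_score(tokens, num_cols, segment_score):
    """Best split of `tokens` into `num_cols` contiguous segments, one per query column.

    segment_score(c, text) -> probability-like score that `text` matches query
    column c (assumed helper; in WWT this comes from the Resolver).
    """
    n = len(tokens)

    @lru_cache(maxsize=None)
    def best(start, col):
        if col == num_cols:
            # all columns assigned; valid only if every token was consumed
            return 0.0 if start == n else float("-inf")
        best_score = float("-inf")
        # leave at least one token for each remaining column
        for end in range(start + 1, n - (num_cols - col - 1) + 1):
            text = " ".join(tokens[start:end])
            s = segment_score(col, text) + best(end, col + 1)
            best_score = max(best_score, s)
        return best_score

    return best(0, 0)
```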

32 Generating labeled data: Soft approach (continued)
Compute the maximum-weight bipartite matching between query rows and source rows: better than greedily choosing the best match for each row.
Soft string matching increases the number of labeled candidates significantly; this vastly improves recall and leads to better extraction models.
(Example edge weights 0.9, 0.3, 1.8, 0.7, 0.3, 2.0; the greedy matching is shown in red on the slide.)
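The contrast between greedy matching and maximum-weight bipartite matching can be seen in a few lines of SciPy. This sketch uses illustrative weights (not the slide's exact matrix); `scipy.optimize.linear_sum_assignment` solves the assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = query rows, columns = source-list rows; entries are soft match scores
weights = np.array([
    [1.8, 0.9, 0.3],
    [2.0, 0.7, 0.3],
])

# Greedy: each query row grabs its best source row, possibly colliding on the same row.
greedy = [int(np.argmax(w)) for w in weights]          # here both query rows pick source row 0

# Maximum-weight matching: maximize the total score with distinct source rows.
q_idx, s_idx = linear_sum_assignment(weights, maximize=True)
optimal = dict(zip(q_idx.tolist(), s_idx.tolist()))

print("greedy mapping :", greedy)    # [0, 0] - a collision
print("optimal mapping:", optimal)   # {0: 1, 1: 0} - higher total weight, no collision
```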

33 Extractor
Train a CRF on the generated labeled data.
Feature set: delimiters and HTML tokens in a window around labeled segments; alignment features.
Collective training over multiple sources.
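A minimal sketch of training a sequence-labeling CRF of this kind, assuming the sklearn-crfsuite package rather than the system's own CRF implementation, and using toy features and toy labeled records in place of the soft-matching output; the alignment features and collective training over sources mentioned on the slide are omitted.

```python
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple features for token i: the token itself and its immediate neighbours."""
    feats = {"word": tokens[i].lower(), "is_digit": float(tokens[i].isdigit())}
    if i > 0:
        feats["prev"] = tokens[i - 1].lower()
    if i + 1 < len(tokens):
        feats["next"] = tokens[i + 1].lower()
    return feats

# Hypothetical labeled records of the kind produced by the soft-matching step.
records = [
    (["New", "York", "Univ.", "in", "NYC"],
     ["COLLEGE", "COLLEGE", "COLLEGE", "OTHER", "CITY"]),
    (["Monroe", "Community", "College", "in", "Brighton"],
     ["COLLEGE", "COLLEGE", "COLLEGE", "OTHER", "CITY"]),
]
X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in records]
y = [labels for _, labels in records]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)

test = ["Cornell", "University", "in", "Ithaca"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```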

34 Experiments
Aim: reconstruct Wikipedia tables from only a few sample rows. Sample queries:
- TV series: character name, actor name, season
- Oil spills: tanker, region, time
- Golden Globe Awards: actor, movie, year
- Dadasaheb Phalke Awards: person, year
- Parrots: common name, scientific name, family

35 Experiments: Dataset
- Corpus: 16M lists from 500M pages of a web crawl; 45% of the lists retrieved by an index probe are irrelevant.
- Query workload: 65 queries; ground truth hand-labeled by 10 users over 1,300 lists; 27% of the queries are not answerable with one list (difficult).
- True consolidated table = 75% of the Wikipedia table plus 25% new rows not present in Wikipedia.

36 Extraction performance
Benefits of soft training-data generation, alignment features, and staged extraction on the F1 score: more than 80% F1 with just three query records.

37 Queries in WWT (recap of slide 10): query by content and query by description.

38 Extraction: Description queries
Example source table (Lithium | 3, Sodium | 11, Beryllium | 4) with non-informative headers, or no headers at all.

39 Context to get at relevant tables
Ontological annotations. Context is the union of the text around the tables, the headers, and ontology labels when present.
(Diagram: an ontology fragment with types such as Chemical_elements, Metals, Non_Metals, Alkali, Non-alkali, Gas, Non-gas and entities such as Aluminium, Lithium, Hydrogen, Carbon, Sodium, annotating a table of chemical elements and atomic numbers.)

40 Joint labeling of table columns
Given candidate tables T_1, T_2, ..., T_n and query columns q_1, q_2, ..., q_m, the task is to label the columns of each T_i with labels from {q_1, q_2, ..., q_m, none} so as to maximize the sum of these scores:
- Score(T, j, q_k) = ontology type match + header string match of column j with q_k.
- Score(T, *, q_k) = match of the description of T with q_k.
- Score(T, j, T', j', q_k) = content overlap of column j of table T with column j' of table T' when both are labeled q_k.
Inference in the resulting graphical model is solved via belief propagation.
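This objective can be written out directly. The sketch below is illustrative only: `unary_score` and `overlap` are hypothetical helpers standing in for the header/type-match and content-overlap scores, the per-table description score is folded into the unary term, the constraint that a query label appears at most once per table is not enforced, and the optimization uses simple coordinate updates rather than the belief propagation used in the system.

```python
def label_columns(tables, num_query_cols, unary_score, overlap, iters=10):
    """tables: list of (table_id, num_cols). Label every column with a query-column
    index 0..num_query_cols-1 or None, maximizing unary plus pairwise scores."""
    NONE = None
    labels = {(t, j): NONE for t, ncols in tables for j in range(ncols)}
    choices = list(range(num_query_cols)) + [NONE]

    def gain(t, j, q):
        if q is NONE:
            return 0.0
        s = unary_score(t, j, q)                   # header string + ontology type match
        for (t2, j2), q2 in labels.items():        # content overlap with columns that
            if (t2, j2) != (t, j) and q2 == q:     # currently carry the same label q
                s += overlap(t, j, t2, j2)
        return s

    for _ in range(iters):                          # coordinate-ascent passes
        for (t, j) in labels:
            labels[(t, j)] = max(choices, key=lambda q: gain(t, j, q))
    return labels
```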

41 WWT Architecture (the diagram from slide 16, shown again to situate the consolidation and ranking steps that follow)

42 Step 3: Consolidation
Merging the extracted tables into one, merging duplicates.
Table 1:
  Cornell University | Ithaca
  State University of New York | Stony Brook
  New York University | New York City
  Binghamton University | Binghamton
plus Table 2:
  SUNY | Stony Brook
  New York University (NYU) | New York
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse
equals the consolidated table:
  Cornell University | Ithaca
  State University of New York OR SUNY | Stony Brook
  New York University OR New York University (NYU) | New York City OR New York
  Binghamton University | Binghamton
  RPI | Troy
  Columbia University | New York
  Syracuse University | Syracuse

43 Consolidation
Challenge: deciding when two rows are the same in the face of extraction errors and missing columns, in an open domain and with no training data.
Our approach: a specially designed Bayesian network with interpretable and generalizable parameters.

44 Consolidator
Merges the tables extracted from the various lists into one; duplicate records across sources are clustered together.
Merging more than two tables jointly is NP-hard, so it is done iteratively. In each step, an extracted table T is merged into the current consolidated table CT:
- Find the extent of match between rows of T and CT, using P(RowMatch | r1, r2) from the Row Resolver.
- Compute a max-weight bipartite matching to find duplicates.
- Unmatched rows in T form new singleton clusters in CT.
(A sketch of this loop follows.)
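A minimal sketch of the iterative merge loop described above, assuming a hypothetical `row_match_prob(r1, r2)` supplied by the row resolver and comparing each incoming row against a cluster's first row for brevity; a match-probability threshold decides whether a matched pair is accepted as a duplicate.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def consolidate(tables, row_match_prob, threshold=0.5):
    """Iteratively merge extracted tables into clusters of duplicate rows.

    Each table is a list of rows; matched rows join an existing cluster,
    unmatched rows start new singleton clusters.
    """
    clusters = [[r] for r in tables[0]]                          # seed with the first table
    for table in tables[1:]:
        # probability that row i of the incoming table matches cluster j
        probs = np.array([[row_match_prob(r, c[0]) for c in clusters]
                          for r in table])
        rows, cols = linear_sum_assignment(probs, maximize=True)  # max-weight matching
        matched = set()
        for i, j in zip(rows, cols):
            if probs[i, j] >= threshold:                          # accept confident matches only
                clusters[j].append(table[i])
                matched.add(i)
        for i, r in enumerate(table):
            if i not in matched:
                clusters.append([r])                              # new singleton cluster
    return clusters
```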

45 Resolver
A Bayesian network computes P(RowMatch | rows q, r) from cell-level probabilities P(1st cell match | q_1, r_1), ..., P(i-th cell match | q_i, r_i), ..., P(n-th cell match | q_n, r_n).
The parameters are set automatically using list statistics and are derived from user-supplied type-specific similarity functions.

46 Resolver (continued)
The same Bayesian network is used both during consolidation and in the soft approach for generating labeled data.
- Parameters are trained automatically using list statistics (details in the paper).
- Supports an arbitrary number of null columns (P = 0.5).
- Exploits a type system.
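A minimal sketch of how such a row-match probability could be assembled from per-cell evidence. This is a simplified naive-Bayes-style reading of the network, not the paper's exact parameterization: `cell_sim` is a hypothetical type-specific similarity helper, the per-cell conditionals are placeholder constants (in WWT they are estimated from list statistics), and null cells contribute a neutral 0.5 as on the slide.

```python
import math

def row_match_prob(q_row, r_row, cell_sim, prior=0.5,
                   p_sim_given_match=0.9, p_sim_given_nonmatch=0.2):
    """P(RowMatch | q_row, r_row) via a naive combination of per-cell match evidence.

    q_row: list of (column_type, cell_value); r_row: list of cell values.
    cell_sim(column_type, a, b) -> similarity in [0, 1] (assumed helper).
    """
    log_match, log_nonmatch = math.log(prior), math.log(1.0 - prior)
    for (col_type, q_cell), r_cell in zip(q_row, r_row):
        # missing (null) cells are neutral evidence
        s = 0.5 if q_cell is None or r_cell is None else cell_sim(col_type, q_cell, r_cell)
        # interpolate the likelihood of observing this similarity under match / non-match
        log_match += math.log(s * p_sim_given_match + (1 - s) * (1 - p_sim_given_match))
        log_nonmatch += math.log(s * p_sim_given_nonmatch + (1 - s) * (1 - p_sim_given_nonmatch))
    m, n = math.exp(log_match), math.exp(log_nonmatch)
    return m / (m + n)
```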

47 Ranking
Factors for ranking:
- Relevance: membership in overlapping sources; support from multiple sources.
- Completeness: importance of the columns present. Penalize records with only common spam columns such as City and State.
- Correctness: extraction confidence.
Example (School | Location | State | Merged-row confidence; the support column is not recoverable from the transcript):
  - | - | NY | 0.999
  - | NYC | New York | 0.957
  New York Univ. OR New York University | New York City OR New York | New York | 0.854
  University of Rochester OR Univ. of Rochester | Rochester | New York | 0.502
  University of Buffalo | Buffalo | New York | 0.702
  Cornell University | Ithaca | New York | 0.761

48 Relevance ranking on set membership
Weighted-sum approach: the score of a set (table) t is s(t) = fraction of query rows in t; the relevance of a consolidated row r is $\sum_{t : r \in t} s(t)$.
Graph-walk approach: a random walk over query rows, table nodes, and consolidated rows, starting from the query rows, with random restarts to the query rows.
(Diagram: a tripartite graph of query rows, tables, and consolidated rows.)
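A minimal sketch of the two relevance signals on this slide. The weighted-sum part follows the formula directly; the graph walk is written as a generic personalized-PageRank iteration over an adjacency matrix of query rows, tables, and consolidated rows, which is an assumption about the walk's exact form rather than the system's precise formulation.

```python
import numpy as np

def weighted_sum_relevance(row_tables, table_score):
    """Relevance of a consolidated row = sum of s(t) over the tables t containing it,
    where s(t) = fraction of query rows found in t (given as table_score[t])."""
    return {r: sum(table_score[t] for t in tables) for r, tables in row_tables.items()}

def random_walk_relevance(adjacency, restart_nodes, alpha=0.15, iters=50):
    """Random walk with restarts to the query-row nodes over a graph whose nodes are
    query rows, tables, and consolidated rows (adjacency: n x n numpy array)."""
    n = adjacency.shape[0]
    col_sums = adjacency.sum(axis=0)
    P = adjacency / np.where(col_sums == 0, 1, col_sums)   # column-stochastic transitions
    restart = np.zeros(n)
    restart[restart_nodes] = 1.0 / len(restart_nodes)      # restart mass on query rows
    score = restart.copy()
    for _ in range(iters):
        score = (1 - alpha) * P @ score + alpha * restart
    return score
```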

49 Ranking Criteria
Score(row r) =
  graph relevance of r
  x importance of the columns C present in r (high if C functionally determines the others)
  x sum of cell extraction confidences, where each cell's confidence is a noisy-OR of the extraction confidences from the individual CRFs.
Example (School | Location | State | Merged-row confidence, with per-cell confidences in parentheses):
  New York Univ. OR New York University (0.90) | New York City OR New York (0.95) | New York (0.98) | 0.854
  University of Buffalo (0.88) | Buffalo (0.99) | New York (0.99) | 0.702
  Cornell University (0.92) | Ithaca (0.95) | New York (0.99) | 0.761
  University of Rochester OR Univ. of Rochester (0.80) | Rochester (0.95) | New York (0.99) | 0.502
  - | - | NY (0.99) | 0.999
  - | NYC (0.98) | New York (0.98) | 0.957
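The slide's product of factors can be written out directly. A minimal sketch: the graph relevance and column importance are assumed to be supplied by the earlier ranking steps, and each cell's confidence is combined across the CRFs that extracted it with a noisy-OR before the per-cell confidences are summed.

```python
def noisy_or(confidences):
    """Combine per-CRF confidences for one cell: 1 - prod(1 - c)."""
    p = 1.0
    for c in confidences:
        p *= (1.0 - c)
    return 1.0 - p

def row_score(graph_relevance, column_importance, cell_confidences_per_crf):
    """Score(row) = graph relevance x importance of the columns present
    x sum over cells of the noisy-OR of that cell's CRF extraction confidences."""
    extraction = sum(noisy_or(per_crf) for per_crf in cell_confidences_per_crf)
    return graph_relevance * column_importance * extraction

# Example: one row with three cells, each extracted by one or two CRFs.
print(row_score(0.8, 0.9, [[0.90], [0.95, 0.60], [0.98]]))
```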

50 Overall performance
To justify the sophisticated consolidation and resolution, compare against:
- Processing only the (magically known) single best list, so that no consolidation or resolution is required.
- Simple consolidation, with no merging of approximate duplicates.
WWT has more than 55% recall and beats both baselines; the gain is bigger for difficult queries. (Charts: all queries vs. difficult queries.)

51 Running time
Under 30 seconds with 3 query records; this can be improved by processing sources in parallel. Variance is high because the running time depends on the number of columns, record length, etc.

52 Related Work
- Google Squared: developed independently, launched in May 2009. The user provides a keyword query (e.g., list of Italian joints in Manhattan) and the schema is inferred; technical details are not public.
- Prior methods for extraction and resolution assume labeled data or pre-trained parameters; we generate labeled data and automatically train resolver parameters from the list sources.

53 Summary
Structured web search and the role of non-text, partially structured web sources.
The WWT system:
- Domain-independent; online structure interpretation at query time.
- Relies heavily on unsupervised statistical learning: a graphical model for table annotation, a soft approach for generating labeled data, collective column labeling for descriptive queries, a Bayesian network for resolution and consolidation, and PageRank-style relevance plus confidence from a probabilistic extractor for ranking.

54 What next?
- Designing plans for non-trivial ways of combining sources.
- Better ranking and user-interaction models.
- Expanding the query set: aggregate queries (tables are rich in quantities) and point queries (attribute-value and relationship queries).
- Interplay between the semi-structured web and ontologies: augmenting one with the other; quantifying the information in structured sources vis-a-vis text sources on typical query workloads.

