Understanding Tables on the Web, by Jingjing Wang


1 Understanding Tables on the Web Jingjing Wang

2 Problem to Solve: The World Wide Web holds a wealth of information, but it is not easy for a machine to access or process. Focus on (horizontal) HTML tables because: there are billions of tables on the Web that contain valuable information; tables are well structured and easier to understand.

3 Understanding Tables: Knowing the structure of the data? How does a human understand tables? With certain background knowledge.

4 Understanding Tables (cont.): Key questions for understanding tables: What is the most likely concept that contains a given set of entities? What is the most likely concept that has a given set of attributes? The problem of understanding a web table => associating the table with one or more semantic concepts in a general-purpose knowledge base (Probase).

5 Building a Knowledge Taxonomy (Probase): Made up of worldly facts, automatically constructed from a 50-terabyte Web corpus and other data. 2.7 million concepts, each containing a set of entities ranked by popularity or other scores, and a set of attributes used to describe the entities in that concept. The backbone of Probase is constructed with Hearst patterns, which are not powerful enough for extracting attributes and values.
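To make the Hearst-pattern idea concrete, here is a minimal sketch of one such pattern ("C such as E1, E2 and E3") in Python. The regular expression, the single-word concept restriction, and the function name `extract_isa_pairs` are illustrative assumptions; the slides do not show the pattern set actually used to build Probase.

```python
import re

# One Hearst pattern: "<concept> such as <e1>, <e2> and <e3>".
# Assumption: the concept is the single word right before "such as".
SUCH_AS = re.compile(r"(?P<concept>\w+) such as (?P<entities>[^.;]+)")

def extract_isa_pairs(sentence):
    """Return (entity, concept) pairs found by the 'such as' pattern."""
    match = SUCH_AS.search(sentence)
    if match is None:
        return []
    concept = match.group("concept")
    # Split the entity list on commas and on the words "and" / "or".
    parts = re.split(r",\s*(?:and |or )?|\s+(?:and|or)\s+", match.group("entities"))
    return [(part.strip(), concept) for part in parts if part.strip()]

pairs = extract_isa_pairs(
    "politicians such as Barack Obama, Hillary Clinton and Arnold Schwarzenegger."
)
```

A pattern of this kind yields isA pairs (entity, concept) but says nothing about attributes or values, which is the limitation the slide points out.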

6 Probase (cont.): Linguistic pattern to discover seed attributes for a concept C: "What is the A of I?", where A is an attribute and I is an instance (entity) of C. Which entities should be used? How should candidate seed attributes be ranked to obtain the final seeds? 10.5 million raw seed attributes for about 1 million classes. Identified table schemata enrich Probase. For 30 concepts, the top 20 seed attributes have 0.96 precision.
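The "What is the A of I?" pattern can be sketched as a regular expression applied to a list of text snippets. Everything below is an assumption for illustration: the regex, the function name `seed_attributes`, and in particular the ranking by raw frequency, since the slide leaves the ranking method as an open question.

```python
import re
from collections import Counter

# "What is the A of I?" with A = attribute and I = instance (entity).
PATTERN = re.compile(
    r"what is the (?P<attr>[\w ]+?) of (?P<inst>[\w .]+?)\?", re.IGNORECASE
)

def seed_attributes(snippets, entities):
    """Count attributes asked about known entities; most frequent first."""
    known = {entity.lower() for entity in entities}
    counts = Counter()
    for snippet in snippets:
        match = PATTERN.search(snippet)
        if match and match.group("inst").strip().lower() in known:
            counts[match.group("attr").strip().lower()] += 1
    # Assumption: rank candidates by how often they occur.
    return [attr for attr, _ in counts.most_common()]

seeds = seed_attributes(
    [
        "What is the birthdate of Barack Obama?",
        "What is the political party of Hillary Clinton?",
        "What is the birthdate of Hillary Clinton?",
        "What is the capital of France?",
    ],
    entities=["Barack Obama", "Hillary Clinton"],
)
```

Restricting matches to known entities of the concept is one simple way to keep unrelated attributes (here, "capital") out of the seed list.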

7 A Snippet of the Probase Taxonomy

8 The flowchart for understanding tables

9 Understanding Tables: Knowledge APIs for Schema Extraction. K_A(A): for a set of attributes A, K_A(A) returns a list of triples ..., (c_i, A_i, sa_i), ... ordered by score sa_i, where c_i is a likely concept for A, A_i ⊆ A are attributes of concept c_i, and sa_i is the score indicating the confidence of c_i given A (useful in table header detection). K_E(E): for a set of entities E, K_E(E) returns a list of triples ..., (c_i, E_i, se_i), ... ordered by score se_i, where c_i is a likely concept for E, E_i ⊆ E are entities of concept c_i, and se_i is the score indicating the confidence of c_i given E (useful when generating a header).

10 Understanding Tables (cont.): Knowledge APIs for Schema Extraction. A = {Name, Birthdate, Political Party, Assumed Office, Height}: (US presidents, {Birthdate, Political Party, Assumed Office}, 0.90), (politicians, {Birthdate, Political Party, Assumed Office}, 0.88), (NBA players, {Birthdate, Height}, 0.65), ... E = {Name, Barack Obama, Arnold Schwarzenegger, Hillary Clinton}: (politicians, {Barack Obama, Arnold Schwarzenegger, Hillary Clinton}, 0.95), (actors, {Arnold Schwarzenegger}, 0.5), ...
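The shape of the two knowledge APIs can be illustrated with a toy in-memory version. This is only a sketch: the dictionaries stand in for Probase, `make_knowledge_api` is a hypothetical helper, and the score (fraction of the query covered by the concept) is an assumption, so the numbers it produces do not match the scores on the slide.

```python
def make_knowledge_api(members_by_concept):
    """Build a toy K_A or K_E from a {concept: set of members} mapping."""
    def api(items):
        query = set(items)
        triples = []
        for concept, members in members_by_concept.items():
            matched = query & members
            if matched:
                # Assumed score: share of the query that the concept covers.
                triples.append((concept, matched, len(matched) / len(query)))
        # Highest score first; ties keep the mapping's insertion order.
        return sorted(triples, key=lambda triple: triple[2], reverse=True)
    return api

K_A = make_knowledge_api({
    "US presidents": {"Birthdate", "Political Party", "Assumed Office"},
    "politicians": {"Birthdate", "Political Party", "Assumed Office"},
    "NBA players": {"Birthdate", "Height"},
})
K_E = make_knowledge_api({
    "politicians": {"Barack Obama", "Arnold Schwarzenegger", "Hillary Clinton"},
    "actors": {"Arnold Schwarzenegger"},
})

by_attributes = K_A(["Name", "Birthdate", "Political Party", "Assumed Office", "Height"])
by_entities = K_E(["Name", "Barack Obama", "Arnold Schwarzenegger", "Hillary Clinton"])
```

Both calls return (concept, matched subset, score) triples ordered by score, which is the contract the later detection steps rely on.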

11 Understanding Tables (cont.): Header Detector. Use K_A() to evaluate each possibility and generate a set of candidate schemata, plus a term α(p, T), because the header usually has some syntactic characteristics that set it apart from the rest of the table. If candidate_schema is empty, the table possibly has no header => generate a header. In the example of Table 2, a properly set threshold finds the first row as the header: (US presidents, {Birthdate, Political Party, Assumed Office}, 0.90), (politicians, {Birthdate, Political Party, Assumed Office}, 0.88).
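A rough sketch of header detection follows. It assumes that p is a row index, that the K_A score and α(p, T) are simply added, and that a fixed threshold filters candidates; `toy_k_a`, `toy_alpha`, the bonus of 0.1, and the threshold of 0.5 are all hypothetical stand-ins, not values from the slides.

```python
TOY_ATTRIBUTES = {
    "US presidents": {"Birthdate", "Political Party", "Assumed Office"},
    "NBA players": {"Birthdate", "Height"},
}

def toy_k_a(cells):
    """Toy K_A: score = fraction of the cells that are attributes of the concept."""
    cells = set(cells)
    triples = []
    for concept, attributes in TOY_ATTRIBUTES.items():
        matched = cells & attributes
        if matched:
            triples.append((concept, matched, len(matched) / len(cells)))
    return sorted(triples, key=lambda triple: triple[2], reverse=True)

def toy_alpha(p, table):
    """Toy syntactic bonus: assume the first row is more likely a header."""
    return 0.1 if p == 0 else 0.0

def detect_header(table, k_a, alpha, threshold):
    """Return candidate schemata as (row index, concept, attributes, score)."""
    candidates = []
    for p, row in enumerate(table):
        bonus = alpha(p, table)
        for concept, attributes, score in k_a(row):
            if score + bonus >= threshold:
                candidates.append((p, concept, attributes, score + bonus))
    return sorted(candidates, key=lambda candidate: candidate[3], reverse=True)

table = [
    ["Name", "Birthdate", "Political Party", "Assumed Office"],
    ["Barack Obama", "1961-08-04", "Democratic", "2009"],
]
candidate_schema = detect_header(table, toy_k_a, toy_alpha, threshold=0.5)
```

An empty result signals that no row looks like a header, which is the case handed over to the header generator.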

12 Understanding Tables (cont.): Header Generator. For each column L_i, find the top-k most likely candidate concepts from K_E(L_i). Still no candidate_schema? => forget it!
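Header generation can be sketched as one K_E lookup per column. The toy `toy_k_e`, its coverage-based score, and the choice to return `None` when no column yields a concept are assumptions made for illustration.

```python
TOY_ENTITIES = {
    "politicians": {"Barack Obama", "Arnold Schwarzenegger", "Hillary Clinton"},
    "actors": {"Arnold Schwarzenegger"},
    "political parties": {"Democratic", "Republican"},
}

def toy_k_e(cells):
    """Toy K_E: score = fraction of the cells that are entities of the concept."""
    cells = set(cells)
    triples = []
    for concept, entities in TOY_ENTITIES.items():
        matched = cells & entities
        if matched:
            triples.append((concept, matched, len(matched) / len(cells)))
    return sorted(triples, key=lambda triple: triple[2], reverse=True)

def generate_header(columns, k_e, k=1):
    """For each column L_i, keep the top-k concepts from K_E(L_i)."""
    header = [[concept for concept, _, _ in k_e(column)[:k]] for column in columns]
    # No concept for any column: there is no candidate schema, so give up.
    return header if any(header) else None

columns = [
    ["Barack Obama", "Arnold Schwarzenegger", "Hillary Clinton"],
    ["Democratic", "Republican", "Democratic"],
]
generated = generate_header(columns, toy_k_e, k=1)
```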

13 Understanding Tables (cont.): Entity Detector. Accomplishes two tasks: detects the entity column of the table; narrows down the previously derived candidate schemata. Basic idea: the entity column should contain entities of the same concept, and it should be possible to derive the confidence of a concept for a given column; the header should contain attributes that describe the entities in the entity column.

14 Understanding Tables (cont.): Entity Detector. For s ∈ candidate_schema and a column col: E_col is the set of all cells in col, except for the one in the header that corresponds to s; A_col is the set of all attributes in s, except for the one in the current column. Apply K_A() and K_E() to obtain their possible semantics: SC_A = ordered list of (c_i, A_col, sa_i); SC_E = ordered list of (c_i, E_col, se_i). Example: (politicians, {Birthdate, Political Party, Assumed Office}, 0.88).
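The entity-column search can be sketched as a loop over columns that builds E_col and A_col and queries both APIs. The slides do not say how SC_A and SC_E are combined, so the rule used here (pick the column whose shared concept maximizes sa * se) is an assumption, as are the helper `coverage_api` and the toy data.

```python
def coverage_api(members_by_concept):
    """Toy K_A / K_E: score = fraction of the query covered by the concept."""
    def api(items):
        query = set(items)
        triples = [
            (concept, query & members, len(query & members) / len(query))
            for concept, members in members_by_concept.items()
            if query & members
        ]
        return sorted(triples, key=lambda triple: triple[2], reverse=True)
    return api

k_a = coverage_api({
    "politicians": {"Birthdate", "Political Party", "Assumed Office"},
    "NBA players": {"Birthdate", "Height"},
})
k_e = coverage_api({
    "politicians": {"Barack Obama", "Arnold Schwarzenegger", "Hillary Clinton"},
    "actors": {"Arnold Schwarzenegger"},
})

def detect_entity_column(header, rows, k_a, k_e):
    """Return (column index, concept, score) for the best entity column, or None."""
    best = None
    for col in range(len(header)):
        e_col = [row[col] for row in rows]                      # cells below the header
        a_col = [h for i, h in enumerate(header) if i != col]   # the other attributes
        sc_a = {concept: score for concept, _, score in k_a(a_col)}
        sc_e = {concept: score for concept, _, score in k_e(e_col)}
        for concept in sc_a.keys() & sc_e.keys():
            score = sc_a[concept] * sc_e[concept]  # assumed combination rule
            if best is None or score > best[2]:
                best = (col, concept, score)
    return best

header = ["Name", "Birthdate", "Political Party"]
rows = [
    ["Barack Obama", "1961-08-04", "Democratic"],
    ["Hillary Clinton", "1947-10-26", "Democratic"],
]
entity_column = detect_entity_column(header, rows, k_a, k_e)
```

Requiring the same concept in both lists is what narrows the candidate schemata: a concept supported only by the header or only by the cells is discarded.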

15 Results: The Web Table Corpus. Header detection: 200 randomly selected tables; recall: 89.5%. Entity column detection: 200 randomly selected tables extracted from Wikipedia, of which only 11 do not have an entity column; precision: 87.3% (165 / 189).
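The precision figure can be checked from the counts on the slide, assuming the denominator 189 is the 200 sampled tables minus the 11 excluded ones. The variable names are illustrative.

```python
total_tables = 200   # randomly selected Wikipedia tables
excluded = 11        # tables excluded from the evaluation
correct = 165        # correctly detected entity columns

evaluated = total_tables - excluded
precision = correct / evaluated
print(f"{correct}/{evaluated} = {precision:.1%}")  # prints "165/189 = 87.3%"
```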

16 Results: Search Engine. A semantic search engine that operates on table statements: it finds the semantics of a query and returns the set of statements that match those semantics. Four semantic components in a query: Concept, Entity, Attribute and Keyword (Concept + Attribute). Tested only on Wikipedia tables, with 3 attributes for each of 30 concepts.

17 Results: vs. Google. Ran the same queries on Google and manually judged the top 10 pages. The format of most pages makes it impractical to extract the information that is needed. Vs. Google Squared.

18 Results: Taxonomy (Probase) Expansion. Entity expansion: select the top 1000 entities ranked by ambiguity a_c(e), then use the plausibility score p_c(e) for inference. One iteration found 3.4 million entities that already exist in Probase and 4.6 million new entities for about 20,000 concepts. Attribute expansion: one iteration discovered 0.15 million new attributes for nearly 14,000 concepts.

19 Conclusion: A framework that attempts to harvest useful knowledge from the rich corpus of relational data on the Web: HTML tables. Through a multi-phase algorithm, and with the help of a universal probabilistic taxonomy (Probase), the framework is capable of understanding the entities, attributes and values in many tables on the Web. Two interesting applications: a semantic table search engine; a tool to further expand and enrich Probase.

