1 Recovering Semantics of Tables on the Web. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, Chung Wu. VLDB 2011. Presented by Xunnan Xu.

2 Annotating tables (the recovery of semantics)
   - The table title could be missing
   - The subject of the table could be missing
   - Relevant context might not be anywhere near the table
 Goal: improve table search, e.g. for queries such as
   - Bloom period (Property) of shrubs (Class) <- the case this paper focuses on
   - Color (Property) of Azalea (Instance)

3 isA database
   Berlin is a city. CSCI572 is a course.
 Relations database
   Microsoft is headquartered in Redmond. San Francisco is located in California.
 Why is this useful? Tables are structured, so the better-known values in a column can help identify the rest:
   City          | State
   San Francisco | California
   San Mateo     | California
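
A minimal sketch (not the paper's actual data structures) of how an isA database and a relations database could be represented, and how known isA facts help label a table column; the variable names and example facts are illustrative only:

```python
# Hypothetical, simplified representation of the two extracted databases.
# isA facts: instance -> set of classes; relations: (subject, object) -> relation phrase.
isa_db = {
    "berlin": {"city"},
    "san francisco": {"city"},
    "san mateo": {"city"},
    "csci572": {"course"},
}
relation_db = {
    ("microsoft", "redmond"): "is headquartered in",
    ("san francisco", "california"): "is located in",
}

def classes_for_column(values):
    """Count which classes cover the values in a column (e.g. a 'City' column)."""
    counts = {}
    for v in values:
        for c in isa_db.get(v.lower(), set()):
            counts[c] = counts.get(c, 0) + 1
    return counts

print(classes_for_column(["San Francisco", "San Mateo"]))  # {'city': 2}
```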

4 Extract (instance, class) pairs from web pages with patterns like "C such as I".
 Easy? Not really...
 To check the boundary of a Class: accept noun phrases whose last component is a plural-form noun and that are not contained in, and do not contain, another noun phrase.
   e.g. "Michigan counties such as ..." vs. "Among the lovely cities ..."
 To check the boundary of an Instance: I must occur as an entire query in the query logs.
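
A rough sketch of the "C such as I" extraction with the plural-noun boundary check, assuming a part-of-speech tagger would be available in a real system; the regex and the is_plural_noun helper are illustrative, not the paper's implementation:

```python
import re

# Hypothetical helper: in a real system this would come from a POS tagger.
def is_plural_noun(word):
    return word.endswith("s")  # crude stand-in for an NNS tag

PATTERN = re.compile(r"([\w ]+?) such as ([\w ]+?)[,.]")

def extract_pairs(sentence):
    """Extract (instance, class) candidates from one sentence."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        cls, inst = match.group(1).strip(), match.group(2).strip()
        # Boundary check on the class: its last component must be a plural noun.
        if is_plural_noun(cls.split()[-1]):
            pairs.append((inst, cls))
    return pairs

print(extract_pairs("Michigan counties such as Allegan, and more."))
# [('Allegan', 'Michigan counties')]
```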

5 Mine more instances
   "headquartered in I" => I is a city
 Handle duplicate sentences:
   Sentence fingerprint = hash of the first 250 characters
 Score the pairs:
   Score(I, C) = |Pattern(I, C)|^2 x Freq(I, C)
   Pattern(I, C) – the set of patterns that extracted the pair (I, C)
   Freq(I, C) – the number of times the pair appears
 Similar in spirit to tf-idf.
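
A small sketch of the duplicate-sentence fingerprint and the score |Pattern(I, C)|^2 x Freq(I, C) as described above; the function and variable names are made up for illustration:

```python
import hashlib
from collections import defaultdict

def sentence_fingerprint(sentence):
    """Hash of the first 250 characters, used to drop duplicate sentences."""
    return hashlib.md5(sentence[:250].encode("utf-8")).hexdigest()

# For each (instance, class) pair: which patterns matched it, and how often it was seen.
patterns_seen = defaultdict(set)   # (I, C) -> {pattern ids}
frequency = defaultdict(int)       # (I, C) -> count
seen_sentences = set()

def record(sentence, instance, cls, pattern_id):
    fp = sentence_fingerprint(sentence)
    if fp in seen_sentences:       # skip duplicated sentences
        return
    seen_sentences.add(fp)
    patterns_seen[(instance, cls)].add(pattern_id)
    frequency[(instance, cls)] += 1

def score(instance, cls):
    """Score(I, C) = |Pattern(I, C)|^2 * Freq(I, C)."""
    key = (instance, cls)
    return len(patterns_seen[key]) ** 2 * frequency[key]
```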

6 TextRunner was used to extract the relations.
 TextRunner is a research project at the University of Washington. It uses a Conditional Random Field (CRF) to detect relations among noun phrases.
 A CRF is a popular model in machine learning: pre-defined feature functions are applied to the phrase and combined into a normalized probability (0 to 1) for a labeling of the sentence.
 Example: f(sentence, i, label_i, label_{i-1}) = 1 if word_i is "in" and label_{i-1} is an adjective, otherwise 0
 => "Microsoft is headquartered in beautiful Redmond."
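
A hypothetical binary feature function in the spirit of the slide's example (not TextRunner's actual feature set); a CRF would weight many such functions and normalize them into a probability over label sequences:

```python
# Hypothetical feature: fires when the current word is capitalized and the
# previous label marks an adjective, as in "... in beautiful Redmond."
def f_capitalized_after_adjective(sentence, i, label_i, label_prev):
    words = sentence.split()
    return 1 if words[i][0].isupper() and label_prev == "ADJ" else 0

sentence = "Microsoft is headquartered in beautiful Redmond."
labels = ["ENT", "O", "REL", "REL", "ADJ", "ENT"]
# Feature value at the position of "Redmond." (index 5), given the previous label:
print(f_capitalized_after_adjective(sentence, 5, labels[5], labels[4]))  # 1
```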

7 Assumptions
   If many values in a column are assigned to a class, then the next value in that column very likely belongs to it as well.
   The best label is the one that is most likely to have produced the observed values in the column (maximum-likelihood hypothesis).
 Definitions
   v_i – the i-th value in the column
   l_i – a possible label for the column; L(A) – the best label for column A
   U(l_i, V) – the score of label l_i when assigned to the set V of values
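
The exact scoring function U(l, V) appears on the next two slides, which are not in this transcript. As a stand-in, here is a hedged sketch of the maximum-likelihood idea stated above: pick the label whose instances best explain the observed column values, using an isA database like the one from slide 3. The probability estimate and smoothing constant are assumptions for illustration, not the paper's formula:

```python
import math

def best_label(values, isa_db, candidate_labels, smoothing=1e-6):
    """Return the label l maximizing sum_i log P(v_i | l) over the column values.

    P(v | l) is crudely estimated as 1 if the isA database lists v under l,
    otherwise a small smoothing constant (an assumption, not the paper's model).
    """
    def log_likelihood(label):
        total = 0.0
        for v in values:
            p = 1.0 if label in isa_db.get(v.lower(), set()) else smoothing
            total += math.log(p)
        return total

    return max(candidate_labels, key=log_likelihood)

# isA facts redefined here so the snippet is self-contained:
isa_db = {"san francisco": {"city"}, "san mateo": {"city"}, "berlin": {"city"}}
print(best_label(["San Francisco", "San Mateo"], isa_db, ["city", "course"]))  # city
```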

8 (slide content not in this transcript)

9 (slide content not in this transcript)

10 Gold standard
   Labels are manually evaluated by annotators: vital > okay > incorrect
   Allegan, Barry, Berrien -> Michigan counties (vital)
   Allegan, Barry, Berrien -> Illinois counties (incorrect)

   Label coverage:           Web-extracted   YAGO (from Wikipedia)   Freebase
   Labeled subject columns   1,496,550       185,013                 577,811
   Instances in ontology     155,831,855     1,940,797               16,252,633

   Relation quality (128 binary relations, judged against the gold standard):
                                  Web-extracted   Freebase
   No. of relations vital/okay    83 (64.8%)      37 (28.9%)

11 Results are fetched automatically but compared manually:
   100 queries, using the top 5 results for each -> 500 results
   Results were shuffled and rated by 3 people in a single-blind test
 Scores:
   right on – has all information about a large number of instances of the class and values for the property
   relevant – has information about only some of the instances, or about properties closely related to the queried property
   irrelevant
 Candidates:
   TABLE – the method in this paper
   GOOG – results from google.com
   GOOGR – the top 1,000 results from Google, intersected with the table corpus
   DOCUMENT – a document-based approach

12 Method | All Ratings: Total (a) (b) (c) | Ratings by Queries: Total (a) (b) (c)
   TABLE  | 175  69   98  93 |  49  24  41  40
   DOC    | 399  24   58  47 |  93  13  36  32
   GOOG   | 493  63  116  52 | 100  32  52  35
   GOOGR  | 156  43   67  59 |  65  17  32  29

   Method | Query Precision: (a) (b) (c) | Query Recall: (a) (b) (c)
   TABLE  | 0.63  0.77  0.79 | 0.52  0.51  0.62
   DOC    | 0.20  0.37  0.34 | 0.31  0.44  0.50
   GOOG   | 0.42  0.58  0.37 | 0.71  0.75  0.59
   GOOGR  | 0.35  0.50  0.46 | 0.39  0.42  0.48

   (a) right on, (b) right on or relevant, (c) right on or relevant and in a table

