Using linked data to interpret tables. Varish Mulwad, September 14, 2010


1 Using linked data to interpret tables. Varish Mulwad, September 14, 2010

2 Interpreting a table: link cell value to an entity; find relationships between columns. [Diagram labels: http://dbpedia.org/resource/Baltimore, http://dbpedia.org/ontology/PopulatedPlace, LargestCity]

3 Annotate web tables; confirm existing facts in LOD; discover knowledge and new facts; search / query over web tables; data integration. 1000 reasons why it’s important …

4 Interpreting a table
@prefix rdfs:. @prefix dbpedia:. @prefix dbpedia-owl:.
“City”@en is rdfs:label of dbpedia-owl:City.
“State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
“Baltimore”@en is rdfs:label of dbpedia:Baltimore.
dbpedia:Baltimore a dbpedia-owl:City.
…

5 Overview: Introduction; Related Work & Motivation; Approach; Results; Upcoming Work; Conclusion

6 Introduction

7 The World Wide Web … [Figure: pages of unstructured text, e.g. “Talk: abc By: xyz Venue: some location”]

8 The World Wide Web … Good for you and me … not so good for machines. Images from http://www.bbc.co.uk/blogs/radiolabs/s5/linked-data/s5.html

9 Web of Data – The Semantic Web. Image – www.linkeddata.org

10 Linked Data. The principles of Linked Data outline the best practices for sharing and exposing structured data on the World Wide Web. Every resource has a URI, e.g. Baltimore: http://dbpedia.org/resource/Baltimore

11 Related Work and Motivation

12

13 Where is structured data? Chicken? Egg? … No chicken? More than a trillion documents on the Web; ~14.1 billion tables, 154 million with high-quality relational data (Cafarella et al. 2008).

14 Automate the process. We need systems that can generate data from existing sources. It is not practical for humans to encode all this into RDF manually.

15 On the Semantic Web …
Mapping relational databases to RDF [W3C working group – RDB2RDF]
Mapping spreadsheets to RDF [RDF123, XLWrap]
Practical and helpful systems but …
– Require significant manual work
– Do not generate linked data

16 … elsewhere
Learning to index tables to improve search experience (Cafarella et al. 2008)
Expanding attributes (columns) of web tables (Lin et al. 2010)
Interpreting web tables to answer complex search queries over the web tables (Limaye et al. 2010)

17 Interpreting a Table

18 T2LD Framework: Predict Class for Columns; Linking the table cells; Identify and Discover relations

19 T2LD Framework: Predict Class for Columns; Linking the table cells; Identify and Discover relations

20 Predicting Class Labels for column. Column “City”: Baltimore, Boston, New York. [Diagram labels: Type, Instance, Type; Class / Type for the column]

21 Querying the Knowledge-Base
Column: City. Top results and types per cell:
– Baltimore: 1. Baltimore 2. Baltimore County 3. John Baltimore
  Types: {dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople}
– Boston: 1. Boston 2. Boston_(band) 3. Boston_University
  Types: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, …}
– New York: 1. New_York_City 2. New_York 3. New_York_(album)
  Types: {…}

22 Scoring the classes
Possible classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, dbpedia-owl:City, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:Band, dbpedia-owl:Organisation, …
Pairs to score: [Baltimore, dbpedia-owl:City], [Boston, dbpedia-owl:City], [New York, dbpedia-owl:City], …, [Baltimore, dbpedia-owl:Band], [Boston, dbpedia-owl:Band], …
E.g. processing class “dbpedia-owl:City” for the string Baltimore (R is the rank of a result, PR its PageRank):
(R = 1) Baltimore {dbpedia-owl:City, dbpedia-owl:Place} [PR = 6]
(R = 2) Baltimore County {dbpedia-owl:AdministrativeRegion} [PR = 4]
(R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]
Score = w × (1 / R) + (1 – w) × (Normalized PageRank)
[Baltimore, dbpedia-owl:City] = (0.25 × 1 / 1) + (0.75 × 6 / 7) ≈ 0.893
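The scoring rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the T2LD implementation: the function name `pair_score` and its parameter names are hypothetical, and the normalizing constant 7 is simply copied from the slide's worked example, which does not say where it comes from.

```python
def pair_score(rank: int, page_rank: float, max_page_rank: float, w: float = 0.25) -> float:
    """Score = w * (1 / R) + (1 - w) * (normalized PageRank)."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    normalized_pr = page_rank / max_page_rank
    return w * (1.0 / rank) + (1.0 - w) * normalized_pr


# Worked example from the slide: the string "Baltimore" paired with the class
# dbpedia-owl:City. The rank-1 result (PR = 6) is the one typed as City.
# max_page_rank=7 is an assumption taken from the "6 / 7" in the slide.
baltimore_city = pair_score(rank=1, page_rank=6, max_page_rank=7, w=0.25)
print(round(baltimore_city, 3))  # 0.893
```

With w = 0.25 the PageRank term dominates, so a popular entity at a lower rank can still outscore an obscure one at rank 1.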

23 T2LD Framework: Predict Class for Columns; Linking the table cells; Identify and Discover relations

24 Approach
Input: Table Cell + Column Header + Row Data + Column Type
– Requery KB with predicted class labels as additional evidence
– Generate a feature vector for the top N results of the query
– Classifier ranks the entities within the set of possible results
– Select the highest ranked entity
– Classifier decides whether to link or not: link to the top ranked instance, or link to “NIL”

25 Learning to Rank
We trained an SVM rank classifier that learned to rank entities within a given set.
Feature vector: similarity measures and popularity measures (Levenshtein distance, Dice score, Wikitology score, PageRank, page length)
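The two string-similarity features named on this slide can be sketched as follows. This is an illustrative sketch only: the slide does not say which variant of the Dice score was used, so computing it over lowercased character bigrams is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bigrams(s: str) -> set:
    """Set of lowercased character bigrams (assumed tokenization)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over the bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# Cell string vs. a candidate entity name from the earlier example.
print(levenshtein("Baltimore", "Baltimore County"))
print(dice("Baltimore", "Baltimore County"))
```

Both values would enter the feature vector alongside the popularity measures; how the features are scaled before training is not stated in the slides.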

26 “To Link or not to Link …”
The highest ranked entity may not be the correct one to link to …
– Because the string we are querying may not be in the KB
– Top N results may not include the correct answer
We trained an SVM classifier to determine whether to link to the top-ranked entity or not.

27 “To Link or not to Link …”
The feature vector included the feature vector of the top-ranked entity and two additional features:
– The SVM rank score of the top-ranked entity
– The difference in scores between the top two ranked entities
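Assembling the input for the link / no-link classifier might look like the sketch below. The function name, the argument layout and the fallback for a single candidate are assumptions made for illustration; the slides only name the three ingredients.

```python
from typing import List, Sequence


def link_decision_features(top_features: Sequence[float],
                           ranked_scores: Sequence[float]) -> List[float]:
    """Top entity's features + its rank score + score gap to the runner-up.

    ranked_scores must be sorted in descending order (best candidate first).
    """
    if not ranked_scores:
        raise ValueError("need at least one ranked candidate")
    top_score = ranked_scores[0]
    # Assumption: with a single candidate, treat the runner-up score as 0.
    runner_up = ranked_scores[1] if len(ranked_scores) > 1 else 0.0
    return list(top_features) + [top_score, top_score - runner_up]


# Hypothetical numbers: three features of the top entity, two rank scores.
print(link_decision_features([0.0, 1.0, 0.5], [2.0, 1.5]))
# [0.0, 1.0, 0.5, 2.0, 0.5]
```

A small gap between the top two scores signals an ambiguous ranking, which is presumably why it helps the classifier decide to link to “NIL”.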

28 T2LD Framework: Predict Class for Columns; Linking the table cells; Identify and Discover relations

29 Relation between columns
City | State
Baltimore | Maryland
Boston | Massachusetts
New York | New York

30 Relation between columns
Row pairs: Maryland - Baltimore; Massachusetts - Boston; New York - New York
Candidate relations: dbonto:LargestCity, dbonto:Capital

31 Scoring the relations
Row pairs: Maryland - Baltimore; Massachusetts - Boston; New York - New York
Candidates: dbonto:Capital, dbonto:LargestCity
Scores shown: dbonto:Capital, Score: 0; dbonto:Capital, Score: 1; dbonto:LargestCity, Score: 3
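One way to read the scores on this slide is as a vote count: each candidate relation scores one point for every row in which the knowledge base confirms it. The sketch below assumes exactly that. The per-row relation lists are hypothetical, chosen only so that the totals match the final scores shown (Capital 1, LargestCity 3); the slide does not spell out the per-row assignment.

```python
from collections import Counter

# Hypothetical knowledge-base answers for each (State, City) row.
relations_per_row = {
    ("Maryland", "Baltimore"): ["dbonto:LargestCity"],
    ("Massachusetts", "Boston"): ["dbonto:Capital", "dbonto:LargestCity"],
    ("New York", "New York"): ["dbonto:LargestCity"],
}


def score_relations(rows):
    """Count, for each candidate relation, the rows in which it holds."""
    votes = Counter()
    for relations in rows.values():
        votes.update(set(relations))  # at most one vote per relation per row
    return votes


scores = score_relations(relations_per_row)
best_relation, best_score = scores.most_common(1)[0]
print(scores["dbonto:Capital"], scores["dbonto:LargestCity"], best_relation)
# 1 3 dbonto:LargestCity
```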

32 T2LD Framework: Predict Class for Columns; Linking the table cells; Identify and Discover relations

33 Annotating web tables for the Semantic Web

34 Table as linked RDF
@prefix rdfs:. @prefix dbpedia:. @prefix dbpedia-owl:. @prefix dbpprop:.
“City”@en is rdfs:label of dbpedia-owl:City.
“State”@en is rdfs:label of dbpedia-owl:AdministrativeRegion.
“Baltimore”@en is rdfs:label of dbpedia:Baltimore.
dbpedia:Baltimore a dbpedia-owl:City.
“MD”@en is rdfs:label of dbpedia:Maryland.
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion.
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion.
dbpprop:LargestCity rdfs:range dbpedia-owl:City.
Callouts:
– “City”@en is rdfs:label of dbpedia-owl:City. “City” is the common / human name for the class dbpedia-owl:City.
– dbpedia:Baltimore a dbpedia-owl:City. dbpedia:Baltimore is an instance of the type dbpedia-owl:City.
– dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion. The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion.
– dbpprop:LargestCity rdfs:range dbpedia-owl:City. The objects of the triples using the property have to be instances of dbpedia-owl:City.
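A rough sketch of how one interpreted column could be written out as RDF is given below. Several things here are assumptions rather than part of the talk: the prefix IRIs (the transcript's `@prefix` lines have lost theirs; the two DBpedia ones follow the URIs shown on earlier slides), the function name `to_turtle`, and the use of plain Turtle `subject rdfs:label "…"@en` statements in place of the `is … of` form on the slide.

```python
# Assumed prefix IRIs; not recoverable from the transcript itself.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
}


def escape(literal: str) -> str:
    """Escape backslashes and double quotes for a Turtle string literal."""
    return literal.replace("\\", "\\\\").replace('"', '\\"')


def to_turtle(column_class, cell_links):
    """Serialize one column's class and its linked cells as Turtle text."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in PREFIXES.items()]
    for label, entity in cell_links:
        if entity is None:  # cell was linked to NIL: emit nothing for it
            continue
        lines.append(f"{entity} a {column_class} ;")
        lines.append(f'    rdfs:label "{escape(label)}"@en .')
    return "\n".join(lines)


# "Unknownville" is a made-up cell that was linked to NIL.
turtle = to_turtle("dbpedia-owl:City",
                   [("Baltimore", "dbpedia:Baltimore"), ("Unknownville", None)])
print(turtle)
```

Cells that the link classifier sent to “NIL” are skipped here; whether they should instead produce new resources is left open by the slides.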

35 Results

36 Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)
Total number of entities: 639 (611)
* The number in brackets is the count excluding columns that contained numbers.

37 Dataset summary

38 Dataset summary

39 Evaluation for class label predictions

40 Evaluation # 1 (MAP)
Compared the system’s ranked list of labels against a human-ranked list of labels. Metric: Mean Average Precision (MAP), commonly used in the Information Retrieval domain to compare two ranked sets.

41 Evaluation # 1 (MAP): 80.76 %
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
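For readers unfamiliar with the metric, the sketch below computes average precision for the two lists on this slide using the standard Information Retrieval definition, in which the evaluator's labels are treated as an unordered set of relevant items. Whether the evaluation in the talk used exactly this variant is not stated, so treat it as an assumption; the function names are hypothetical.

```python
def average_precision(system_ranked, relevant):
    """Standard IR average precision of a ranked list against a relevant set."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for position, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / position  # precision at this hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    return sum(average_precision(s, r) for s, r in runs) / len(runs)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
ap = average_precision(system, evaluator)
print(round(ap, 4))  # 0.3889, i.e. (1/2 + 2/3) / 3
```

MAP is then the mean of this value over all evaluated columns; the single example here says nothing about the 80.76 % reported for the full dataset.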

42 Evaluation # 2 (Recall): Recall > 0.6 (75 %)
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
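Recall for the same pair of lists can be computed as below, assuming the usual definition: the fraction of evaluator labels that also appear among the system's labels. The function name is hypothetical, and the slide does not state the exact cut-off used for the system list.

```python
def recall(system_labels, evaluator_labels):
    """Fraction of the evaluator's labels that the system also produced."""
    relevant = set(evaluator_labels)
    if not relevant:
        return 0.0
    return len(relevant & set(system_labels)) / len(relevant)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
example_recall = recall(system, evaluator)
print(round(example_recall, 3))  # 0.667
```

Under this definition the example column has a recall of 2/3, which is above the 0.6 threshold mentioned on the slide.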

43 Evaluation # 3 (Correctness)
Evaluated whether our predicted class labels were “fair and correct”. A class label may not be the most accurate one, but may still be correct.
– E.g. dbpedia:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
Three human judges evaluated our predicted class labels.

44 Evaluation # 3 (Correctness)
A category-wise breakdown for class label correctness. Overall accuracy: 76.92 %
Column – Nationality; Prediction – MilitaryConflict
Column – Birth Place; Prediction – PopulatedPlace

45 Evaluation for linking table cells to entities

46 Category-wise accuracy for linking table cells. Overall accuracy: 66.12 %

47 Relation between columns
Idea – ask human evaluators to identify relations between columns in a given table.
Pilot experiment – asked three evaluators to annotate five random tables from our dataset.
Evaluators identified 20 relations.
Our accuracy – 5 out of 20 (25 %) were correct.

48 Future Work [Diagram label: Current]

49 Automatic/Semi-automatic template learning

50 Confirming LD Facts
Table row: Baltimore | MD | S.Rawlings | …
For Baltimore, DBpedia says: dbpprop:LeaderName – S.Dixon
dbpprop:LeaderName – S.Dixon, S.Rawlings

51 Discover knowledge, relations
Inception rdf:type dbpedia-owl:Movie
Howard County rdf:type dbpedia:AdministrativeRegion
David Beckham dbpedia-owl:Team dbpedia:Los_Angeles_Galaxy

52 Conclusion
There is a lot of data stored in HTML tables, spreadsheets, databases and documents.
We presented an automatic framework to interpret such data.
We believe our work will contribute to materializing the web of data vision.

53 References
Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A.: RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference, Springer (2008)
Langegger, A., Wöß, W.: XLWrap - querying and integrating arbitrary spreadsheets with SPARQL. In: 8th International Semantic Web Conference (ISWC2009). (2009)
Cafarella, M.J., Halevy, A.Y., Wang, Z.D., Wu, E., Zhang, Y.: WebTables: exploring the power of tables on the web. PVLDB 1 (2008) 538-549
Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB). (2010)
Lin, C.X., Zhao, B., Weninger, T., Han, J., Liu, B.: Entity relation discovery from web tables and links. In: Rappa, M., Jones, P., Freire, J., Chakrabarti, S. (eds.) WWW, 1145–1146. ACM (2010)

54

