Using linked data to interpret tables Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi University of Maryland, Baltimore County November 8, 2010.

1 Using linked data to interpret tables
Varish Mulwad, Tim Finin, Zareen Syed and Anupam Joshi
University of Maryland, Baltimore County
November 8, 2010

2 Interpreting a table
Name | Team | Position | Height
Michael Jordan | Chicago | Shooting guard | 1.98
Allen Iverson | Philadelphia | Point guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power forward | 2.11
Annotations shown on the table:
– Column class: http://dbpedia.org/class/yago/NationalBasketballAssociationTeams
– Cell link: http://dbpedia.org/resource/Allen_Iverson
– Relation between columns: dbprop:team
– Map numbers as values of properties

3 Interpreting a table
Name | Team | Position | Height
Michael Jordan | Chicago | Shooting guard | 1.98
Allen Iverson | Philadelphia | Point guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power forward | 2.11
@prefix dbpedia:.
@prefix dbpedia-owl:.
@prefix yago:.
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer.
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams.
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan.
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer.
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls.
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams.

4 Use Cases
Name | Team | Position | Height
Michael Jordan | Chicago | Shooting guard | 1.98
Allen Iverson | Philadelphia | Point guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power forward | 2.11
(the slide shows several copies of this table)
Intelligent querying over data
Create a ‘Semantic’ knowledge-base

5 Use Cases
Name | Team | Position | Height
Michael Jordan | Chicago | Shooting guard | 1.98
Allen Iverson | Philadelphia | Point guard | 1.83
Yao Ming | Houston | Center | 2.29
Tim Duncan | San Antonio | Power forward | 2.11
@prefix dbpedia:.
@prefix dbpedia-owl:.
@prefix yago:.
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer.
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams.
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan.
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer.
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls.
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams.
– Data Integration
– Search / Query over tables
– Confirm/Verify existing knowledge
– Add new knowledge to the LOD cloud
– Convert legacy data into Semantic Web formats

6 Motivation and Related Work

7

8 We are laying a strong foundation for the Semantic Web … … but an old problem haunts us …

9 Chicken? Egg? … No Chicken?
Where is structured data?
– ~14.1 billion tables, 154 million with high quality relational data (Cafarella et al. 2008)
– 305,632 datasets available as CSV or spreadsheets on Data.gov (US), plus 7 other nations establishing open data

10 Automate the process
Not practical for humans to encode all this into RDF manually
We need systems that can generate data from existing sources

11 Related Work
Database to Ontology mapping: (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)
Mapping Relational databases to RDF [W3C working group – RDB2RDF]

12 Related Work
Mapping spreadsheets to RDF [RDF123, XLWrap]
Practical and helpful systems but …
– Require significant manual work
– Do not generate linked data
Interpreting web tables to answer complex search queries over the web tables (Limaye et al. 2010)

13 T2LD Framework
– Predict Class for Columns
– Linking the table cells
– Identify and Discover relations

14 Predict Class for Columns
– Predict Class for Columns (current step)
– Linking the table cells
– Identify and Discover relations

15 Predicting Class Labels for column
Column: Team (Chicago, Philadelphia, Houston, San Antonio)
Figure labels: Instance, Class, Class for the column; candidate classes Class 1, Class 2, Class 3, Class 4

16 Knowledge Base
Yago
Wikitology [1]: a hybrid knowledge base where structured data meets unstructured data
[1] Wikitology was created as part of Zareen Syed’s Ph.D. dissertation

17 Querying the Knowledge-Base
Column: Team (Chicago, Philadelphia, Houston, San Antonio)
Top results per cell:
– Chicago: 1. Chicago Bulls 2. Chicago 3. Judy Chicago
– Philadelphia: 1. Philadelphia 2. Philadelphia 76ers 3. Philadelphia (film)
– Houston: 1. Houston Rockets 2. Houston 3. Allan Houston
Types:
– {dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
– {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
– {…}
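The candidate classes for a column can be read as the union of the type sets returned for each cell's top results. The following is a minimal sketch under that reading. The `KB_RESULTS` dictionary and the `candidate_classes` helper are hypothetical names, the Chicago entries follow the slides, and the per-entity types for Philadelphia are an assumption made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical top-3 knowledge-base results per cell: (entity, set of types).
# Chicago follows the slides; the per-entity split for Philadelphia is assumed.
KB_RESULTS: Dict[str, List[Tuple[str, Set[str]]]] = {
    "Chicago": [
        ("Chicago Bulls", {"yago:NationalBasketballAssociationTeams"}),
        ("Chicago", {"dbpedia-owl:PopulatedPlace", "dbpedia-owl:City"}),
        ("Judy Chicago", {"yago:WomenArtist", "yago:LivingPeople"}),
    ],
    "Philadelphia": [
        ("Philadelphia", {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace"}),
        ("Philadelphia 76ers", {"yago:NationalBasketballAssociationTeams"}),
        ("Philadelphia (film)", {"dbpedia-owl:Film"}),
    ],
}


def candidate_classes(results: Dict[str, List[Tuple[str, Set[str]]]]) -> Set[str]:
    """Union of all types seen in the top results of every cell in the column."""
    classes: Set[str] = set()
    for hits in results.values():
        for _entity, types in hits:
            classes |= types
    return classes


classes = candidate_classes(KB_RESULTS)
print(sorted(classes))
```

Every class in this set is then paired with every cell string and scored, as the next slide describes.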

18 Scoring the classes
Possible classes for the column:
– dbpedia-owl:Place
– dbpedia-owl:City
– yago:WomenArtist
– yago:LivingPeople
– yago:NationalBasketballAssociationTeams
– dbpedia-owl:PopulatedPlace
– dbpedia-owl:Film
– …
Pairs to score:
[Chicago, dbpedia-owl:City] [Philadelphia, dbpedia-owl:City] [Houston, dbpedia-owl:City] …
[Chicago, dbpedia-owl:Film] [Philadelphia, dbpedia-owl:Film] …
E.g. processing the pair “Chicago, yago:NationalBasketballAssociationTeams”
String Chicago:
(R = 1) Chicago Bulls {yago:NationalBasketballAssociationTeams} [PR = 6]
(R = 2) Chicago {dbpedia-owl:PopulatedPlace, dbpedia-owl:City} [PR = 5]
(R = 3) Judy Chicago {yago:WomenArtist, yago:LivingPeople} [PR = 4]
Score = w x (1 / R) + (1 – w) x (Normalized Page Rank)
[Chicago, yago:NationalBasketballAssociationTeams] = (0.25 x 1 / 1) + (0.75 x 6 / 7) = 0.892
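The worked example on this slide can be reproduced with a few lines of Python. This is a sketch of the formula as stated, not the authors' code. The function name `class_score` is hypothetical, and the normalisation constant 7 is taken from the denominator in the slide's example (how it is derived is not stated).

```python
def class_score(rank: int, page_rank: float, w: float = 0.25,
                max_page_rank: float = 7.0) -> float:
    """Score = w * (1 / R) + (1 - w) * (normalized PageRank).

    max_page_rank = 7 is an assumption read off the slide's example (6 / 7).
    """
    return w * (1.0 / rank) + (1.0 - w) * (page_rank / max_page_rank)


# [Chicago, yago:NationalBasketballAssociationTeams]: matched at R = 1, PR = 6
nba_score = class_score(rank=1, page_rank=6)

# [Chicago, dbpedia-owl:City]: matched at R = 2, PR = 5
city_score = class_score(rank=2, page_rank=5)

print(f"{nba_score:.4f} {city_score:.4f}")
```

The first value evaluates to about 0.8929, which the slide reports truncated as 0.892. The slide does not say how pair scores are combined into a single score per class, so that step is left out here.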

19 T2LD Framework
– Predict Class for Columns
– Linking the table cells (current step)
– Identify and Discover relations

20 Machine Learning based Approach
Input: Table Cell + Column Header + Row Data + Column Type
– Requery KB with predicted class labels as additional evidence
– Generate a feature vector for the top N results of the query
– Classifier ranks the entities within the set of possible results
– Select the highest ranked entity
– A second classifier decides whether to link or not: link to the top ranked instance, or link to “NIL”

21 Learning to Rank
We trained an SVM rank classifier that learned to rank entities within a given set
Feature Vector
– Similarity Measures: Levenshtein distance, Dice Score
– Popularity Measures: Wikitology Score, PageRank, Page Length
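The two similarity features can be sketched in plain Python. The Levenshtein distance below is the standard dynamic-programming edit distance. The slide does not say which variant of the Dice score was used, so the character-bigram version here is an assumption, and both function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dice_score(a: str, b: str) -> float:
    """Dice coefficient over sets of lowercase character bigrams (assumed variant)."""
    def bigrams(s: str) -> set:
        s = s.lower()
        return {s[i:i + 2] for i in range(len(s) - 1)}

    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


# Cell string "Chicago" against the candidate entity label "Chicago Bulls"
distance = levenshtein("Chicago", "Chicago Bulls")
similarity = dice_score("Chicago", "Chicago Bulls")
print(distance, round(similarity, 3))
```

These values would sit next to the popularity measures (Wikitology score, PageRank, page length) in each candidate's feature vector.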

22 “To Link or not to Link …”
A second SVM classifier
Feature vector included the feature vector of the top ranked entity and two additional features:
– The SVM rank score of the top ranked entity
– The difference in scores between the top two ranked entities
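Building the input for this second classifier can be sketched as follows. The helper name `link_features` and the numeric values are hypothetical. Using 0.0 as the runner-up score when only one candidate exists is an assumption, since the slides do not cover that case.

```python
from typing import List, Sequence, Tuple


def link_features(ranked: Sequence[Tuple[Sequence[float], float]]) -> List[float]:
    """Build the link / no-link feature vector.

    `ranked` holds (feature_vector, rank_score) pairs sorted by rank_score,
    best first. The result is the top entity's features plus two extras:
    its rank score and its gap to the runner-up.
    """
    top_features, top_score = ranked[0]
    # Assumption: with a single candidate the runner-up score is taken as 0.0.
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    return list(top_features) + [top_score, top_score - runner_up_score]


# Illustrative numbers only: two ranked candidates with two features each.
features = link_features([([0.25, 0.75], 2.0), ([0.5, 0.125], 0.5)])
print(features)
```

A small gap between the top two scores signals an ambiguous match, which is presumably why it helps the classifier decide between linking and “NIL”.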

23 T2LD Framework
– Predict Class for Columns
– Linking the table cells
– Identify and Discover relations (current step)

24 Identify Relations
Name | Team | Relations found
Michael Jordan | Chicago | Rel ‘A’
Allen Iverson | Philadelphia | Rel ‘A’, ‘C’
Yao Ming | Houston | Rel ‘A’, ‘B’, ‘C’
Tim Duncan | San Antonio | Rel ‘A’, ‘B’

25 Relation between columns
Pairs:
– Michael Jordan - Chicago
– Allen Iverson - Philadelphia
– Yao Ming - Houston
Candidate relations: dbprop:team, dbprop:draftTeam

26 Scoring the relations
Pairs:
– Michael Jordan - Chicago
– Allen Iverson - Philadelphia
– Yao Ming - Houston
Relations matched for the pairs: dbprop:team, dbprop:team, dbprop:draftTeam, dbprop:team
Candidates: dbprop:team, dbprop:draftTeam
dbprop:team Score: 3
dbprop:draftTeam Score: 1 (starting from 0)
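The scoring shown here amounts to counting, for each candidate relation, how many row pairs it holds for. A minimal sketch with `collections.Counter` follows. The assignment of relations to individual pairs is an assumption chosen to match the scores on the slide (3 and 1), and `score_relations` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Relations found in the knowledge base for each (Name, Team) pair.
# Which pair carries dbprop:draftTeam is assumed for illustration.
PAIR_RELATIONS: Dict[Tuple[str, str], Set[str]] = {
    ("Michael Jordan", "Chicago"): {"dbprop:team"},
    ("Allen Iverson", "Philadelphia"): {"dbprop:team", "dbprop:draftTeam"},
    ("Yao Ming", "Houston"): {"dbprop:team"},
}


def score_relations(pair_relations: Dict[Tuple[str, str], Set[str]]) -> Counter:
    """Each pair casts one vote for every relation that links its two cells."""
    votes: Counter = Counter()
    for relations in pair_relations.values():
        votes.update(relations)
    return votes


scores = score_relations(PAIR_RELATIONS)
best_relation = scores.most_common(1)[0][0]
print(dict(scores), best_relation)
```

The relation with the most votes, here dbprop:team, is taken as the relation between the two columns.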

27 T2LD Framework
– Predict Class for Columns
– Linking the table cells
– Identify and Discover relations

28 Annotating web tables for the Semantic Web

29 Table as linked RDF
@prefix rdfs:.
@prefix dbpedia:.
@prefix dbpedia-owl:.
@prefix yago:.
"Name"@en is rdfs:label of dbpedia-owl:BasketballPlayer.
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams.
"Michael Jordan"@en is rdfs:label of dbpedia:Michael_Jordan.
dbpedia:Michael_Jordan a dbpedia-owl:BasketballPlayer.
"Chicago Bulls"@en is rdfs:label of dbpedia:Chicago_Bulls.
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams.
How to read the statements:
“Team”@en is rdfs:label of dbpedia-owl:Team.
– “Team” is the common / human name for the class dbpedia-owl:Team
dbpedia:Chicago_Bulls a yago:NationalBasketballAssociationTeams.
– dbpedia:Chicago_Bulls is of type (an instance of) yago:NationalBasketballAssociationTeams
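Once column classes and cell links are known, statements in the style shown on this slide can be produced with simple string formatting. The sketch below is illustrative only. `table_to_n3` and its two input dictionaries are hypothetical, prefix declarations are left out, and no RDF library is used.

```python
from typing import Dict, List, Tuple


def table_to_n3(column_classes: Dict[str, str],
                cell_links: Dict[str, List[Tuple[str, str]]]) -> List[str]:
    """Emit one label statement per header and a label + type pair per linked cell."""
    lines: List[str] = []
    for header, cls in column_classes.items():
        lines.append(f'"{header}"@en is rdfs:label of {cls}.')
    for header, links in cell_links.items():
        cls = column_classes[header]
        for label, resource in links:
            lines.append(f'"{label}"@en is rdfs:label of {resource}.')
            lines.append(f"{resource} a {cls}.")
    return lines


# Annotations taken from the slide's example output.
column_classes = {
    "Name": "dbpedia-owl:BasketballPlayer",
    "Team": "yago:NationalBasketballAssociationTeams",
}
cell_links = {
    "Name": [("Michael Jordan", "dbpedia:Michael_Jordan")],
    "Team": [("Chicago Bulls", "dbpedia:Chicago_Bulls")],
}

statements = table_to_n3(column_classes, cell_links)
print("\n".join(statements))
```

The output reproduces the six statements listed on the slide, in the same order.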

30 Results

31 Dataset summary
Number of tables | 15
Total number of rows | 199
Total number of columns | 56 (52) *
Total number of entities | 639 (611) *
* The number in brackets excludes columns that contained numbers

32 Dataset summary

33

34 Evaluation for class label predictions

35 Evaluation # 1 (MAP)
Compared the system’s ranked list of labels against a human ranked list of labels
Metric: Mean Average Precision (MAP)
Commonly used in the Information Retrieval domain to compare two ranked sets

36 Evaluation # 1 (MAP)
80.76 %
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
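For readers unfamiliar with the metric, average precision for the example lists on this slide can be computed as below. This uses the common textbook definition, treating the evaluator's labels as the relevant set and dividing by its size. Whether the authors used exactly this variant is not stated, and the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def average_precision(system: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant label appears."""
    hits = 0
    total = 0.0
    for k, label in enumerate(system, start=1):
        if label in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the per-column average precision over all columns."""
    values: List[float] = [average_precision(s, r) for s, r in runs]
    return sum(values) / len(values) if values else 0.0


system_ranked = ["Person", "Politician", "President"]
evaluator_ranked = {"President", "Politician", "OfficeHolder"}

ap = average_precision(system_ranked, evaluator_ranked)
print(round(ap, 4))
```

For this column the hits fall at ranks 2 and 3, giving (1/2 + 2/3) / 3.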

37 Evaluation # 2 (Recall)
Recall > 0.6 (75 %)
System Ranked: 1. Person 2. Politician 3. President
Evaluator Ranked: 1. President 2. Politician 3. OfficeHolder
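Recall for the same example lists is the share of evaluator labels that the system also returned. The sketch below assumes the plain set-based definition, and the function name `recall` is hypothetical.

```python
from typing import Sequence


def recall(system: Sequence[str], evaluator: Sequence[str]) -> float:
    """Fraction of the evaluator's labels that appear in the system's list."""
    if not evaluator:
        return 0.0
    return len(set(system) & set(evaluator)) / len(set(evaluator))


system_ranked = ["Person", "Politician", "President"]
evaluator_ranked = ["President", "Politician", "OfficeHolder"]

column_recall = recall(system_ranked, evaluator_ranked)
print(round(column_recall, 3), column_recall > 0.6)
```

Two of the three evaluator labels are recovered, so this column clears the 0.6 threshold shown on the slide.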

38 Evaluation # 3 (Correctness)
Evaluated whether our predicted class labels were “fair and correct”
Class label may not be the most accurate one, but may be correct.
– E.g. dbpedia-owl:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
Three human judges evaluated our predicted class labels

39 Evaluation # 3 (Correctness)
A category-wise breakdown for class label correctness
Overall Accuracy: 76.92 %
Examples:
– Column: Nationality; Prediction: MilitaryConflict
– Column: Birth Place; Prediction: PopulatedPlace

40 Evaluation for linking table cells to entities

41 Category-wise accuracy for linking table cells
Overall Accuracy: 66.12 %

42 Relation between columns
Idea: ask human evaluators to identify relations between columns in a given table
Pilot experiment: asked three evaluators to annotate five random tables from our dataset
Evaluators identified 20 relations
Our accuracy: 5 out of 20 (25 %) were correct

43 Conclusion and Future Work

44 Conclusion
We have demonstrated that it is possible to develop an automated framework for converting tables & spreadsheets to linked data
Extending and adapting this framework for Open government data
Discovery of new relations between entities

45 References
Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., Zhang, Y., 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1 (1), 538-549.
Barrasa, J., Corcho, O., Gomez-Perez, A., 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In: Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004). Vol. 3372. pp. 1069-1070.
Hu, W., Qu, Y., 2007. Discovering simple mappings between relational database schemas and ontologies. In: Aberer, K., Choi, K.-S., Noy, N. F., Allemang, D., Lee, K.-I., Nixon, L. J. B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudre-Mauroux, P. (eds.), ISWC/ASWC. Vol. 4825 of Lecture Notes in Computer Science. Springer, pp. 225-238.
Papapanagiotou, P., Katsiouli, P., Tsetsos, V., Anagnostopoulos, C., Hadjiefthymiades, S., 2006. Ronto: Relational to ontology schema matching. In: AIS SIGSEMIS Bulletin.

46 References
Lawrence, E. D. R., 2004. Composing mappings between schemas using a reference ontology. In: Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE). Springer, pp. 783-800.
Han, L., Finin, T., Parr, C., Sachs, J., Joshi, A., 2008. RDF123: from Spreadsheets to RDF. In: Seventh International Semantic Web Conference. Springer.
Han, L., Finin, T., Yesha, Y., 2009. Finding semantic web ontology terms from words. In: Proceedings of the Eighth International Semantic Web Conference. Springer.
Limaye, G., Sarawagi, S., Chakrabarti, S., 2010. Annotating and searching web tables using entities, types and relationships. In: Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

47 This work was supported by:

