T2LD – An automatic framework for extracting, interpreting and representing tables as linked data Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim.

Presentation transcript:

T2LD – An automatic framework for extracting, interpreting and representing tables as linked data Varish Mulwad Master’s Thesis Defense Advisor: Dr. Tim Finin June 29,

Contribution - Tables to Linked Data
– Link cell value to an entity
– Find relationships between columns
– Figure labels: …logy/PopulatedPlace, LargestCity

Contribution - Tables to Linked Data
@prefix dbpedia-owl: … .
"City" is rdfs:label of dbpedia-owl:City.
"State" is rdfs:label of dbpedia-owl:AdministrativeRegion.
"Baltimore" is rdfs:label of dbpedia:Baltimore.
dbpedia:Baltimore a dbpedia-owl:City.
…

A thousand reasons why it’s important …
1. Generate linked RDF for the Semantic Web
2. Enrich facts and knowledge that already exist on the Semantic Web
3. Add new facts and knowledge to the Semantic Web
4. Possible use in completing “incomplete tables”
5. Use in expanding the attributes / columns of a table
… and 995 other applications (or more) that will exploit this data

Overview
– Introduction
– Related Work & Motivation
– Tables to Linked Data
– Results
– Future Work
– Conclusion

Introduction

The World Wide Web …
[Figure: pages of unstructured text, some embedding a record such as "Talk: abc, By: xyz, Venue: some location"]
Introduction → Related Work → Tables to Linked Data → Results → Future Work → Conclusion

The World Wide Web …
Good for you and me … not so good for machines
[Images]

Web of Data – The Semantic Web
[Image]

Linked Data
The principles of Linked Data outline best practices for sharing and exposing structured data on the World Wide Web.
Every resource has a URI, e.g. Baltimore: …

Related Work and Motivation

Where is structured data?
Chicken? No – egg … No – chicken …
– More than a trillion documents on the Web
– ~14.1 billion tables, 154 million with high-quality relational data (Cafarella et al. 2008)

Automate the process
– We need systems that can generate data from existing sources
– It is not practical for humans to encode all this into RDF manually

In Databases and Web Systems …
– Understanding tables for data integration (Ziegler & Dittrich 2004), (Pantel, Philpot, & Hovy 2005)
– Learning to index tables to improve the search experience (Cafarella et al. 2008)
– Expanding attributes (columns) of web tables (Lin et al. 2010)

On the Semantic Web
– Database-to-ontology mapping (Barrasa, Corcho, & Gómez-Pérez 2004), (Hu & Qu 2007), (Papapanagiotou et al. 2006), and (Lawrence 2004)
– W3C working group RDB2RDF: first working draft, June 8, 2010

On the Semantic Web: mapping spreadsheets to RDF
– Systems like RDF123 (Han et al. 2008) allow users to convert spreadsheets to RDF
– Such systems are practical and helpful, but they
  – require significant manual work
  – do not generate linked data

On the Semantic Web
– Han et al. (2009) addressed the problem of recommending a set of terms to describe the objects and relationships in a table
– They did not focus on the overall interpretation of a table
– They did not attempt to understand and link cell values

An overall interpretation
@prefix dbpedia-owl: … .
"City" is rdfs:label of dbpedia-owl:City.
"State" is rdfs:label of dbpedia-owl:AdministrativeRegion.
"Baltimore" is rdfs:label of dbpedia:Baltimore.
dbpedia:Baltimore a dbpedia-owl:City.
…

Tables to Linked Data

T2LD Framework
Input: Table Headers and Rows
1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations
Output: Linked Data Representation of a Table

An overview
Input: Table Headers and Rows
1. Query the knowledge base
2. Predict class for columns
3. Re-query the knowledge base using the new evidence
4. Link each cell value to an entity using the new results obtained
5. Identify relationships between columns
Output: Linked Data

T2LD Framework
Input: Table Headers and Rows
1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations
Output: Linked Data Representation of a Table

Querying the Knowledge-Base
– For every cell in the column, the query is: cell value + column header + row content
– Knowledge bases: Wikitology, Yago
– Result: top N entities, their types, and Google PageRank (we use N = 5)
– Example column: City (type) with cells Baltimore, Boston, New York (instances)

Querying the Knowledge-Base
Column: City. Results per cell, each returned with its types and PageRank:
– Baltimore: 1. Baltimore 2. Baltimore County, Maryland 3. John Baltimore
– Boston: 1. Boston 2. Boston_(band) 3. Boston_University
– New York: 1. New_York_City 2. New_York 3. New_York_(album)

Set of Classes
– Types for Baltimore: {dbpedia-owl:Place, dbpedia-owl:Area}
– Types for Baltimore County: {dbpedia-owl:Place, dbpedia-owl:Area}
– Types for John Baltimore: {yago:AmericanConductors, yago:LivingPeople}
– Types for Boston: {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace}
– Types for Boston_band: {dbpedia-owl:Band, dbpedia-owl:Organisation}
– ...
Set of classes for a column: {dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, ... }
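The column's class set is the union of the type sets of every candidate entity returned for every cell. A minimal Python sketch of that step follows; the dictionary layout and the name `column_classes` are illustrative assumptions, not the thesis implementation.

```python
from typing import Dict, Set

# Hypothetical query results: cell string -> {candidate entity -> its types}.
results: Dict[str, Dict[str, Set[str]]] = {
    "Baltimore": {
        "Baltimore": {"dbpedia-owl:Place", "dbpedia-owl:Area"},
        "Baltimore County": {"dbpedia-owl:Place", "dbpedia-owl:Area"},
        "John Baltimore": {"yago:AmericanConductors", "yago:LivingPeople"},
    },
    "Boston": {
        "Boston": {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace"},
        "Boston_band": {"dbpedia-owl:Band", "dbpedia-owl:Organisation"},
    },
}


def column_classes(cell_results: Dict[str, Dict[str, Set[str]]]) -> Set[str]:
    """Union of all types seen for any candidate entity of any cell."""
    classes: Set[str] = set()
    for candidates in cell_results.values():
        for types in candidates.values():
            classes |= types
    return classes


classes = column_classes(results)
```

Every class in this set then becomes a candidate label for the column, to be ranked in the next step.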

Ranking the Classes
Create a pairing of all the class labels and strings in a column:
[Baltimore, dbpedia-owl:Place]
[Boston, dbpedia-owl:Place]
[New York, dbpedia-owl:Place]
[Baltimore, dbpedia-owl:PopulatedPlace]
[Boston, dbpedia-owl:PopulatedPlace]
…
[Baltimore, dbpedia-owl:Band]
…

Ranking the Classes
Assign a score to every pair based on:
– the rank R of the entity that matches the class label
– the predicted Google PageRank
We use the following formula:
– Score = w × (1 / R) + (1 – w) × (normalized Google PageRank)
– We use w = 0.25

Ranking the Classes
Score = w × (1 / R) + (1 – w) × (normalized PageRank)
Results for the string Baltimore:
– (R = 1) Baltimore {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 6]
– (R = 2) Baltimore County {dbpedia-owl:Place, dbpedia-owl:Area} [PR = 4]
– (R = 3) John Baltimore {yago:AmericanConductors, yago:LivingPeople} [PR = 5]
E.g. processing class “dbpedia:Area”: [Baltimore, dbpedia:Area] = (0.25 × 1 / 1) + (0.75 × 6 / 7) ≈ 0.89
E.g. processing class “dbpedia:Band”: [Baltimore, dbpedia:Band] = 0, since the class does not match any of the entities for Baltimore
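The pair score can be sketched in Python as below. Two points are assumptions rather than statements from the slides: the score is taken from the highest-ranked entity whose types contain the class, and PageRank is normalized by 7, as the "6 / 7" in the worked example suggests. The names `Candidate` and `pair_score` are hypothetical.

```python
from typing import List, NamedTuple, Set


class Candidate(NamedTuple):
    name: str
    types: Set[str]
    page_rank: int


W = 0.25           # weight given on the slides
MAX_PAGE_RANK = 7  # assumption: normalizer implied by "6 / 7" in the example


def pair_score(candidates: List[Candidate], class_label: str, w: float = W) -> float:
    """Score a [string, class] pair from the ranked candidates for the string."""
    for rank, cand in enumerate(candidates, start=1):
        if class_label in cand.types:
            # First (highest-ranked) entity carrying this class decides the score.
            return w * (1 / rank) + (1 - w) * (cand.page_rank / MAX_PAGE_RANK)
    return 0.0  # the class matches none of the entities for this string


baltimore = [
    Candidate("Baltimore", {"dbpedia-owl:Place", "dbpedia-owl:Area"}, 6),
    Candidate("Baltimore County", {"dbpedia-owl:Place", "dbpedia-owl:Area"}, 4),
    Candidate("John Baltimore", {"yago:AmericanConductors", "yago:LivingPeople"}, 5),
]

area_score = pair_score(baltimore, "dbpedia-owl:Area")
band_score = pair_score(baltimore, "dbpedia-owl:Band")
```

With w = 0.25 the PageRank term dominates, so a popular entity at rank 2 can outscore an obscure one at rank 1.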

Predicting the Classes
Select the class that maximizes the sum of its scores over the entire column. E.g.:
– Sum for dbpedia:Area = [Baltimore, dbpedia:Area] + [Boston, dbpedia:Area] + [New York, dbpedia:Area] = 2.85
– Sum for dbpedia:Band = [Baltimore, dbpedia:Band] + [Boston, dbpedia:Band] + [New York, dbpedia:Band] = 0.25
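Selecting the column class amounts to summing pair scores per class and taking the maximum. In the sketch below the per-cell scores are made-up values chosen only so that the column totals match the 2.85 and 0.25 quoted above; `predict_class` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Tuple


def predict_class(pair_scores: Dict[Tuple[str, str], float]) -> str:
    """Return the class whose scores, summed over all cells, are largest."""
    totals: Dict[str, float] = defaultdict(float)
    for (_cell, class_label), score in pair_scores.items():
        totals[class_label] += score
    return max(totals, key=lambda label: totals[label])


# Illustrative per-cell scores (assumed, not from the slides).
scores = {
    ("Baltimore", "dbpedia:Area"): 0.95,
    ("Boston", "dbpedia:Area"): 0.95,
    ("New York", "dbpedia:Area"): 0.95,
    ("Baltimore", "dbpedia:Band"): 0.0,
    ("Boston", "dbpedia:Band"): 0.25,
    ("New York", "dbpedia:Band"): 0.0,
}

best = predict_class(scores)
```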

Predicting the Classes
We predict classes from four vocabularies: DBpedia Ontology, Freebase, WordNet and Yago
[City, dbpedia:Area] = 1
[City, dbpedia:PopulatedPlace] = 0.9
[City, dbpedia:Band] = 0.2
[City, yago:LivingPeople] = 0.23

The underlying query process …

Mapping Table to Wikipedia
State    | Capital City | Largest City | Governor
Maryland | Annapolis    | Baltimore    | Martin O’Malley

Mapping Table to Wikipedia
State    | Capital City | Largest City | Governor
Maryland | Annapolis    | Baltimore    | Martin O’Malley
Figure labels: Types, Linked Concepts, Property Values

Summary of the Query
[Figure]

Extracting Types from DBpedia
– Types for Annapolis are retrieved with a SPARQL query
– The query covers redirects too, to avoid disparity between KBs
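The slide's query text did not survive, so the following is only a plausible sketch of a type lookup that also follows redirects. The predicate `dbo:wikiPageRedirects`, the resource URI and the helper name `types_query` are assumptions; the code builds the query string and does not contact any endpoint.

```python
def types_query(resource_uri: str) -> str:
    """Build a SPARQL query for the types of a resource or its redirect target."""
    return f"""
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?type WHERE {{
  {{ <{resource_uri}> rdf:type ?type . }}
  UNION
  {{ <{resource_uri}> dbo:wikiPageRedirects ?target .
     ?target rdf:type ?type . }}
}}
""".strip()


# Illustrative URI; the actual resource name used in the thesis is not given.
query = types_query("http://dbpedia.org/resource/Annapolis")
```

Following the redirect matters when one knowledge base names an entity by a page that the other treats as a redirect.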

T2LD Framework
Input: Table Headers and Rows
1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations
Output: Linked Data Representation of a Table

Approach
1. Query: table cell + column header + row data + column type
2. Re-query the KB with the predicted class labels as additional evidence
3. Generate a feature vector for the top N results of the query
4. A classifier ranks the entities within the set of possible results
5. Select the highest ranked entity
6. A classifier decides whether to link or not: link to the top ranked instance, or link to “NIL”

Re-querying KB
– Predicted class labels are used as “additional evidence”, e.g. WordNet:City, Yago:CitiesinUnitedStates, Freebase:Location
– Class labels are mapped to the typesRef field
– This restricts the types of the results returned to the predicted class labels

Summary of the re-query
[Figure]

Learning to Rank
We trained an SVM rank classifier that learned to rank entities within a given set
Feature vector (similarity measures and popularity measures): Levenshtein distance, Dice score, Wikitology score, PageRank, page length
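The two string-similarity features can be computed as in the sketch below. The slides do not say whether the Dice score was taken over words or characters, so the word-set version here is an assumption, as are the function names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dice(a: str, b: str) -> float:
    """Dice coefficient over lower-cased word sets (assumed tokenization)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))


d = levenshtein("Baltimore", "Baltimore County")
s = dice("New York", "New York City")
```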

“To Link or not to Link …”
The highest ranked entity may not be the correct one to link to:
– the string we are querying may not be in the KB
– the top N results may not include the correct answer
We trained an SVM classifier to determine whether to link to the top ranked entity or not

“To Link or not to Link …”
The feature vector comprised the feature vector of the top ranked entity plus two additional features:
– the SVM rank score of the top ranked entity
– the difference in scores between the top two ranked entities
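Assembling the input for the link / no-link classifier is a small concatenation, sketched here. The function name and the fallback of 0.0 when only one result exists are assumptions.

```python
from typing import List, Sequence


def link_features(top_entity_features: Sequence[float],
                  ranked_scores: Sequence[float]) -> List[float]:
    """Top entity's own features + its rank score + its margin over the runner-up."""
    top = ranked_scores[0]
    second = ranked_scores[1] if len(ranked_scores) > 1 else 0.0  # assumed fallback
    return list(top_entity_features) + [top, top - second]


features = link_features([1.0, 0.8], [2.5, 1.0, 0.3])
```

A small margin between the top two scores signals an ambiguous match, which is the kind of case where linking to “NIL” may be safer.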

T2LD Framework
Input: Table Headers and Rows
1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations
Output: Linked Data Representation of a Table

Relation between columns
City      | State
Baltimore | Maryland
Boston    | Massachusetts
New York  | New York

Relation between columns
Row pairs (State - City): Maryland - Baltimore; Massachusetts - Boston; New York - New York
Candidate relations: dbonto:LargestCity, dbonto:Capital

Scoring the relations
Row pairs (State - City): Maryland - Baltimore; Massachusetts - Boston; New York - New York
Candidates: dbonto:Capital, dbonto:LargestCity
Scores: dbonto:Capital = 1, dbonto:LargestCity = 3
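The scores shown are consistent with simple voting: a candidate relation earns one point for each row pair it holds between. That reading is an assumption, and the per-row relation sets below are illustrative values chosen to reproduce the slide's totals.

```python
from collections import Counter
from typing import Iterable, Set


def score_relations(relations_per_row: Iterable[Set[str]]) -> Counter:
    """Count, for each relation, the number of row pairs it was found between."""
    counts: Counter = Counter()
    for relations in relations_per_row:
        counts.update(relations)  # a set, so each relation counts once per row
    return counts


rows = [
    {"dbonto:LargestCity"},                    # Maryland - Baltimore
    {"dbonto:LargestCity", "dbonto:Capital"},  # Massachusetts - Boston
    {"dbonto:LargestCity"},                    # New York - New York
]

scores = score_relations(rows)
best_relation = scores.most_common(1)[0][0]
```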

Relation between columns
Select * where { <subject URI> ?relation <object URI> }
Select * where { <subject URI> ?relation "object string" }
– Query the second column both as a URI and as a literal string
– Check all redirects when querying with a URI
– Check all other common names when querying with a literal string
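The two query forms, object as URI and object as literal, might be generated as follows. The URIs, the `@en` language tag and the name `relation_queries` are assumptions, since the slide's original query text was not preserved.

```python
from typing import Tuple


def relation_queries(subject_uri: str, object_uri: str, object_label: str) -> Tuple[str, str]:
    """Return (query with the object as a URI, query with the object as a literal)."""
    by_uri = f"SELECT * WHERE {{ <{subject_uri}> ?relation <{object_uri}> }}"
    by_literal = f'SELECT * WHERE {{ <{subject_uri}> ?relation "{object_label}"@en }}'
    return by_uri, by_literal


by_uri, by_literal = relation_queries(
    "http://dbpedia.org/resource/Maryland",   # illustrative subject
    "http://dbpedia.org/resource/Baltimore",  # illustrative object
    "Baltimore",
)
```

Running both forms covers knowledge bases that store the related value as a resource as well as those that store it as a plain string.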

T2LD Framework
Input: Table Headers and Rows
1. Predict Class for Columns
2. Linking the table cells
3. Identify and Discover relations
Output: Linked Data Representation of a Table

An …
@prefix dbpprop: … .
"City" is rdfs:label of dbpedia-owl:City.
"State" is rdfs:label of dbpedia-owl:AdministrativeRegion.
"Baltimore" is rdfs:label of dbpedia:Baltimore.
dbpedia:Baltimore a dbpedia-owl:City.
"Maryland" is rdfs:label of dbpedia:Maryland.
dbpedia:Maryland a dbpedia-owl:AdministrativeRegion.
dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion.
dbpprop:LargestCity rdfs:range dbpedia-owl:City.
Reading the triples:
– "City" is rdfs:label of dbpedia-owl:City. “City” is the common / human name for the class dbpedia-owl:City
– dbpedia:Baltimore a dbpedia-owl:City. dbpedia:Baltimore is an instance of the type dbpedia-owl:City
– dbpprop:LargestCity rdfs:domain dbpedia-owl:AdministrativeRegion. The subjects of the triples using the property have to be instances of dbpedia-owl:AdministrativeRegion
– dbpprop:LargestCity rdfs:range dbpedia-owl:City. The objects of the triples using the property have to be instances of dbpedia-owl:City

@prefix rdfs: … .
"ColumnHeader1" is rdfs:label of PredictedClassLabel1.
"ColumnHeader2" is rdfs:label of PredictedClassLabel2.
"TableCellString" is rdfs:label of CellValueURL.
CellValueURL a PredictedClassLabel.
property rdfs:domain PredictedClassLabel1.
property rdfs:range PredictedClassLabel2.
Where:
– ColumnHeader is a column header from the table
– TableCellString is a string representing a table cell
– PredictedClassLabel is the class label associated with the column
– CellValueURL is the DBpedia URL the table cell string is linked to
– property is the relation discovered between the two columns
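Filling the template is mechanical once classes, links and the relation are known. The sketch below renders the statements for two columns; the function name `render`, its argument layout and the example values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def render(headers: Sequence[str],
           classes: Sequence[str],
           cells: Sequence[Tuple[str, str, str]],
           prop: str) -> List[str]:
    """Emit template statements; cells are (cell string, linked URL, class label)."""
    lines = [f'"{header}" is rdfs:label of {cls}.'
             for header, cls in zip(headers, classes)]
    for text, url, cls in cells:
        lines.append(f'"{text}" is rdfs:label of {url}.')
        lines.append(f"{url} a {cls}.")
    # Column 1's class is the domain of the relation, column 2's class its range.
    lines.append(f"{prop} rdfs:domain {classes[0]}.")
    lines.append(f"{prop} rdfs:range {classes[1]}.")
    return lines


output = render(
    headers=["State", "City"],
    classes=["dbpedia-owl:AdministrativeRegion", "dbpedia-owl:City"],
    cells=[("Maryland", "dbpedia:Maryland", "dbpedia-owl:AdministrativeRegion"),
           ("Baltimore", "dbpedia:Baltimore", "dbpedia-owl:City")],
    prop="dbpprop:LargestCity",
)
```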

Results

Dataset summary
Number of tables: 15
Total number of rows: 199
Total number of columns: 56 (52)*
Total number of entities: 639 (611)*
* The number in brackets excludes columns that contained numbers

Evaluation for class label predictions

Evaluation #1 (MAP)
– Compared the system’s ranked list of labels against a human-ranked list of labels
– Metric: Mean Average Precision (MAP), commonly used in Information Retrieval to compare two ranked sets

Evaluation #1 (MAP)
MAP is defined as [equation image], where:
– R(n) is the relevance at n: 1 if the class label ranked n in the system-generated set is a relevant one, else 0
– P(n) is the precision at n: it measures the relevance of the top n results
– N is the number of labels retrieved; for our evaluation we consider the top 3 labels retrieved
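Since the formula itself is missing from the transcript, the sketch below uses the standard information-retrieval form, the sum of P(n) × R(n) over the top N labels divided by the number of relevant labels, averaged over columns. That denominator is an assumption, and the example label lists are the ones used in the evaluation slides.

```python
from typing import Iterable, List, Set, Tuple


def average_precision(system_ranked: List[str], relevant: Set[str], n: int = 3) -> float:
    """Sum of P(k) * R(k) over the top n labels, divided by the number of relevant labels."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for k, label in enumerate(system_ranked[:n], start=1):
        if label in relevant:   # R(k) = 1
            hits += 1
            total += hits / k   # P(k): fraction of the top k that is relevant
    return total / len(relevant)


def mean_average_precision(columns: Iterable[Tuple[List[str], Set[str]]]) -> float:
    """Mean of the per-column average precision values."""
    values = [average_precision(ranked, rel) for ranked, rel in columns]
    return sum(values) / len(values) if values else 0.0


ap = average_precision(
    ["Person", "Politician", "President"],
    {"President", "Politician", "OfficeHolder"},
)
```

Here the relevant labels appear at ranks 2 and 3, contributing 1/2 and 2/3, so the column's average precision under this definition is 7/18.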

Evaluation #1 (MAP): … %
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder

Evaluation #2 (Recall)
– Checked whether the system was retrieving relevant class labels
– Measure used: Recall (R)
– The top three labels ranked by the user were considered to be relevant

Evaluation #2 (Recall)
Recall > 0.6 (75 %)
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
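For a single column, recall is the fraction of the evaluator's relevant labels that the system retrieved. A short sketch using the example lists above follows; the function name is hypothetical.

```python
from typing import List, Set


def recall(system_labels: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant labels that appear among the system's labels."""
    if not relevant:
        return 0.0
    return len(set(system_labels) & relevant) / len(relevant)


r = recall(
    ["Person", "Politician", "President"],
    {"President", "Politician", "OfficeHolder"},
)
```

Two of the three relevant labels (Politician, President) are retrieved, so this column has a recall of 2/3 and clears the 0.6 threshold.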

Evaluation #3 (Rank Match)
A comparison of how often the top three system-generated labels match the top three labels ranked by the users

Evaluation #4 (Correctness)
– Evaluated whether our predicted class labels were “fair and correct”
– A class label may not be the most accurate one, but may still be correct, e.g. dbpedia:PopulatedPlace is not the most accurate label for a column of cities, but it is still a correct one
– Three human judges evaluated our predicted class labels

Evaluation #4 (Correctness)
A category-wise breakdown of class label correctness
Overall accuracy: … %
Examples: column Nationality, prediction MilitaryConflict; column Birth Place, prediction PopulatedPlace

Summary – Class label prediction
– Recall and class label correctness show that our approach produces relevant and correct labels
– MAP and Rank Match show that we enjoyed moderate success in ranking labels within a set

Evaluation for linking table cells to entities

Category-wise accuracy for linking table cells
Overall accuracy: … %

SVM rank classifier
Training data: 171 queries (each with 10 results). The correct entity was assigned the highest rank and all others were assigned the same lower rank.
Test data: 222 queries (each with 10 results)
                                          | First table | Second table
Correctly predicted top ranked instance   | 215         | 543
Incorrectly predicted top ranked instance | 7           | 68
Total number of instances                 | 222         | 611
Accuracy                                  | 96.84 %     | 88.87 %

The binary SVM classifier
Training data: 222 queries (146 +ve, 76 –ve examples). If the highest ranked instance was correct, a class label of “yes” was assigned.
Test data: 171 queries (119 +ve, 52 –ve examples)
                          | First table | Second table
Correctly predicted       | 145         | 541
Incorrectly predicted     | 26          | 70
Total number of instances | 171         | 611
Accuracy                  | 84.79 %     | 88.54 %

Evaluation for relation between columns

Relation between columns
– Idea: ask human evaluators to identify relations between columns in a given table
– Pilot experiment: asked three evaluators to annotate five random tables from our dataset
– Evaluators identified 20 relations
– Our accuracy: 5 out of 20 (25 %) were correct

Future Work

Future Work
– Implement a machine learning based approach for class label predictions for columns
– Alternative approach for relation discovery and identification
– Approaches to handle unknown entities

Conclusion

Conclusion
– There is a lot of data stored in HTML tables, spreadsheets, databases and documents
– We presented an automated framework that extracts, interprets and represents tables as linked data
– We are unlocking large amounts of tabular data that are currently inaccessible to the Semantic Web, and making them meaningful and useful there
– We believe our work will contribute to materializing the web of data vision

References
Cafarella, M. J., Halevy, A., Wang, D. Z., Wu, E., and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. Proc. VLDB Endow. 1(1).
Ziegler, P., and Dittrich, K. R. 2004. Three decades of data integration: all problems solved? In Building the Information Society, volume 156 of IFIP International Federation for Information Processing, 3–12. Springer Boston.
Pantel, P., Philpot, A., and Hovy, E. 2005. Aligning database columns using mutual information. In Proceedings of the 2005 National Conference on Digital Government Research (dg.o '05).
Lin, C. X., Zhao, B., Weninger, T., Han, J., and Liu, B. 2010. Entity relation discovery from web tables and links. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, NY, USA.
Barrasa, J., Corcho, O., and Gómez-Pérez, A. 2004. R2O, an extensible and semantically based database-to-ontology mapping language. In Proceedings of the 2nd Workshop on Semantic Web and Databases (SWDB2004).
Hu, W., and Qu, Y. 2007. Discovering simple mappings between relational database schemas and ontologies. In Aberer, K.; Choi, K.-S.; Noy, N. F.; Allemang, D.; Lee, K.-I.; Nixon, L. J. B.; Golbeck, J.; Mika, P.; Maynard, D.; Mizoguchi, R.; Schreiber, G.; and Cudre-Mauroux, P., eds., ISWC/ASWC, volume 4825 of Lecture Notes in Computer Science. Springer.
Papapanagiotou, P., Katsiouli, P., Tsetsos, V., Anagnostopoulos, C., and Hadjiefthymiades, S. 2006. Ronto: Relational to ontology schema matching. In AIS SIGSEMIS Bulletin.
Lawrence, E. D. R. 2004. Composing mappings between schemas using a reference ontology. In Proceedings of International Conference on Ontologies, Databases and Application of Semantics (ODBASE). Springer.
Han, L., Finin, T., Parr, C., Sachs, J., and Joshi, A. 2008. RDF123: from Spreadsheets to RDF. In Seventh International Semantic Web Conference. Springer.
Han, L., Finin, T., and Yesha, Y. 2009. Finding semantic web ontology terms from words. In Proceedings of the Eighth International Semantic Web Conference. Springer.

Discussion