Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi KIRD, KAIST NLP&DBPEDIA.

Presentation transcript:

Semi-Automatic Quality Assessment of Linked Data without Requiring Ontology Saemi Jang, Megawati, Jiyeon Choi, and Mun Yong Yi KIRD, KAIST NLP&DBPEDIA 2015 WORKSHOP

Motivation
DBpedia extracts structured information from Wikipedia.
Example: Wikipedia page on Pope Saint Felix III
  dbpedia:Pope_Felix_III dbo:birthPlace dbpedia:Rome
  dbpedia:Pope_Felix_III dbo:deathPlace dbpedia:Odoacer

Motivation
Errors in DBpedia
  Incorrect data: type, datatype, value
  Ambiguity: URI, property
The quality of the data has become important.
Example:
  dbpedia:Pope_Felix_III dbo:birthPlace dbpedia:Rome (rdf:type dbo:Place)
  dbpedia:Pope_Felix_III dbo:deathPlace dbpedia:Odoacer (rdf:type dbo:Person): Error

Motivation
Data Quality Assessment
  Existing tools: TripleCheckMate [3], LinkQA [6], WIQA [7], DaCura [8]
  They are based on an ontology that is built from the target data (e.g. DBpedia)
But
  This is not feasible for data that has no ontology
  Ontology generation is difficult and time-consuming work
  Automatic ontology generation works only for English and limited domains

Introduction
Goal
  Quality assessment of linked data without requiring an ontology
Idea
  A large portion of the data in a knowledge resource is valid
  Analyze the data patterns in the resource and take the patterns that appear frequently
  Evaluate the quality of the data based on these patterns

Overview of approach (figure)

Quality Assessment Criteria
Data Quality Test Pattern (DQTP)
  DQTP = tuple (V, S), where V is a set of typed pattern variables and S is a SPARQL query template
RDF triples (subject, predicate, object)
  Domain: all possible types that the subject can have
  Range: all possible types that the object can have
  Literal values: must have the data type determined by the property used
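The DQTP definition above can be pictured with a minimal Python sketch. It is an illustration only: the class name `DQTP`, the variable names `P1` and `T1`, and the SPARQL text are assumptions made for this example and are not taken from the slides or from RDFUnit.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class DQTP:
    """A data quality test pattern: typed variables V plus a SPARQL template S."""
    variables: dict[str, str]   # variable name -> kind, e.g. "P1" -> "property"
    sparql_template: str        # template text with $NAME placeholders

    def instantiate(self, **bindings: str) -> str:
        """Bind every pattern variable and return a concrete test-case query."""
        missing = set(self.variables) - set(bindings)
        if missing:
            raise ValueError(f"unbound pattern variables: {sorted(missing)}")
        return Template(self.sparql_template).substitute(bindings)


# Hypothetical range pattern: objects of property P1 that lack the expected type T1.
RANGE_PATTERN = DQTP(
    variables={"P1": "property", "T1": "class"},
    sparql_template=(
        "SELECT DISTINCT ?s ?o WHERE { "
        "?s $P1 ?o . "
        "FILTER NOT EXISTS { ?o a $T1 } }"
    ),
)

query = RANGE_PATTERN.instantiate(P1="dbo:deathPlace", T1="dbo:Place")
print(query)
```

Binding the variables turns the pattern into one concrete test case; every row the query returns would be a candidate range violation.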

Test Case Pattern Generation Algorithm
  STEP 1: Check the patterns in the knowledge resource
  STEP 2: Compute the appearance ratio of each pattern
  STEP 3: Select the top k patterns and compute their ratios
  STEP 4: Set the threshold (average of the top k ratios)
  STEP 5: Build the test case pattern
Knowledge resource (sample rows)
  Property        Object               Object type
  dbo:occupation  dbr:Freddie_Mercury  foaf:Person
                  dbr:Michael_Jackson  dbo:Person
                  dbr:Alfred_Nobel     foaf:Person
                  dbr:Alfred_Nobel     dbo:Agent
  dbo:Artist      dbr:Freddie_Mercury  dbo:Person
                  dbr:Michael_Jackson  dbo:Person
                  dbr:Alfred_Nobel     foaf:Person
                  dbr:Alfred_Nobel     dbo:Agent
  dbo:deathPlace  dbr:London           schema:Place
                  dbr:Chicago          dbo:Place
                  dbr:Paris            dbo:Wikidata:Q532
                  dbr:Seoul            dbo:Place
Example: range pattern (dbo:deathPlace)
  Property        Top 5 types (ratio values are not preserved)
  dbo:occupation  dbo:Place, schema:Place, dbo:Wikidata:Q, dbo:PopulatedPlace, dbo:Settlement
  dbo:deathPlace  dbo:Place, schema:Place, dbo:Wikidata:Q, dbo:PopulatedPlace, dbo:Settlement
  Threshold = average of the top 5 ratios (e.g. 17%)
Test case pattern
  Property        Pattern type
  dbo:deathPlace  dbo:Place, schema:Place, dbo:Wikidata:Q532
  dbo:birthPlace  dbo:Place, schema:Place, dbo:Wikidata:Q532
  dbo:spouse      dbo:Person, foaf:Person, schema:Person
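The five steps above can be sketched in Python for a single property. This is a rough reading under stated assumptions, not the authors' implementation: the appearance ratio is taken to be the fraction of triples whose object carries a given type, a type is kept when its ratio is at or above the threshold, and the sample data and the names `build_range_pattern` and `violates` are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def build_range_pattern(object_types, k=5):
    """Build a range pattern from the object types observed for one property.

    object_types holds one list of rdf:type values per triple's object.
    """
    if not object_types:
        raise ValueError("no triples to analyze")
    total = len(object_types)
    # STEP 1: count how many triples carry each object type.
    counts = Counter(t for types in object_types for t in set(types))
    # STEP 2: appearance ratio per type (assumed: share of triples with that type).
    ratios = {t: Fraction(c, total) for t, c in counts.items()}
    # STEP 3: keep the k most frequent types (ties broken by name).
    top_k = sorted(ratios.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    # STEP 4: the threshold is the average ratio of those top-k types.
    threshold = sum(r for _, r in top_k) / len(top_k)
    # STEP 5: the pattern keeps the types at or above the threshold (assumed rule).
    pattern = [t for t, r in top_k if r >= threshold]
    return pattern, threshold


def violates(types, pattern):
    """A triple fails the test case when its object has none of the pattern types."""
    return not set(types) & set(pattern)


# Hypothetical object types for five dbo:deathPlace triples.
death_place_objects = [
    ["dbo:Place", "schema:Place"],
    ["dbo:Place", "schema:Place"],
    ["dbo:Place", "dbo:Wikidata:Q532"],
    ["dbo:Place"],
    ["dbo:Person"],  # e.g. an object such as dbpedia:Odoacer
]

pattern, threshold = build_range_pattern(death_place_objects, k=5)
print(pattern, float(threshold))
print(violates(["dbo:Person"], pattern))
```

Exact fractions are used so that a type whose ratio equals the threshold is not dropped by floating-point rounding.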

Evaluation of approach
1) Test Case Pattern Generation
  Compare the patterns generated by the approach with the benchmark patterns
  – Approach: generates patterns without using an ontology
  – Benchmark: generates patterns using an ontology
2) Quality Assessment Accuracy
  Evaluate a localized DBpedia that does not have an ontology

Validation 1) Test Case Pattern Generation
Ground truth
  RDFUnit [4], which compiled a library of data quality test case patterns for quality assessment
  Ontology of English DBpedia
Definition of Test Case Patterns
  Approach: Domain Quality Pattern (DQP)
  RDFUnit: RDFSDOMAIN
  Definition: The attribution of a resource's property (with a certain value) is only valid if the resource is of a certain type.

  Approach: Range Quality Pattern (RQP)
  RDFUnit: RDFSRANGE
  Definition: The attribution of a resource's property is only valid if the value is of a certain type.

  Approach: Datatype Quality Pattern (TQP)
  RDFUnit: RDFSRANGED
  Definition: The attribution of a resource's property is only valid if the literal value has a certain datatype.

Validation 1) Test Case Pattern Generation
Data
  DBpedia 2015 (dbo, dbp)
Test Case Pattern Generation
  The average ratio of the top 5 types is 22% for DQP and 17% for RQP
  For TQP, most of the triples have a single data pattern
  The approach generates patterns from the triples in DBpedia, whereas RDFUnit uses the ontology
Table (columns: Property, DQP, RQP, TQP; row: English DBpedia; values are not preserved)
Example patterns for dbo:deathPlace
  DQP: dbo:Agent, dbo:Person
  RQP: dbo:Place, dbo:PopulatedPlace, dbo:Wikidata:Q

Validation 1) Test Case Pattern Generation
Results for DQP, RQP, and TQP (figure), reported with the following quantities:
  Total number of patterns with benchmark
  Total number of generated patterns with approach
  Total number of consistent patterns with approach
  A: pattern generation rate
  B: pattern generation accuracy of approach
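The transcript keeps the names of the quantities but not the formulas for A and B. One plausible reading, offered here as an assumption rather than as the authors' definition, is that the generation rate compares the generated patterns against the benchmark and that the accuracy is the share of generated patterns that are consistent with the benchmark. The symbols below are introduced only for this sketch.

```latex
% Assumed definitions; N_bench, N_gen and N_cons are hypothetical symbols.
% N_bench: total number of patterns with benchmark
% N_gen:   total number of generated patterns with approach
% N_cons:  total number of consistent patterns with approach
\[
A = \frac{N_{\mathrm{gen}}}{N_{\mathrm{bench}}},
\qquad
B = \frac{N_{\mathrm{cons}}}{N_{\mathrm{gen}}}
\]
```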

Validation 1) Test Case Pattern Generation
Results for DQP, RQP, and TQP (figure)
For TQP, the generated patterns are equivalent in meaning to those of RDFUnit, but they come from different resources, e.g. rdf:langString, xsd:string

Validation 2) Quality Assessment Accuracy
How to validate the quality assessment accuracy?
  The approach is able to handle a localized DBpedia and evaluate the quality of its data
  The localized versions of DBpedia in 125 languages do not have their own ontologies
  Most of the labels in the DBpedia Ontology are English labels

Validation 2) Quality Assessment Accuracy
Data
  Localized version of DBpedia (Korean DBpedia 2015)
  32 million triples with different properties
  1070 localized properties that are each carried by more than 100 triples
Test Case Pattern Generation
  The average ratio of the top 5 types is 18% for DQP and 16% for RQP
  For TQP, most of the triples have a single data pattern, covering not only the datatype but also the language tag
Table (columns: Property, DQP, RQP, TQP; row: Korean DBpedia; values are not preserved)
Example patterns for dbo: 죽은곳 (=deathPlace)
  DQP: dbo:Agent, dbo:Person
  RQP: dbo:Place, dbo:PopulatedPlace, dbo:Wikidata:Q

Validation 2) Quality Assessment Accuracy
Result of Data Quality Assessment
  1438 test case patterns generated from 1070 properties
  1.4 million triples tested from Korean DBpedia

                   Total       Domain      Range     Datatype
  Triples          1,492,331
  Test cases (TC)  2,452,023   1,470,389   613,535   368,099
  Pass                         1,075,953   176,423   309,286
  Error                        394,436     437,112   58,813
  Error rate                   26.82%      71.24%    15.97%
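The error rates in the table can be recomputed from its counts. The short Python check below uses only the numbers on the slide; the helper name `truncate2` is hypothetical, and truncation (rather than rounding) to two decimals is an assumption chosen because it reproduces the reported percentages.

```python
import math

# Counts copied from the result table (test cases, passes, errors per pattern kind).
results = {
    "Domain":   {"tc": 1_470_389, "pass": 1_075_953, "error": 394_436},
    "Range":    {"tc": 613_535,   "pass": 176_423,   "error": 437_112},
    "Datatype": {"tc": 368_099,   "pass": 309_286,   "error": 58_813},
}


def truncate2(value):
    """Cut a percentage to two decimals without rounding up."""
    return math.floor(value * 100) / 100


rates = {}
for kind, row in results.items():
    # Every test case either passes or fails.
    assert row["pass"] + row["error"] == row["tc"]
    rates[kind] = truncate2(100 * row["error"] / row["tc"])

total_tc = sum(row["tc"] for row in results.values())
total_error = sum(row["error"] for row in results.values())
overall = truncate2(100 * total_error / total_tc)

print(rates)
print(total_tc, total_error, overall)
```

The overall figure obtained this way is the share of failed test cases across all three pattern kinds.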

Validation 2) Quality Assessment Accuracy
Gold standard data
  – Randomly selected 1000 triples (95% confidence, 3.5% error)
  – 2 human evaluators (agreement measured by kappa)
  – Annotate the correct type of subject and object based on the predicate
Evaluation measure
  Precision, recall, and F1-measure
Accuracy
  Table (columns: Triples, Precision, Recall, F1-measure; rows: DQP, RQP, TQP; values are not preserved)
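For reference, the three evaluation measures can be computed as in the Python sketch below. The labels are invented toy data, not the gold standard from the slides, and treating "flagged as an error" as the positive class is an assumption of this example.

```python
def precision_recall_f1(predicted, gold):
    """Score error flags against gold annotations (True means 'is an error')."""
    tp = sum(1 for p, g in zip(predicted, gold) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical flags for six triples: test-case verdicts vs. human annotation.
predicted = [True, True, False, True, False, False]
gold = [True, False, False, True, True, False]

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```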

Validation 2) Error Analysis
Error Analysis on Korean DBpedia
  The error occurrence rate over all triples is 36.31% (Pass 63.69%, Error 36.31%)
  The most frequent error case is the rdfs:range violation [3,4,18]
  When the object is literal or string data rather than a URI, object range validation cannot be performed [4]

Validation 2) Error Analysis
Error Analysis on Korean DBpedia
  Incorrect datatype setting
    e.g. a date must be set as xs:date, but it is set as xs:integer
  Incorrect object value
    e.g. the object value of prop-ko: 활동기간 (=active period) should be a period of time, but only the starting point of the period is given
  Property ambiguity
    e.g. prop-ko: 종목 (event) can have 2 totally different types of object: the name of an event or the number of events

Limitations
Lack of specific domain/range setting
  e.g. the DQP for dbo:deathPlace is dbo:Agent, dbo:Person
Quality assessment with only one triple
  e.g. dbpedia:Michael_Jackson dbo:birthDate (xsd:date)
       dbpedia:Michael_Jackson dbo:deathDate (xsd:date)
  dbo:birthDate has to be earlier than dbo:deathDate
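The second limitation concerns checks that span more than one triple. The Python sketch below shows what such a check would need to look at; it illustrates the kind of constraint that single-triple test cases cannot express and is not part of the presented approach. The function name `find_date_violations` and the in-memory triple list are hypothetical.

```python
from datetime import date

# Two triples about the same subject; the dates are typed as xsd:date values.
triples = [
    ("dbpedia:Michael_Jackson", "dbo:birthDate", date(1958, 8, 29)),
    ("dbpedia:Michael_Jackson", "dbo:deathDate", date(2009, 6, 25)),
]


def find_date_violations(triples):
    """Return subjects whose dbo:birthDate is not earlier than their dbo:deathDate."""
    by_subject = {}
    for subject, predicate, value in triples:
        by_subject.setdefault(subject, {})[predicate] = value
    return [
        subject
        for subject, props in by_subject.items()
        if "dbo:birthDate" in props
        and "dbo:deathDate" in props
        and not props["dbo:birthDate"] < props["dbo:deathDate"]
    ]


print(find_date_violations(triples))
```

Each triple passes a datatype test on its own; only the comparison across both triples can reveal an inconsistent pair.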

Conclusion
  The approach semi-automatically generates patterns from a knowledge resource
  Patterns are instantiated into test cases to measure the quality of data
  More than 97% of the patterns are generated by the approach
  This work opens a new possibility of conducting quality assessment without requiring an ontology
  It can be applied to any language and any domain

Ongoing works
  Utilizing external resources, e.g. WordNet, Thesaurus
  Pattern expansion
  Create a complete validation system for determining trustworthiness

Questions?

Reference
Linked data quality assessment
  [2] Quality assessment methodologies for linked open data. Zaveri, A. et al. Submitted to Semantic Web Journal (2013)
  [5] Weaving the pedantic web. Hogan, A. et al. (2010)
  [6] Assessing linked data mappings using network measures. Guéret et al. In The Semantic Web: Research and Applications. Springer Berlin Heidelberg (2012)
  [8] Improving curated web-data quality with structured harvesting and assessment. Feeney et al. International Journal on Semantic Web and Information Systems (IJSWIS), 10(2) (2014)
  [16] SWIQA - a semantic web information quality assessment framework. Fürber et al. In ECIS (Vol. 15, p. 19) (2011)
  [17] Using semantic web resources for data quality management. Fürber et al. In Knowledge Engineering and Management by the Masses. Springer Berlin Heidelberg (2010)

Reference
Data Quality Assessment of DBpedia
  [3] User-driven quality evaluation of DBpedia. Zaveri, A. et al. In Proceedings of the 9th International Conference on Semantic Systems. ACM (2013)
  [4] Test-driven evaluation of linked data quality. Kontokostas et al. In Proceedings of the 23rd International Conference on World Wide Web. ACM (2014)
  [18] Crowdsourcing linked data quality assessment. Acosta et al. In The Semantic Web - ISWC 2013. Springer Berlin Heidelberg (2013)
  [19] Detecting incorrect numerical data in DBpedia. Wienand et al. In The Semantic Web: Trends and Challenges. Springer International Publishing (2014)
  [20] DL-Learner: learning concepts in description logics. Lehmann, J. The Journal of Machine Learning Research, 10 (2009)
Automatic Ontology generation
  [13] Automatic ontology generation using schema information. Sie et al. In Web Intelligence, WI IEEE/WIC/ACM International Conference on. IEEE (2006)
  [14] Text2Onto. Cimiano et al. In Natural Language Processing and Information Systems. Springer Berlin Heidelberg (2005)
  [21] Automatic generation of OWL ontology from XML data source. Yahia et al. arXiv preprint (2012)
  [24] A robust approach to aligning heterogeneous lexical resources. Pilehvar et al. AP A 1 (2014): c2.