Formalizing and Querying Heterogeneous Documents with Tables Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering.


Formalizing and Querying Heterogeneous Documents with Tables Krishnaprasad Thirunarayan and Trivikram Immaneni Department of Computer Science and Engineering Wright State University Dayton, OH-45435

Overall R&D Agenda
Develop semi-automatic techniques for information extraction and retrieval that enable man and machine to complement each other in the assimilation of semi-structured, heterogeneous documents.
=> Semantic Web Technologies

Goal (What?)
Background and Motivation (Why?)
Implementation Details (How?)
Evaluation and Applications (Why?)
Conclusions

Goal

Define, embed, and use metadata in semi-structured documents containing tables.
Content-oriented, domain-specific annotation of a human-sensible document:
Makes the semantics of complex data explicit.
Enables augmentation of an interpretation in a modular fashion.

Heterogeneous Document

Background and Motivation

Generate an XML Master Document that is both machine-processable and able to serve as a basis for human-sensible presentation. This is the basis of semi-automation in practice.

Embedding metadata improves traceability, thereby facilitating content extraction, verification, and update.

Implementation Details (How?)

XML Technology
Document-Centric View: XML is used to annotate documents for use by humans in the realm of document processing and content extraction.
Data-Centric View: XML is used as a text-based format for information exchange and serialization in the context of Web Services.

Basic idea behind our approach
Unify the two views: use XML elements to materialize the abstract syntax and, together with XML attributes and XML element definitions, formalize the content.
Key advantage: Minimizes maintenance of additional data structures to relate the original document with its formalization.

Two Concrete Implementations
Use the Web Services language Water, which amalgamates XML technology with programming language concepts.
Use the XML/XSLT infrastructure.

Water-based approach
Each annotation reflects the semantics of the text fragment it encloses.
The annotated data can be interpreted by viewing it as a function/procedure call in Water.
The correspondence between formal parameters and actual arguments is position-based.
The semantics of an annotation is defined separately in Water, as a method definition in a class.
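The position-based correspondence can be illustrated outside Water. The following is a minimal sketch in Python, not Water syntax; the names `strength_row`, `DEFINITIONS`, and `interpret` are hypothetical, and the row values are borrowed from the Prolog rendition shown later in the talk.

```python
# Illustrative only: Python stands in for Water here, and every name below
# is an assumption made for this sketch rather than part of the original system.

def strength_row(min_thickness_mm, max_thickness_mm, tensile_ksi, yield_ksi):
    """Meaning of one annotated table row, defined once, apart from the document."""
    return {
        "min_thickness_mm": min_thickness_mm,
        "max_thickness_mm": max_thickness_mm,
        "tensile_ksi": tensile_ksi,
        "yield_ksi": yield_ksi,
    }

# Background knowledge: annotation name -> its separately written definition.
DEFINITIONS = {"strength_row": strength_row}

def interpret(annotation_name, actual_arguments):
    """Treat an annotation as a call: the i-th argument binds to the i-th formal."""
    definition = DEFINITIONS[annotation_name]
    return definition(*actual_arguments)

# The annotated fragment supplies values in document order only.
row = interpret("strength_row", [0.50, 1.00, 160, 150])
```

Because binding is by position, reordering the columns in the document would silently change the meaning of each value, which is one reason a name-based alternative is also worth considering.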

Example Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    –                         –

Example of Tagged Table Thickness (mm) Tensile Strength (ksi) Yield Strength (ksi) table and under table table table....

Example of Processing Code /> <set rows= table.rows. />/> …

XML/XSLT-based approach
Each annotation reflects the semantics of the text fragment it encloses.
To make the annotated data XML-compliant, dummy attributes such as one, two, three, … are introduced.
The correspondence between formal attribute and actual value is name-based.
The semantics is defined modularly and separately, by interpreting XML elements and their XML attributes via XSLT.

Example of Tagged Table
<tableSchema one="Thickness(min)" two="Thickness(max)" three="Tensile Strength" four="Yield Strength"/>...
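To show how the name-based correspondence might be consumed, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. Only `tableSchema` and its attributes come from the slide; the wrapper element `strengthTable`, the row element `tableRow`, and the function `rows_by_column` are hypothetical, and the row values are borrowed from the Prolog rendition in this talk.

```python
import xml.etree.ElementTree as ET

# tableSchema is taken from the slide; strengthTable and tableRow are
# assumed element names, added only so the fragment is well-formed XML.
TAGGED = """
<strengthTable>
  <tableSchema one="Thickness(min)" two="Thickness(max)"
               three="Tensile Strength" four="Yield Strength"/>
  <tableRow one="0.50" two="1.00" three="160" four="150"/>
  <tableRow one="1.00" two="1.50" three="155" four="145"/>
</strengthTable>
"""

def rows_by_column(xml_text):
    """Pair each row value with its column heading via the shared attribute name."""
    root = ET.fromstring(xml_text.strip())
    schema = root.find("tableSchema").attrib  # dummy attribute name -> heading
    rows = []
    for row in root.findall("tableRow"):
        rows.append({schema[name]: float(value)
                     for name, value in row.attrib.items()})
    return rows

rows = rows_by_column(TAGGED)
```

Since values are matched to headings by attribute name rather than by position, the attributes of a row can appear in any order without changing the result.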

XSLT stylesheets can be used to:
Query: perform table look-ups.
Transform: change units of measure, such as from standard SI units to FPS units and vice versa.
Format: display the table in HTML form.
Extract: recover the original table.
Verify: check static semantic constraints on table data values.
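Two of these uses, Transform and Verify, can be sketched briefly. The example below is in Python rather than XSLT, so it illustrates the operations and not the authors' stylesheets. The function names and the specific constraints checked are assumptions chosen for illustration; the row values are borrowed from the Prolog rendition in this talk.

```python
MM_PER_INCH = 25.4  # exact by definition

# (min thickness mm, max thickness mm, tensile strength ksi, yield strength ksi)
ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]

def transform_to_inches(rows):
    """Transform: convert the thickness bounds from millimetres to inches."""
    return [(lo / MM_PER_INCH, hi / MM_PER_INCH, tensile, yld)
            for lo, hi, tensile, yld in rows]

def verify(rows):
    """Verify: return a message for each violated constraint (empty list if none).

    The constraints are illustrative assumptions: each range is non-empty,
    yield strength does not exceed tensile strength, and consecutive
    thickness ranges meet without a gap or overlap.
    """
    problems = []
    for index, (lo, hi, tensile, yld) in enumerate(rows):
        if not lo < hi:
            problems.append(f"row {index}: empty thickness range")
        if yld > tensile:
            problems.append(f"row {index}: yield strength exceeds tensile strength")
    for index, (previous, current) in enumerate(zip(rows, rows[1:]), start=1):
        if previous[1] != current[0]:
            problems.append(f"row {index}: range does not start where the previous one ends")
    return problems

problems = verify(ROWS)
rows_in_inches = transform_to_inches(ROWS)
```

Keeping the checks in one function, separate from the data, mirrors the idea of specifying a table type's constraints once and reusing them across documents.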

Evaluation and Application (Why?)

Advantage
Only the tabular data in each document is annotated.
The annotation definition is factored out as background knowledge. Thus, the semantics of each table type is specified just once, outside the document, and is reused across different documents containing similar tables.

Disadvantage
Both avenues require mature tool support for widespread adoption. For example, one could develop an MS FrontPage-like interface in which the Master document is the annotated form, the user explicitly interacts with and edits only a view of the annotated document (for readability), and an export-as-XML function generates a well-formed XML document.

Prolog rendition
strengthTableRow(0,    0.50, 165, 155).
strengthTableRow(0.50, 1.00, 160, 150).
strengthTableRow(1.00, 1.50, 155, 145).
....
% A row applies when Thickness lies in its range (L, U], as in "0.50 and under".
strengthTable(Thickness, TensileStrength, YieldStrength) :-
    strengthTableRow(L, U, TensileStrength, YieldStrength),
    L < Thickness, Thickness =< U.
thicknessToTensileStrength(Thickness, TensileStrength) :-
    strengthTable(Thickness, TensileStrength, _).
thicknessToYieldStrength(Thickness, YieldStrength) :-
    strengthTable(Thickness, _, YieldStrength).
?- thicknessToYieldStrength(0.6, YS).
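The same look-up can be sketched in Python for readers less familiar with Prolog. This sketch assumes that each row covers thicknesses above its lower bound up to and including its upper bound, as the table's "0.50 and under" wording suggests; the function names are Python-style renderings of the Prolog predicates, and only the three rows shown above are included.

```python
# (lower bound mm, upper bound mm, tensile strength ksi, yield strength ksi)
STRENGTH_TABLE_ROWS = [
    (0.0, 0.50, 165, 155),
    (0.50, 1.00, 160, 150),
    (1.00, 1.50, 155, 145),
]

def strength_table(thickness):
    """Return (tensile, yield) for the row whose range (L, U] contains thickness."""
    for lower, upper, tensile, yield_strength in STRENGTH_TABLE_ROWS:
        if lower < thickness <= upper:  # assumed range semantics
            return tensile, yield_strength
    return None  # no row applies, like a failed Prolog query

def thickness_to_tensile_strength(thickness):
    match = strength_table(thickness)
    return None if match is None else match[0]

def thickness_to_yield_strength(thickness):
    match = strength_table(thickness)
    return None if match is None else match[1]

ys = thickness_to_yield_strength(0.6)
```

Here `ys` plays the role of the variable `YS` in the Prolog query: a thickness of 0.6 mm falls in the second row.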

Conclusion and Future Work

Develop a catalog of predefined tables, specifying them using Semantic Web formalisms (such as RDF and OWL), and map tabular data into a set of predefined tables, possibly qualified.
Develop techniques for manual mapping of complex tables into simpler ones:
To provide semantics to data.
To improve traceability.
To facilitate automatic manipulation.

Tailor and improve IE and IR techniques developed in the context of text processing for Semantic Web documents (such as those in XML or RDF), benefiting from the additional support of ontologies (such as those in OWL).

Holy Grail
Ultimately, develop principles, techniques, and tools to author and extract the human-readable and machine-comprehensible parts of a document hand in hand, and keep them side by side.