New (Applications of) Compiler Techniques for Data Grids

Presentation transcript:

New (Applications of) Compiler Techniques for Data Grids Gagan Agrawal

Outline
- Automatic Data Virtualization
  - SQL Implementation
  - XML/XQuery
- Automatic Wrapper Generation
  - Data Integration in Bioinformatics
- Compiling XML Query Language XQuery
  - Issues with streaming data

Data Virtualization
[Figure labels: an abstract view of data; dataset; Data Service]
- Scientific data being shared on Web/Grids
- Low-level layouts
- Need for efficient storage and processing

Our Approach: Automatic Data Virtualization
- Automatically create data services
  - A new application of compiler technology
- A meta-data descriptor describes the layout of data in a repository
- An abstract view is exposed to the users
- Two implementations:
  - Relational/SQL-based (HPDC 2004, LCPC 2004)
  - XML/XQuery-based (ICS 2003, LCPC 2003)

SQL/Relational Implementation
SELECT <Data Elements>
FROM <Dataset Name>
WHERE … AND Filter(<Data Element>);
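The query form above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the descriptor format, the field names (`x`, `y`, `value`), and the `scan`, `select`, and `make_sample` helpers are hypothetical and are not the interface of the system described in the talk. The sketch only shows the idea of a descriptor driving the decoding of a low-level binary layout, with a WHERE predicate and a user-defined Filter applied to the resulting rows.

```python
import struct
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Row = Dict[str, object]

# Hypothetical descriptor: ordered (field name, struct format code) pairs
# describing one fixed-size binary record in the repository.
DESCRIPTOR: List[Tuple[str, str]] = [("x", "i"), ("y", "i"), ("value", "d")]


def scan(raw: bytes, descriptor: Sequence[Tuple[str, str]]) -> Iterator[Row]:
    """Decode raw bytes into rows, using only the descriptor."""
    fmt = "<" + "".join(code for _, code in descriptor)  # little-endian, no padding
    names = [name for name, _ in descriptor]
    for values in struct.iter_unpack(fmt, raw):
        yield dict(zip(names, values))


def select(
    raw: bytes,
    descriptor: Sequence[Tuple[str, str]],
    columns: Sequence[str],
    where: Callable[[Row], bool],
    user_filter: Callable[[Row], bool],
) -> List[tuple]:
    """Rough analogue of SELECT columns FROM dataset WHERE where AND Filter(...)."""
    return [
        tuple(row[c] for c in columns)
        for row in scan(raw, descriptor)
        if where(row) and user_filter(row)
    ]


def make_sample() -> bytes:
    """Build a tiny in-memory 'dataset' of 3 x 3 records for the example."""
    return b"".join(
        struct.pack("<iid", x, y, float(x * y)) for x in range(3) for y in range(3)
    )


rows = select(
    make_sample(),
    DESCRIPTOR,
    columns=["x", "value"],
    where=lambda r: r["y"] == 2,             # the WHERE clause
    user_filter=lambda r: r["value"] > 0.0,  # the user-defined Filter(...)
)
print(rows)
```

The user never sees the byte layout here; only the descriptor knows it, which is the separation the abstract view is meant to provide.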

XML/XQuery Implementation
[Figure: XQuery over storage formats HDF5, NetCDF, XML, TEXT, RMDB, …; the connecting layer is marked "???"]

Approach / Contributions
- Use of XML Schemas to provide high-level abstractions on complex datasets
- Using XQuery with these Schemas to specify processing
- Issues in translation
  - High-level to low-level code
  - Data-centric transformations for locality in low-level codes
- Issues specific to XQuery
  - Recognizing recursive reductions
  - Type inferencing and translation
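To make "recognizing recursive reductions" concrete: XQuery is a functional language, so an accumulation over a sequence is often written as a recursive function. The Python sketch below is an illustration only, not the compiler's actual analysis or output. The names `recursive_sum` and `iterative_sum` are hypothetical; the first mimics how such a reduction might be written, the second shows the loop form it could be rewritten into once the recursion is recognized as a reduction.

```python
from typing import Iterable, Sequence


def recursive_sum(values: Sequence[float], i: int = 0) -> float:
    """Reduction in recursive, functional style: one call per element."""
    if i == len(values):
        return 0
    return values[i] + recursive_sum(values, i + 1)


def iterative_sum(values: Iterable[float]) -> float:
    """The same reduction as a loop with a single accumulator."""
    acc = 0
    for v in values:
        acc += v
    return acc


data = [1, 2, 3, 4]
print(recursive_sum(data), iterative_sum(data))
```

The loop form needs no call stack proportional to the input size and visits elements in storage order, which is what makes further locality transformations possible.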

Wrappers
- Goal: to provide the integration system transparent access to data sources
- Challenges
  - Development cost
  - Performance: scripting languages can be slow
  - Updates: data formats can change frequently

Our Approach
- Machine-interpretable metadata
- A layout descriptor associated with each dataset
- Wrappers generated on the fly
- Applied to several bioinformatics examples

Layout Descriptor

DATASET "FASTAData" {
  DATATYPE {FASTA}
  DATASPACE LINESIZE=80 {
    LOOP ENTRY 1:EOF:1 {
      ">" ID " " DESCRIPTION
      < "\n" SEQ >
      "\n" | EOF
    }
  }
  DATA {osu/fasta}
}

Annotations: "FASTAData" is the dataset name, FASTA is the schema name, the DATASPACE block gives the file layout, and DATA gives the file location.

Example file (each header line holds ID and DESCRIPTION; the lines that follow are SEQ):

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDA
DYDGFKTNCSNVSVVHCTNL
MNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKH
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDT…
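As a rough illustration of what a wrapper for this layout has to do, here is a hand-written Python sketch. It is not output of the wrapper generator described in the talk; `parse_fasta` and `SAMPLE` are hypothetical names, and the sample is a shortened copy of the first entry above. The function follows the descriptor's loop: a ">" marker, an ID up to the first space, a DESCRIPTION to the end of the line, then one or more SEQ lines until the next entry or end of file.

```python
from typing import Dict, Iterable, Iterator, Optional


def parse_fasta(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {ID, DESCRIPTION, SEQ} record per FASTA entry."""
    entry: Optional[Dict[str, str]] = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if entry is not None:
                yield entry  # the previous entry is complete
            ident, _, description = line[1:].partition(" ")
            entry = {"ID": ident, "DESCRIPTION": description, "SEQ": ""}
        elif line and entry is not None:
            entry["SEQ"] += line  # sequence data may span several lines
    if entry is not None:
        yield entry  # last entry, terminated by EOF


# Shortened sample based on the example file above.
SAMPLE = (
    ">Example1 envelope protein\n"
    "ELRLRYCAPAGFALLKCNDA\n"
    "DYDGFKTNCSNVSVVHCTNL\n"
    ">Example2 synthetic peptide\n"
    "HITREPLKHIPKERYRGTNDT\n"
)

entries = list(parse_fasta(SAMPLE.splitlines()))
for e in entries:
    print(e["ID"], "|", e["DESCRIPTION"], "|", len(e["SEQ"]))
```

Writing this by hand for every format is the development cost the previous slide mentions; generating it from the descriptor means a format change only requires editing the descriptor.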

XQuery on Streaming Data
- Infinite data streams
- All processing must be single pass
- Interesting compiler questions:
  - How do I transform code so that it executes in a single pass?
  - How can I tell that it can be executed correctly in a single pass?
- Addressed this problem for XML streams and the XML query language XQuery
- Appears in VLDB 2005
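The single-pass constraint can be illustrated with a small Python sketch. It is an assumption-laden illustration, not the transformation from the VLDB 2005 work: `two_pass_mean` and `running_mean` are hypothetical names. A mean written as "sum, then count" reads its input twice, which a stream does not allow; fusing both accumulations into one loop gives a form that reads each item once and can emit results on an unbounded stream.

```python
from itertools import count, islice
from typing import Iterable, Iterator, Sequence


def two_pass_mean(values: Sequence[float]) -> float:
    """Reads the input twice, so it needs stored data, not a stream."""
    total = sum(values)             # first pass
    n = sum(1 for _ in values)      # second pass
    return total / n


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Single pass: both accumulators are updated as each item arrives."""
    total = 0.0
    n = 0
    for v in stream:
        total += v
        n += 1
        yield total / n  # result so far; never re-reads earlier items


stored = [1.0, 2.0, 3.0]
print(two_pass_mean(stored))
print(list(running_mean(iter(stored))))

# Works on an unbounded stream too: take the first three running means of 1, 2, 3, ...
first_three = list(islice(running_mean(count(1)), 3))
print(first_three)
```

The two functions agree on stored data, but only the fused version terminates with useful output when the input never ends.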