Published by Sabina Maria Sims. Modified over 5 years ago.
1
New (Applications of) Compiler Techniques for Data Grids
Gagan Agrawal
2
Outline
Automatic Data Virtualization
  SQL Implementation
  XML/XQuery
Automatic Wrapper Generation
  Data Integration in Bioinformatics
Compiling XML Query Language XQuery
  Issues with streaming data
3
Data Virtualization
[Diagram labels: An abstract view of data, Data Virtualization, Data Service, dataset]
Scientific data being shared on Web/Grids
Low-level layouts
Need for efficient storage and processing
4
Our Approach: Automatic Data Virtualization
Automatically create data services
A new application of compiler technology
A meta-data descriptor describes the layout of data in a repository
An abstract view is exposed to the users
Two implementations:
  Relational/SQL-based (HPDC 2004, LCPC 2004)
  XML/XQuery-based (ICS 2003, LCPC 2003)
5
SQL/Relational Implementation
SELECT <Data Elements>
FROM <Dataset Name>
WHERE … AND Filter(<Data Element>);
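One way to picture the SELECT template is as a scan over raw records that is driven by a layout descriptor, with the WHERE clause applied as a filter. The following is a minimal Python sketch under stated assumptions, not the code the system generates: the descriptor format `LAYOUT`, the field names, and the helpers `scan` and `select` are all hypothetical.

```python
import struct

# Hypothetical layout descriptor: field names plus a struct format for one
# fixed-size binary record (little-endian int32, float64, float64).
LAYOUT = {"fields": ("time", "x", "pressure"), "format": "<idd"}


def scan(raw, layout):
    """Yield one dict per record, decoding bytes according to the layout."""
    record = struct.Struct(layout["format"])
    for offset in range(0, len(raw) - record.size + 1, record.size):
        values = record.unpack_from(raw, offset)
        yield dict(zip(layout["fields"], values))


def select(raw, layout, columns, where=lambda row: True):
    """Rough analogue of SELECT columns FROM dataset WHERE where(row)."""
    for row in scan(raw, layout):
        if where(row):
            yield tuple(row[c] for c in columns)


# Build a tiny in-memory "dataset" of three records for demonstration.
packer = struct.Struct(LAYOUT["format"])
raw = b"".join(
    packer.pack(t, x, p)
    for t, x, p in [(1, 0.5, 10.0), (2, 1.5, 20.0), (3, 2.5, 30.0)]
)

# Analogue of: SELECT time, pressure FROM raw WHERE x > 1.0
result = list(select(raw, LAYOUT, ("time", "pressure"),
                     where=lambda r: r["x"] > 1.0))
print(result)
```

The user-facing query names only columns and a predicate; everything about byte offsets lives in the descriptor, which is the separation the slide describes.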
6
XML/XQuery Implementation
[Diagram labels: XQuery, XML, ???, HDF5, NetCDF, TEXT, RMDB, …]
7
Approach / Contributions
Use of XML Schemas to provide high-level abstractions on complex datasets
Using XQuery with these Schemas to specify processing
Issues in translation
  High-level to low-level code
  Data-centric transformations for locality in low-level codes
Issues specific to XQuery
  Recognizing recursive reductions
  Type inferencing and translation
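To illustrate what "recognizing recursive reductions" can mean: XQuery is a functional language, so an aggregation is often written as a recursive function rather than a loop with an accumulator. The Python sketch below shows the two shapes side by side. It is an illustration only, not the compiler's actual analysis; both function names are hypothetical, and the rewrite assumes the combining operation is associative.

```python
def total_recursive(values):
    """Reduction written recursively, the way a functional query might express it."""
    if not values:
        return 0
    return values[0] + total_recursive(values[1:])


def total_iterative(values):
    """The same reduction as a single loop with an accumulator.

    Assumes '+' is associative, so folding from the left gives the same
    result as the right-nested recursive form.
    """
    acc = 0
    for v in values:
        acc += v
    return acc


data = [3, 1, 4, 1, 5]
print(total_recursive(data), total_iterative(data))
```

The iterative form visits each element once and keeps constant state, which is the shape that low-level code generation and locality transformations can work with.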
8
Wrappers
Goal: to provide the integration system with transparent access to data sources
Challenges
  Development cost
  Performance: scripting languages can be slow
  Updates: data formats can change frequently
9
Our Approach
Machine-interpretable metadata
A layout descriptor associated with each dataset
Wrappers generated on the fly
Applied to several bioinformatics examples
10
Layout Descriptor

DATASET "FASTAData" {            (dataset name)
  DATATYPE {FASTA}               (schema name)
  DATASPACE LINESIZE=80 {        (file layout)
    LOOP ENTRY 1:EOF:1 {
      ">" ID " " DESCRIPTION
      < "\n" SEQ >
      "\n" | EOF
    }
  }
  DATA {osu/fasta}               (file location)
}

Example file contents (header line: ">" ID DESCRIPTION; following lines: SEQ):

>Example1 envelope protein
ELRLRYCAPAGFALLKCNDA
DYDGFKTNCSNVSVVHCTNL
MNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKH
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDT…
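For intuition, the entry layout in the descriptor (a ">" marker, an ID, a space, a DESCRIPTION, then one or more SEQ lines until the next entry or EOF) corresponds to a reader like the hand-written Python sketch below. It only illustrates what a wrapper for this layout has to do; it is not output of the wrapper generator, and the function name `read_fasta` and the tuple shape it yields are assumptions.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (ID, DESCRIPTION, [SEQ, ...]) for each entry in the input lines."""
    entry_id, description, seqs = None, "", []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if entry_id is not None:
                yield entry_id, description, seqs
            # ">" ID " " DESCRIPTION: split the header at the first space.
            entry_id, _, description = line[1:].partition(" ")
            seqs = []
        elif line and entry_id is not None:
            # Any non-empty line after a header is one SEQ line.
            seqs.append(line)
    # EOF closes the last entry.
    if entry_id is not None:
        yield entry_id, description, seqs


# Sample taken from the slide; the second sequence is truncated there,
# so only the visible prefix is used here.
SAMPLE = """\
>Example1 envelope protein
ELRLRYCAPAGFALLKCNDA
DYDGFKTNCSNVSVVHCTNL
MNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKH
>Example2 synthetic peptide
HITREPLKHIPKERYRGTNDT
"""

entries = list(read_fasta(SAMPLE.splitlines()))
for entry_id, description, seqs in entries:
    print(entry_id, "|", description, "|", len(seqs), "sequence line(s)")
```

Writing one such reader per format by hand is the development and maintenance cost the wrapper slides point to; deriving it from the descriptor means a format change only requires editing the descriptor.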
11
XQuery on Streaming Data
Infinite data streams
All processing must be single pass
Interesting compiler questions:
  How can code be transformed to execute in a single pass?
  How can we tell that it can be executed correctly in a single pass?
Addressed this problem for XML streams and the XML query language XQuery
Appears in VLDB 2005
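The single-pass constraint can be made concrete with a small Python sketch. It is an illustration of the constraint, not the XQuery compilation technique from the talk; `running_mean` and the synthetic `sensor` source are hypothetical names. A mean can be maintained in one pass with constant state, so it works on an unbounded stream, whereas a query such as "all values above the overall mean" would need the whole input before emitting anything.

```python
import itertools


def running_mean(stream):
    """Consume the stream once, keeping only a count and a running total."""
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count


def sensor():
    """Hypothetical unbounded source: 0.0, 1.0, 2.0, 3.0, 0.0, 1.0, ..."""
    n = 0
    while True:
        yield float(n % 4)
        n += 1


# Take only the first four results; the stream itself never ends.
first = list(itertools.islice(running_mean(sensor()), 4))
print(first)
```

The state kept between elements (`count` and `total`) is bounded regardless of stream length, which is the property a compiler would need to establish before claiming a query is single-pass safe.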