Using XQuery for Flat-File Scientific Datasets Xiaogang Li Gagan Agrawal The Ohio State University.


1 Using XQuery for Flat-File Scientific Datasets Xiaogang Li Gagan Agrawal The Ohio State University

2 Motivation
The need
- Analysis of datasets is becoming crucial for scientific advances
- Emergence of X-Informatics
- Complex data formats complicate processing
- Need for applications that are easily portable – compatibility with web/grid services
The opportunity
- The emergence of XML and related technologies developed by W3C
- XML is already extensively used as part of Grid/Distributed Computing
- Can XML help in scientific data processing?

3 Motivation (Contd.)
- Traditionally, scientific datasets have not been stored, managed, or processed using a traditional DBMS
  - High storage overhead
  - Query languages not suitable
- Can this change with XML/XQuery?
  - Expressiveness of XQuery
  - High storage overhead of XML is still an issue

4 The Whole Picture (diagram labels: TEXT, ADR, NetCDF, RMDB, HDF5, XML, XQuery, ???)

5 Our goals
Support datasets of different formats
- HDF5
- NetCDF
- Chunked multi-dimensional datasets
Ease of programming
- Provide a high-level abstraction of datasets
- Physical details are hidden from application developers
Compiler optimizations for performance
- Physical details are exposed to the compiler
- Optimizations at both high level and low level

6 Outline
- XQuery language features
- Scientific data processing applications
- System overview
- Compiler analyses
  - Translation from high-level to low-level XQuery
  - Identifying and parallelizing reductions
  - Data-centric transformations
  - Type inferencing for XQuery to C++ translation
- System Architecture
  - Use of Active Data Repository (ADR) as the runtime system
- Experimental results
- Conclusions

7 XQuery Overview
XQuery
- A language for querying and processing XML documents
- Functional language
- Single assignment
- Strongly typed
XQuery expressions
- for, let, where, return (FLWR)
- unordered
- path expressions

    unordered(
      for $d in document("depts.xml")//deptno
      let $e := document("emps.xml")//emp[Deptno = $d]
      where count($e) >= 10
      return {$d, {count($e)} {avg($e/salary)}}
    )

8 XQuery - An Example
Color histogram summarizing the color characteristics of an image

    for $color in ($red, $blue, $White)
    let $p := document("image.xml")/pixel
    where $p/color = $color
    return {$color} {count($p)}

9 Target Applications
- Focus on scientific data processing applications
- Arise in a number of scientific domains
- Frequently involve generalized reductions
- Can be expressed conveniently in XQuery
- Low-level data layout is hidden from the application programmers

10 Satellite Data Processing
(figure: chunks laid out along the time axis, Time[t])
- Data collected by satellites is a collection of chunks, each of which captures an irregular section of the earth at time t
- The entire dataset comprises multiple pixels for each point on earth at different times, but not for all times
- Typical processing is a reduction along the time dimension, which is hard to write against the raw data format

11 Satellite - XQuery Code

    unordered(
      for $i in ($minx to $maxx)
      for $j in ($miny to $maxy)
      let $p := document("sate.xml")/data/pixel
      return {$i} {$j} {accumulate($p)}
    )

    define function accumulate($p) as double
    {
      let $inp := item-at($p, 1)
      let $NVDI := (($inp/band1 - $inp/band0) div ($inp/band1 + $inp/band0) + 1) * 512
      return
        if (empty($p)) then 0
        else { max($NVDI, accumulate(subsequence($p, 2))) }
    }
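As a rough illustration of the reduction that the query above expresses, the following Python sketch keeps the largest index value seen at each grid position. It is not the code the compiler generates. The flat tuple layout `(x, y, band0, band1)`, the function names, and the assumption that each pixel is matched to its own `(x, y)` position are all hypothetical and introduced only for this example.

```python
from collections import defaultdict


def index_value(band0, band1):
    """Same arithmetic as $NVDI in the query: ((b1 - b0) / (b1 + b0) + 1) * 512."""
    return ((band1 - band0) / (band1 + band0) + 1) * 512


def max_index_per_point(pixels, minx, maxx, miny, maxy):
    """Reduce along the time dimension: keep the largest value per (x, y).

    `pixels` is an assumed flat list of (x, y, band0, band1) tuples; several
    tuples may share one (x, y) because they were captured at different times.
    Positions with no pixel are simply absent from the result.
    """
    best = defaultdict(float)  # starts at 0, the value used for an empty sequence
    for x, y, band0, band1 in pixels:
        if minx <= x <= maxx and miny <= y <= maxy:
            best[(x, y)] = max(best[(x, y)], index_value(band0, band1))
    return dict(best)
```

The loop visits each pixel once, which is the behavior the recursive `accumulate` function describes declaratively.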

12 VMScope - XQuery Code

    unordered(
      for $i in ($x1 to $x2)
      for $j in ($y1 to $y2)
      let $p := document("vmscope.xml")/data/pixel[(x = $i) and (y = $j) and (scale >= $z1)]
      return {$i} {$j} {accumulate($p)}
    )

    define function accumulate($p) as element
    {
      if (empty($p)) then $null
      else
        let $max := accumulate(subsequence($p, 2))
        let $q := item-at($p, 1)
        return
          if (($q/scale < $max/scale) or ($max = $null)) then $max
          else $q
    }

13 Challenges
- Reductions expressed as recursive functions
  - Direct execution can be very expensive
- Generating code in an imperative language
  - For either direct compilation or use as part of a runtime system
  - Requires type conversion
- Enhancing locality
  - Data-centric execution on XQuery constructs
  - Use information on low-level data layout

14 System Architecture (diagram labels: External Schema, XQuery Sources, Compiler, XML Mapping Service, logical XML schema, physical XML schema, C++/C)

15 Compilation Framework (diagram labels: ADR Code Generation, Local Reduction, Global Reduction, Recursion Analysis, Type Analysis, Data Centric Analysis, Compiler Analysis, XQuery, Schema, XQuery Parser, Schema Parser, Compiler Front-end)

16 Compiler Analysis and Tasks
Analysis of recursive functions
- Extract the associative and commutative operations involved
- Transform recursive functions into iterative operations
Data-centric transformation
- Reconstruct unordered loops of the query
- New strategy requires only one scan of the dataset
Type inferencing and analysis

17 Recursion Analysis
Assumption: expressed in canonical form
1) Linear recursive function
2) Operation is associative and commutative
Objective
- Extract associative and commutative operations
- Extract initialization conditions
- Transform into iterative operations
- Generate a global reduction function
Canonical form

    define function F($t)
    {
      if (p1) then F1($t)
      else F2(F3($t), F4(F(F5($t))))
    }
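To make the canonical form concrete, here is a minimal Python sketch of one linear recursive reduction and the loop it can be rewritten into. It assumes `max` as the combining operation F2, the identity function for F4, and 0 as the initial value; the function names are hypothetical and the sketch is not the compiler's actual output.

```python
def accumulate_recursive(values):
    """Linear recursion in the canonical shape F2(F3(t), F4(F(F5(t))))."""
    if not values:                       # p1: the sequence is empty
        return 0                         # F1: initial value
    head, rest = values[0], values[1:]   # F3 ~ item-at(t, 1), F5 ~ subsequence(t, 2)
    return max(head, accumulate_recursive(rest))  # F2 = max, F4 = identity


def accumulate_iterative(values):
    """Loop form, valid because max is associative and commutative."""
    result = 0                 # initialization extracted from the base case
    for v in values:           # elements may be visited in any order
        result = max(result, v)
    return result
```

The recursive version combines elements from the end of the sequence backwards, while the loop combines them front to back. The two agree only because the extracted operation is associative and commutative, which is why the analysis has to establish that property first.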

18 Recursion Analysis - Algorithm
Algorithm
1. Add leaf nodes that represent or are defined by the recursive function to set S.
2. Keep in S only the nodes that may be returned as the final value.
3. Recursively find the least common ancestor A of all nodes in S.
4. Return the subtree with A as root.
5. Examine whether the subtree represents an associative and commutative operation.
Example

    define function accumulate($p) return double
      if (empty($p)) then 0
      else
        let $val := accumulate(subsequence($p, 2))
        let $q := item-at($p, 1)
        return
          if ($q >= $val) then $val
          else $q
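Step 5 matters because a reduction can only be split into local and global phases when the extracted operation is associative and commutative. The Python sketch below illustrates that idea under stated assumptions: it runs sequentially, the strided chunking scheme is made up for the example, and `local_reduction`, `global_reduction`, and `chunked_reduce` are hypothetical names rather than ADR functions.

```python
from functools import reduce


def local_reduction(chunk, op, identity):
    """Reduce the elements owned by one (hypothetical) processor."""
    return reduce(op, chunk, identity)


def global_reduction(partials, op, identity):
    """Combine the per-processor partial results into the final value."""
    return reduce(op, partials, identity)


def chunked_reduce(data, num_chunks, op, identity):
    """Split data into strided chunks, reduce locally, then combine globally.

    The strided split reorders elements, so the result matches a sequential
    reduction only when `op` is both associative and commutative.
    """
    chunks = [data[i::num_chunks] for i in range(num_chunks)]
    partials = [local_reduction(c, op, identity) for c in chunks]
    return global_reduction(partials, op, identity)
```

With `max` the chunked result equals the sequential one, while an operation such as subtraction gives a different answer, which is the case the check in step 5 is meant to rule out.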

19 Data Centric Transformation
Objective
- Reconstruct unordered loops of the query so that only one scan of the entire dataset is sufficient
Algorithm
1. Perform any loop fusion that is necessary
2. Generate an abstract iteration space
3. Extract necessary and sufficient conditions that map an element in the dataset to the iteration space

20 Naïve Strategy (figure: Dataset, Output) Requires 3 scans

21 Data Centric Strategy (figure: Datasets, Output) Requires just one scan
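The difference between the two strategies can be sketched in Python using the three-color histogram as a running example. This is an illustration under assumptions: the dataset is modeled as a plain list of color names, the scan counter exists only to make the cost visible, and both function names are hypothetical.

```python
def histogram_naive(pixels, colors):
    """Iterate over output elements; each one triggers a full dataset scan."""
    scans = 0
    counts = {}
    for color in colors:
        scans += 1  # one complete pass over the dataset per color
        counts[color] = sum(1 for p in pixels if p == color)
    return counts, scans


def histogram_data_centric(pixels, colors):
    """Iterate over input elements; map each one to the output it updates."""
    counts = {color: 0 for color in colors}
    for p in pixels:        # the only scan of the dataset
        if p in counts:     # condition mapping an element to the iteration space
            counts[p] += 1
    return counts, 1
```

For three colors the naive version reads the dataset three times and the data-centric version reads it once, while both produce the same counts.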

22 Type inferencing
Objective
- Type analysis for generation of C/C++ code
Challenge
- An XQuery expression may return multiple types
- XQuery supports parametric polymorphism
Algorithm
- Constraint-based type inference algorithm
- Bottom-up and recursive
- Compute a set of all possible types an expression may return

    typeswitch ($pixel) {
      case element double pixel return max($pixel/d_value, 0)
      case element integer pixel return max($pixel/i_value, 0)
      default 0
    }
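The idea of computing the set of all types an expression may return can be sketched as a small bottom-up pass in Python. This is a simplified stand-in, not the constraint-based algorithm itself: the tuple encoding of expressions, the `infer` function, and the rule that `max` yields a float whenever either operand may be a float are all assumptions made for the example.

```python
def infer(expr, env):
    """Return the set of type names `expr` may evaluate to (bottom-up)."""
    kind = expr[0]
    if kind == "const":            # ("const", value)
        return {type(expr[1]).__name__}
    if kind == "var":              # ("var", name), types looked up in env
        return set(env[expr[1]])
    if kind == "max":              # ("max", a, b), assumed widening rule
        types = infer(expr[1], env) | infer(expr[2], env)
        return {"float"} if "float" in types else {"int"}
    if kind == "typeswitch":       # ("typeswitch", [branch, ...])
        result = set()
        for branch in expr[1]:     # union over every branch that may be taken
            result |= infer(branch, env)
        return result
    raise ValueError(f"unknown expression kind: {kind}")
```

Applied to an encoding of the typeswitch above, with `d_value` declared as a float and `i_value` as an int, the pass reports that the expression may return either type. A result set with more than one simple type is the situation in which generated code would need a union.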

23 Type inference
C++ code generation
- Use a union to represent multiple simple types
- Use C++ class polymorphism to represent multiple complex types
- Use function cloning for parametric polymorphism

    class t_pixel pixel;
    struct tmp_result_1 {
      union { double r1; int r2; };
    };
    tmp_result_1 t1;
    if (pixel.tag == double_pixel)
      t1.r1 = max_1(pixel.d_value, 0);
    else if (pixel.tag == integer_pixel)
      t1.r2 = max_2(pixel.i_value, 0);
    else
      t1.r2 = 0;

24 Evaluating Data Centric Transformation (figure panels: Virtual Microscope, Satellite)

25 Parallel Performance - VMScope

26 Parallel Performance - Satellite

27 Conclusions
- A case for the use of XML technologies in scientific data analysis
- XQuery – a data parallel language?
- Identified and addressed compilation challenges
- A compilation system has been built
- Very large performance gains from data-centric transformations
- Achieves good parallel performance

