ICPP 2012 Indexing and Parallel Query Processing Support for Visualizing Climate Datasets Yu Su*, Gagan Agrawal*, Jonathan Woodring † *The Ohio State University.

Presentation transcript:

ICPP 2012
Indexing and Parallel Query Processing Support for Visualizing Climate Datasets
Yu Su*, Gagan Agrawal*, Jonathan Woodring†
*The Ohio State University
†Los Alamos National Laboratory

Outline
Motivation and Introduction
Background
System Overview and Optimization
Experiment
Conclusion

Motivation
Science is becoming increasingly data driven.
There is a strong desire for efficient data visualization.
Challenges:
– Data is generated quickly
– Disk I/O and network speeds are slow
– Performance gets worse during visualization
– Subsetting requests come in different kinds
It is difficult and unnecessary to visualize all the data.

Data Subsetting in ParaView
ParaView is a widely used data analysis and visualization application.
Problems: Load + Filter mode
– Loads the entire data set
– Filters data at the visualization level
   Threshold Filter: based on values
   Extract Subset Filter: based on dimension info
– Needs a grid transformation during filtering
   Regular Structured Grid -> Unstructured Grid

A Faster Solution
Subset at the I/O level
– The user specifies the subset in one query covering both dimension and value ranges
– Reduces I/O time and memory footprint
SQL queries in ParaView
– Queries over dimensions: API support
– Queries over values: indexing
Bitmap indices and parallel bitmap indices
– Efficient subsetting over values

Background: Bitmap Indexing
FastBit: widely used in scientific data management
Suitable for floating-point values, which are binned into small ranges
Run-length compression (WAH, BBC)
– Compresses each bitvector based on runs of consecutive 0s or 1s
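The binning and compression ideas on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `build_bitmap_index` and `run_length_encode` are hypothetical, the bin edges and sample values are made up, and the encoder is plain run-length encoding, not the word-aligned WAH or byte-aligned BBC schemes that FastBit-style indices use.

```python
from bisect import bisect_right


def build_bitmap_index(values, bin_edges):
    """Return one bitvector (list of 0/1) per bin.

    Bin i covers the half-open range [bin_edges[i], bin_edges[i + 1]).
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [[0] * len(values) for _ in range(n_bins)]
    for pos, value in enumerate(values):
        b = bisect_right(bin_edges, value) - 1
        if 0 <= b < n_bins:  # values outside all bins are simply not indexed
            bitvectors[b][pos] = 1
    return bitvectors


def run_length_encode(bits):
    """Collapse runs of consecutive equal bits into (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return [(bit, length) for bit, length in runs]


# Made-up sample values, binned into [0, 1), [1, 2) and [2, 3).
temps = [0.5, 0.7, 1.2, 1.4, 1.9, 0.1, 2.5, 2.6]
index = build_bitmap_index(temps, bin_edges=[0.0, 1.0, 2.0, 3.0])
compressed = run_length_encode(index[1])
```

Here `index[1]` marks the positions whose value falls in [1, 2), and `compressed` stores that bitvector as three runs instead of eight bits. Longer runs compress better, which is why neighbouring array elements with similar values suit this kind of index.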

Bitmap Index and Dim Subset
Run-length compression (WAH, BBC)
– Good: compression rate, fast bitwise operations
– Bad: the ability to locate a dim subset is lost
Two traditional methods:
– With bitmap indices: post-filter on dim info
– Without bitmap indices: post-filter on values
Two-phase optimization:
– Index generation: distribute indices over sub-blocks
– Index retrieval: transform dim subsetting info into bitvectors, so that fast bitwise operations can be used
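The index retrieval step, turning a dimension subset into a bitvector and combining it with a value bitvector, might look like the sketch below. It is not the authors' implementation: the helper names are hypothetical, the array is assumed to be stored in row-major order, dimension ranges are assumed to be half-open, and Python integers stand in for uncompressed bitvectors.

```python
from itertools import product


def dim_subset_bitvector(shape, ranges):
    """Bitvector (as a Python int) for a dimension subset.

    Bit p is set when row-major position p lies inside every half-open
    (lo, hi) range in `ranges`, one range per dimension.
    """
    # Row-major strides: the last dimension varies fastest.
    strides = []
    stride = 1
    for extent in reversed(shape):
        strides.append(stride)
        stride *= extent
    strides.reverse()

    mask = 0
    for coords in product(*(range(lo, hi) for lo, hi in ranges)):
        pos = sum(c * s for c, s in zip(coords, strides))
        mask |= 1 << pos
    return mask


def positions(bitvector):
    """List the positions of the set bits, lowest first."""
    return [p for p in range(bitvector.bit_length()) if (bitvector >> p) & 1]


# A made-up 2 x 3 array, flattened in row-major order.
values = [5, 12, 7, 15, 3, 20]
value_bv = sum(1 << p for p, v in enumerate(values) if v >= 10)
dim_bv = dim_subset_bitvector(shape=(2, 3), ranges=[(0, 2), (1, 3)])
hits = positions(value_bv & dim_bv)
```

A single bitwise AND answers the combined query "value >= 10 and column in [1, 3)", so no element has to be post-filtered one by one. In this example `hits` contains positions 1 and 5, which hold the values 12 and 20.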

System Overview
Parse the SQL expression
Parse the metadata file
Generate the query request
Generate the index if it has not been generated yet; retrieve it on later queries
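The last step, generating an index only when none exists and retrieving it afterwards, is essentially a build-once cache. A minimal sketch follows, assuming an in-memory store keyed by variable name. `IndexStore` and `positions_by_value` are hypothetical names and the toy index is not a bitmap index; the slides do not say how the real system stores its indices.

```python
class IndexStore:
    """Hypothetical cache: builds an index on first use, reuses it afterwards."""

    def __init__(self, build_index):
        self._build_index = build_index  # callable: data -> index
        self._indices = {}               # variable name -> index
        self.builds = 0                  # how many times an index was generated

    def get(self, variable, data):
        if variable not in self._indices:
            # Index generation happens only on the first request.
            self._indices[variable] = self._build_index(data)
            self.builds += 1
        # Every later request is a plain retrieval.
        return self._indices[variable]


def positions_by_value(data):
    """Toy index: map each distinct value to the positions where it occurs."""
    index = {}
    for pos, value in enumerate(data):
        index.setdefault(value, []).append(pos)
    return index


store = IndexStore(positions_by_value)
salinity = [3, 1, 3, 2]
first = store.get("salinity", salinity)   # generates the index
second = store.get("salinity", salinity)  # retrieves the stored index
```

The cost of index generation is paid once per variable and amortized over all later queries on it.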

Optimization 1: Distributed Index Generation
Study the relationship between queries and partitions.
Partition the data based on query preference.

Index Partition Strategy
α rate: participation rate of data elements
– α = number of elements involved in indexing / total data size
– Worst case: all elements have to be involved
– Ideal case: the elements involved are exactly those in the dim subset
Partition strategies:
– Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions.
– Strategy 2: In the general case, where subsetting over each dimension has a similar probability, the partition should give equal preference to each dim.
– Strategy 3: If queries only include a subset of the dims, the partition should also be based on those dims.
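To make the α rate concrete, the sketch below computes it for a regular block partition. It rests on an assumption that the slides do not spell out: every element of a partition that overlaps the dimension subset counts as involved. The function names, the even split of each dimension, and the 8 x 8 example are all hypothetical.

```python
from math import prod


def covered_extent(extent, n_parts, lo, hi):
    """Elements along one dimension that lie in partitions overlapping [lo, hi)."""
    covered = 0
    for i in range(n_parts):
        start = i * extent // n_parts
        stop = (i + 1) * extent // n_parts
        if start < hi and lo < stop:  # this partition intersects the range
            covered += stop - start
    return covered


def alpha_rate(shape, parts, ranges):
    """Involved elements / total elements for one dimension-subset query.

    shape  : extent of each dimension
    parts  : number of partitions along each dimension
    ranges : half-open (lo, hi) subset range for each dimension
    """
    involved = prod(
        covered_extent(extent, n_parts, lo, hi)
        for extent, n_parts, (lo, hi) in zip(shape, parts, ranges)
    )
    return involved / prod(shape)


# Query: rows 0-1 of an 8 x 8 array, all columns (25% of the data).
query = [(0, 2), (0, 8)]
no_partition = alpha_rate((8, 8), (1, 1), query)  # every element involved
by_rows = alpha_rate((8, 8), (4, 1), query)       # partition along the queried dim
by_cols = alpha_rate((8, 8), (1, 4), query)       # partition along the other dim
balanced = alpha_rate((8, 8), (2, 2), query)      # equal preference per dim
```

Under this model, partitioning along the queried dimension reaches the ideal case (α equals the subset fraction, 0.25), partitioning along the other dimension is no better than having no partition (α = 1.0), and the balanced split lands in between (α = 0.5). This matches the intuition behind Strategies 2 and 3.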

Optimization 2: Index Retrieval
Post-filter?

Parallel Index Architecture
L1: data file
L2: variable
L3: data block

Experiment Setup
Goals:
– SQL subsetting vs. Load + Filter in ParaView
– Scalability of the parallel indexing method
– Indexing and partition strategy vs. FastQuery
Dataset:
– Parallel Ocean Program
– Data size: 33.6 GB
– Data format: NetCDF (array based)
Environment:
– IBM Xeon cluster, 8 cores, 2.53 GHz
– 12 GB memory

Efficiency Comparison with Filtering in ParaView
Data size: 5.6 GB
Input: 400 queries
Results depend on the subset percentage
The general index method is better than filtering when the data subset is < 60%
The two-phase optimization achieved a 0.71 – speedup compared with the filtering method
Methods compared:
– Index m1: bitmap indexing, no optimization
– Index m2: uses bitwise operations instead of post-filtering
– Index m3: uses both bitwise operations and index partitioning
– Filter: load all data + filter

Memory Comparison with Filtering in ParaView
Data size: 5.6 GB
Input: 400 queries
Results depend on the subset percentage
The general index method has a much smaller memory cost than the filtering method
The two-phase optimization adds only a small extra memory cost
Methods compared:
– Index m1: bitmap indexing, no optimization
– Index m2: uses bitwise operations instead of post-filtering
– Index m3: uses both bitwise operations and index partitioning
– Filter: load all data + filter

Scalability with Different Proc#
Data size: 8.4 GB
Proc#: 6, 24, 48, 96
Input: 100 queries
X axis: subset percentage
Y axis: time
Each process takes care of one sub-block
Good scalability as the number of processes increases

Alpha Rate with Different Proc#
Data size: 8.4 GB
Proc#: 6, 24, 48, 96
Input: 100 queries
X axis: subset percentage
Y axis: alpha rate
More processes mean more index partitions
Good participation rate when selecting a smaller percentage of the data

Alpha Rate and IO Access Times Comparison with FastQuery
FastQuery: builds a relational table view over the scientific dataset
Difference: FastQuery doesn't consider multi-dimensional data features
Data size: 8.4 GB, 48 processes
Query types: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
Input: 100 queries for each query type

Efficiency Comparison with FastQuery
Data size: 8.4 GB
Proc#: 48
Input: 100 queries for each query type
Achieved a speedup of 1.41 to 2.12 compared with FastQuery

Conclusion
Big data is an issue in data analysis and visualization
Find the exact data subset at the I/O level with an SQL interface and bitmap indexing
– A good speedup compared with the filtering method
Data partition strategy and parallel indexing
– A good speedup compared with FastQuery

Thanks