Searching Technology For a Large Number Of Objects Kurt Stockinger and John Wu Lawrence Berkeley National Laboratory.

SDM All-hands, October

Outline
- Current work
  - FastBit: a compressed bitmap indexing package
  - Applications: Grid Collector, DEX, TBitmapIndex, Network Flow Data Analysis
- Future Plans
  - Extending the searching technology
  - Integrating with other SDM center technologies

FastBit
A compressed bitmap indexing technology for efficient searching of read-only data
John Wu, Ekow Otoo, Arie Shoshani, Kurt Stockinger, Doron Rotem

FastBit Overview
- FastBit is designed to search multi-dimensional data
  - Conceptually in table format: rows correspond to objects, columns to attributes
- FastBit uses vertical (column-oriented) organization for the data
  - Efficient for analysis of read-only data
- FastBit uses compressed bitmap indices to speed up searches
  - Proven in analysis to be optimal for single-attribute queries
  - Superior to other optimal indices because they are also efficient for multi-attribute queries
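
To make the multi-attribute claim concrete, here is a minimal Python sketch of an uncompressed bitmap index. It is an illustration only, not FastBit's implementation or API: the function names, the use of Python integers as bitmaps, and the sample columns are all assumptions, and compression is left out entirely. Each column gets one bitmap per distinct value, a range condition is answered by OR-ing bitmaps, and conditions on different attributes are combined with a single bitwise AND.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Map each distinct value to a bitmap (a Python int); bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)


def range_query(index, low, high):
    """OR together the bitmaps of all values v with low <= v < high."""
    result = 0
    for value, bitmap in index.items():
        if low <= value < high:
            result |= bitmap
    return result


def rows_of(bitmap):
    """List the row numbers whose bits are set, in increasing order."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical table with two attributes, stored column by column.
temperature = [1, 3, 2, 5, 4, 2]
pressure = [10, 20, 10, 30, 20, 30]

temp_index = build_bitmap_index(temperature)
pres_index = build_bitmap_index(pressure)

# Multi-attribute query: 2 <= temperature < 5 AND 20 <= pressure < 40.
hits = range_query(temp_index, 2, 5) & range_query(pres_index, 20, 40)
matching_rows = rows_of(hits)
print(matching_rows)
```

Because each attribute is indexed separately, any combination of attributes can be queried without building a dedicated multi-column index; only the final AND touches more than one column.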

Grid Collector
Put FastBit and SRM together to improve the efficiency of STAR analysis jobs
John Wu, Junmin Gu, Jerome Lauret, Arthur M. Poskanzer, Arie Shoshani, Alexander Sim, Wei-Ming Zhang

Grid Collector Features
Key features of the Grid Collector:
- Providing transparent object access
- Selecting objects based on their attribute values
- Improving analysis system's throughput
- Enabling interactive distributed data analysis

Grid Collector Speeds up Analyses
- Legend
  - Selectivity: fraction of events needed by the analysis
  - Speedup: ratio of the time to read events without GC to the time with GC
  - Speedup = 1: speed of the existing system (without GC)
- Results
  - When searching for rare events, say, selecting one event out of 1000 (selectivity = 0.001), using GC is 20 to 50 times faster
  - Even using GC to read 1/2 of events, speedup > 1.5
[Plot axis: selectivity, from less selective to more selective]
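
The two quantities in the legend can be written out as a short calculation. The sketch below is only a worked example of the definitions; the function names and the timings are hypothetical and are not measurements from the Grid Collector.

```python
def selectivity(events_needed, events_total):
    """Fraction of events the analysis actually needs."""
    return events_needed / events_total


def speedup(time_without_gc, time_with_gc):
    """Ratio of read time without the Grid Collector to read time with it."""
    return time_without_gc / time_with_gc


# Hypothetical job: one event in 1000 is needed; reading takes
# 600 s without GC and 20 s with GC (made-up timings).
job_selectivity = selectivity(1, 1000)
job_speedup = speedup(600.0, 20.0)
print(job_selectivity, job_speedup)
```

A speedup of 1 means the two read times are equal, which is the baseline of the existing system.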

DEX: Using Efficient Bitmap Indices to Accelerate Scientific Visualization
Kurt Stockinger, John Shalf, Wes Bethel, John Wu
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California

DEX: Dexterous Data Explorer
[Diagram labels: Data, Query, Visualization Toolkit (VTK)]
[Figure: 3D visualization of a supernova explosion]

Performance Results with Scientific Data
- One of the simplest tasks DEX performs is finding an isosurface
- DEX is on average a factor of three to four faster than the best isosurface algorithm of VTK
- VTK rendering time: 0.2 to 2 seconds

Query-Driven Visualization of Combustion Data Set
- a) Query: CH4 > 0.3
- b) Query: temp < 3
- c) Query: CH4 > 0.3 AND temp < 3
- d) Query: CH4 > 0.3 AND temp < 4

TBitmapIndex: An attempt to introduce FastBit to ROOT
Kurt Stockinger (1), John Wu (1), Rene Brun (2), Philippe Canal (3)
(1) Berkeley Lab, Berkeley, USA
(2) CERN, Geneva, Switzerland
(3) Fermilab, Batavia, USA

Current Status
- Built a prototype wrapper on FastBit called TBitmapIndex
  - Read one variable at a time into memory to build index
  - Each index is currently stored in a binary file
- Integrated bitmap indices to support:
  - TTree::Draw
  - TTree::Chain
- Verified the performance advantage of FastBit vs. ROOT's TTreeFormula

Experiments With BaBar Data
- Software/Hardware:
  - Bitmap Index Software is implemented in C++
  - Tests carried out on:
    - Linux CentOS
    - 2.8 GHz Intel Pentium 4 with 1 GB RAM
    - Hardware RAID with SCSI disk
- Data:
  - 7.6 million records with ~100 attributes each
  - BaBar data set:
- Bitmap Indices (FastBit):
  - 10 out of ~100 attributes
  - 1000 equality-encoded bins
  - 100 range-encoded bins
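
The difference between equality-encoded and range-encoded bins can be sketched in a few lines of Python. This is a generic illustration of the two encodings, not FastBit or TBitmapIndex code: the function names, the bin edges, and the sample values are hypothetical. In equality encoding, bitmap j marks the rows whose value falls in bin j. In range encoding, bitmap j marks the rows whose value falls in bin j or any lower bin, so a one-sided range condition needs only one bitmap.

```python
import bisect


def bin_of(value, edges):
    """Return the bin number of value, given sorted interior bin edges."""
    return bisect.bisect_right(edges, value)


def equality_encode(column, edges):
    """Bitmap j has bit i set when row i falls in bin j."""
    bitmaps = [0] * (len(edges) + 1)
    for row, value in enumerate(column):
        bitmaps[bin_of(value, edges)] |= 1 << row
    return bitmaps


def range_encode(column, edges):
    """Bitmap j has bit i set when row i falls in bin j or any lower bin."""
    nbins = len(edges) + 1
    bitmaps = [0] * nbins
    for row, value in enumerate(column):
        for j in range(bin_of(value, edges), nbins):
            bitmaps[j] |= 1 << row
    return bitmaps


# Hypothetical attribute with four bins: <1, [1,2), [2,3), >=3.
energy = [0.5, 2.5, 1.5, 3.5, 0.2]
edges = [1.0, 2.0, 3.0]

ee_bitmaps = equality_encode(energy, edges)
re_bitmaps = range_encode(energy, edges)

# Query: energy < 2.0, i.e. bins 0 and 1. The boundary coincides with a
# bin edge, so no row needs rechecking against the raw values.
from_equality = ee_bitmaps[0] | ee_bitmaps[1]  # OR of two bitmaps
from_range = re_bitmaps[1]                     # a single bitmap
print(bin(from_equality), bin(from_range))
```

The trade-off is that range-encoded bitmaps contain many more set bits, which generally makes them larger after compression than equality-encoded ones.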

Size of Compressed Bitmap Indices
- EE-BMI: equality-encoded bitmap index
- RE-BMI: range-encoded bitmap index

Query Performance - TTreeFormula vs. Bitmap Indices
- Bitmap indices are 10X faster than TTreeFormula

An Application of TBitmapIndex: Network Flow Data Analysis
Kurt Stockinger, John Wu, Scott Campbell, Stephen Lau, Mike Fisk, Eugene Gavrilov, Alex Kent, Christopher E. Davis, Rick Olinger, Rob Young, Jim Prewett, Paul Weber, Thomas P. Caudell, E. Wes Bethel, Steve Smith
LBNL, LANL, UNM

Chasing the Track of a Network Scan
- IDS log shows
  - Jul 28 17:19:56 AddressScan has scanned 19 hosts (62320/tcp)
  - Jul 28 19:19:56 AddressScan has scanned 19 hosts (62320/tcp)
- Using FastBit/ROOT to explore what else might be going on
- Queries prepared by Scott Campbell. More details at

Are There More Scans?
- Query: select ts/(60*60*24)-12843, IPR_C, IPR_D where IPS_A=211 and IPS_B=207
- More scans from the same subnet

Who Is Doing It?
- Query: select IPS_C, IPS_D where IPS_A==211 and IPS_B==207
- Picture: the histogram of IPS_C and IPS_D
- Five IP addresses started most of the scans!
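
The query on this slide amounts to a filter on two attributes followed by a two-dimensional histogram. The Python sketch below mimics that logic on a handful of made-up flow records. It is not the FastBit/ROOT code used for the analysis, and it assumes that IPS_A through IPS_D hold the four octets of one IP address, which the slides do not state explicitly.

```python
from collections import Counter

# Hypothetical flow records; the field names follow the queries on the slides.
flows = [
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 10, "IPS_D": 5},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 10, "IPS_D": 5},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 10, "IPS_D": 6},
    {"IPS_A": 192, "IPS_B": 168, "IPS_C": 1, "IPS_D": 1},
    {"IPS_A": 211, "IPS_B": 207, "IPS_C": 10, "IPS_D": 5},
]


def scan_sources(records, a, b):
    """Count records per (IPS_C, IPS_D) pair among those with IPS_A == a and IPS_B == b."""
    return Counter(
        (r["IPS_C"], r["IPS_D"])
        for r in records
        if r["IPS_A"] == a and r["IPS_B"] == b
    )


counts = scan_sources(flows, 211, 207)
busiest = counts.most_common(1)  # the pair with the most matching records
print(busiest)
```

With a bitmap index on IPS_A and IPS_B, the filter step reduces to one AND of two bitmaps, and only the selected rows need to be read to fill the histogram.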

Future Plans
Meet the challenges of searching in data-intensive sciences

Types of Searching Problems
- Not practical to work on many terabytes of data simultaneously; work on a subset instead
  - Analyze the data collected last month
  - Analyze the data collected by Joe
- Find the objects of interest
  - Find the flame front in combustion simulation
  - Find the top-talker in network communication
- Knowledge discovery
  - Association rules
  - Cliques/connection subgraphs

Searching Problems From SciDAC2 Appendix
- B.1 Experimental Combustion Science: Feature identification and tracking; 20 TB
- B.8 Empowering RHIC users with new analysis tools: Analyze subsets; ~GB/s
- B.10 U.S. LHC Experiments: Analyze subsets; ~GB/s
- B.13 The Solenoid Tracker at RHIC (STAR): Analyze subsets; 1 GB/s
- B.2 Advanced Computing for LCLS: ?, classification; 200 MB/s
- B.3 An Earth Science Knowledge System: Locating dataset of interest; PB
- B.5 Enabling Discovery in Experimental Biological Science: High-dimensional data search, data versioning, semantic graphs (ontology), multiple sources

Searching Problems From SciDAC2 Appendix (continued)
- B.4 Remote operations of LHC, CMS and ITER: Streaming data
- B.9 ARM/ACRF Program: Instrument data streams
- B.6 Enhancing Material Science Beamline: ND data array, real-time processing; 1 GB/h ?
- B.7 Large-Scale Computation for ITER: Data management
- B.11 Nanoscience: Mining simulation data together with experimental data
- B.12 The Spallation Neutron Source: Real-time image analysis, data comparison; 20 MB/s

Features of These Search Problems
- Large: many datasets are petabytes in size, with billions of records
- Complex data: multi-dimensional arrays, user-defined data types, simulation data mixed with experimental data, regular data with attributes defined by ontologies (semantic networks)
- Complex searching: data versioning, provenance-based search, catalog matching
- Beyond searching: data mining and knowledge discovery
- Real-time response: instrument control, interactive design of experiments, computational steering
- Integrated: searching is only a part of the overall data analysis; need to improve the overall throughput

Improve Existing Searching Tools
- FastBit is efficient for range queries; need to support other types of queries, e.g., joins
- FastBit is efficient for read-only data; need to support updates
- FastBit supports up to 2^32 (about 4 billion) records; need to support at least 2^64 (about 18 quintillion) records
- FastBit allows the user to choose from many different types of indices; need to choose one automatically for the user

Expand The Repertoire Of Searching Tools
- Support parallel index building and searching
- Support search of semantic networks, combining ontology with structured data
- Support data versioning (time stamps, provenance, …)
- Support robust recovery (a la POSTGRES)
- Support user-defined data types (ROOT)
- Support user-defined functions
- Support commonly used B-trees and R-trees
- Support combined searching of structured and semi-structured data, extend

Extend The Accessibility Of The Tools
- Extend the collaboration with ROOT to make FastBit seamlessly available to users
  - Implemented a prototype, need a more integrated way to read and write ROOT files
- Read data from other common file formats; write indices to the same file formats
  - netCDF, HDF (4/5)
- Extend the advantage of searching to other steps of analysis
  - Feature tracking; extending it to higher dimension; more general image analysis
- Make FastBit available in other forms
  - Web service, an actor in Kepler, …

Summary
- FastBit is efficient for range queries on read-only data
- Integration of FastBit with ROOT is getting underway
  - TBitmapIndex prototype
- Integration with other systems possible
  - Need to develop a short list based on target application area
- Plan to extend FastBit
  - Integration with ROOT will bring up a list of requirements
  - Intend to target biological applications