Presentation on theme: "Querying and Analysing Big Scientific Data"— Presentation transcript:

1 Querying and Analysing Big Scientific Data
Thomas Heinis. Thanks for the introduction. What I will talk about today is motivated by what is happening in the sciences: a drastic paradigm shift towards data-driven scientific discovery.

2 Data-Driven Scientific Discovery
Large Hadron Collider (ATLAS): 12 petabytes per experiment. Sloan Digital Sky Survey: 4 petabytes per year. Human Brain Project: ~100 gigabytes per second.
Scientists no longer define a hypothesis, design an experiment, collect the data and then analyse it to support the hypothesis. Instead, they collect as much data as possible and derive hypotheses and supporting evidence from it. Examples: astronomers no longer make single observations but scan the entire sky and then analyse the data; in fluid dynamics, scientists no longer study a phenomenon with pen and paper but define a model and simulate it on a supercomputer, producing a plethora of data. This is true across disciplines: the LHC, the SDSS (whose successor, the Large Synoptic Survey Telescope, will produce the same volume of data per night) and the Human Brain Project, the projects I am affiliated with. Storing this data is already a formidable undertaking in every discipline, but analysing it is an even more daunting challenge. And it is not just that there is a lot of data; the challenge is that it is growing rapidly. Scientists Are Overwhelmed with Big Data.

3 Scientific Data Growth
Growth is due to two trends: more precise instruments and cheaper computational power. Why now? Observational data: high-precision instruments. Simulation data: cheap computing resources. Astronomy: more precise instruments (National Radio Astronomy Observatory: Atacama Large Millimeter Array). Physics: more precise instruments and also bigger experiments (CERN high-energy physics data from the ATLAS, LHCb, CMS and ALICE experiments). Simulation: cheaper and therefore more hardware (Integrated Computing Environment for Scientific Simulation, ICESS). Gene sequencing: new, faster instruments, i.e. next-generation sequencing (EBI, the European Bioinformatics Institute). Scientific Data Grows Superlinearly!

4 Data in the Simulation Sciences
Increasing level of detail (RESOLUTION), increasing simulation duration (DURATION), increasing model size by orders of magnitude (COVERAGE). Let me give an example of how data grows in the simulation sciences, taken from the brain simulation effort of the Human Brain Project. Data growth is a multi-dimensional problem: coverage grows from small to bigger models (they started out with a small one); resolution increases as instruments improve and everything is modelled in more detail; finally, the duration of the simulations increases. The dimensions are not independent but multiplicative, meaning that data grows superlinearly or even exponentially. We need algorithms that can scale with this trend, otherwise we are doomed! Dimensions are Multiplicative!
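As a back-of-the-envelope illustration of why multiplicative dimensions imply superlinear growth (my own simplified model, not a formula from the slides):

```latex
% Illustrative model only: dataset size as the product of the three dimensions.
% If coverage, resolution and duration each grow by a modest factor k,
% the data volume grows by k^3, i.e. superlinearly.
D = C \cdot R \cdot T
\quad\Longrightarrow\quad
D' = (kC)\,(kR)\,(kT) = k^{3} D
```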

5 Simulation Science Data Challenges
Spatial Modeling, Simulation, Observational Data, Post-Simulation Data, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration, Spatial Analysis. But where does the massive data actually hurt? This is the simple workflow used in the simulation sciences. In the remainder of this talk I will present the work my Ph.D. students and I have done in recent years to address these challenges. We Need Scalable Spatial Access Methods.

6 Simulation Science Data Challenges
Spatial Modeling, Simulation, Observational Data, Post-Simulation Data, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration, Spatial Analysis. In the remainder of this talk I will present the work my Ph.D. students and I have done in recent years to address these challenges. We Need Scalable Spatial Access Methods.

7 Efficient Spatial Index is Crucial
Static Exploration: a neural tissue model, built from single-neuron 3D models, is explored with 3D spatial range queries. An Efficient Spatial Index is Crucial.

8 State-of-the-Art Spatial Indexes
R-Tree: a hierarchy of Minimum Bounding Rectangles (MBRs). R-Tree variants: Hilbert-packed R-Tree, STR R-Tree, PR-Tree. Range query, overlap.
To execute a query, the R-Tree is traversed top down by testing whether the query MBR intersects the child-node MBRs. In case of multiple intersections, all matching paths are followed during the traversal. The overlap problem is therefore generic to the R-Tree, since query execution nearly always works this way. Some space-partitioning hierarchies such as Quadtrees, Octrees and KD-Trees also suffer from overlap when used with volumetric objects (rather than points), because they stretch the tree-node MBRs to accommodate the volumes. Query execution only avoids the overlap problem when objects are replicated (as in the R+-Tree or a Quadtree with replication). Structural Overlap Degrades Performance.
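To make the overlap problem concrete, here is a minimal sketch of top-down R-Tree range traversal; the node layout (.is_leaf, .children, .entries, .mbr) is an assumed illustration, not the API of any particular library.

```python
# Minimal sketch of top-down R-Tree range traversal. Each internal node stores
# child MBRs; when the query rectangle intersects several child MBRs, every
# matching subtree must be visited -- this is the overlap overhead.

def intersects(a, b):
    """Axis-aligned box intersection; boxes are (min_corner, max_corner) tuples."""
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d] for d in range(len(a[0])))

def range_query(node, query_box, results):
    if node.is_leaf:
        results.extend(obj for obj in node.entries
                       if intersects(obj.mbr, query_box))
    else:
        for child in node.children:
            # Overlapping sibling MBRs can all intersect the query,
            # so the traversal may fan out into many paths.
            if intersects(child.mbr, query_box):
                range_query(child, query_box, results)
```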

9 Scalability Challenge
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Range queries: 500 uniform random queries per experiment. Spatial density increases with dataset size. The State of the Art Does Not Scale with Density.

10 FLAT: A Two Phase Spatial Index
Key idea: two phases, each independent of overlap. 1) SEEDING: find any one object. 2) CRAWLING: traverse the neighborhood. For this reason we developed FLAT, a two-phase spatial index. These are the high-level ideas, but there is a problem if we are to implement them: crawling requires reachability between neighboring objects. Use Connectivity to Avoid Overlap.

11 Add Connectivity → Enable Recursive Crawling
FLAT: Reachability. Index building: 1) Partitioning: group spatially close elements. 2) Linking: connect neighboring partitions.
The tessellation we use is called "sort tiling". It requires us to sort the data as many times as the dataset has dimensions. We first sort the data on the X dimension and create cuberoot(total objects / page size) buckets; for each bucket we sort on the Y dimension and again create cuberoot(total objects / page size) buckets; for each of those we sort on the Z dimension and create cuberoot(total objects / page size) buckets. Our data is three-dimensional, hence the cube root; for 2D data the square root would be used. We also tried a Voronoi tessellation: it works but is very slow, since the 3D version of Voronoi has O(N²) complexity. Any tessellation can work. We sort using the centres of the objects, and to accommodate volumetric objects instead of points we stretch the boundaries of the tessellation partitions. We do linking by first constructing an R-Tree on the partition MBRs (the same R-Tree we later use for seeding) and then querying it with each partition MBR to find the neighboring, intersecting partitions; the complexity of linking is therefore O(N log N). The linking problem is a spatial self-join and can be done faster in memory using grid hashing. Add Connectivity → Enable Recursive Crawling.
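Below is a minimal sketch, under stated assumptions, of the sort-tiling partitioning step: objects are assumed to expose a .center attribute with (x, y, z) coordinates, and page_size is the target partition size. The bucket counts follow the root rule described above; this is an illustration, not the FLAT implementation.

```python
import math

def sort_tile(objects, page_size, dims=3, axis=0):
    """Return a list of partitions (lists of objects) of roughly page_size each."""
    if not objects:
        return []
    if axis == dims:                        # all axes consumed -> final partition
        return [objects]
    # Number of slabs on this axis ~ d-th root of (N / page_size), as on the slide.
    slabs = max(1, round((len(objects) / page_size) ** (1.0 / (dims - axis))))
    objects = sorted(objects, key=lambda o: o.center[axis])
    per_slab = math.ceil(len(objects) / slabs)
    partitions = []
    for i in range(0, len(objects), per_slab):
        # Recurse on the next axis within each slab.
        partitions.extend(sort_tile(objects[i:i + per_slab],
                                    page_size, dims, axis + 1))
    return partitions
```

The linking step would then build an R-Tree over the partition MBRs and probe it with each partition MBR to record neighboring partitions, as described in the notes above.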

12 FLAT: Querying Without Overlap
R-Tree seed query. 1) Seeding: find ANY ONE object. 2) Crawling: breadth-first search to retrieve the remaining results.
Seeding is very fast and could potentially be substituted with other techniques such as a KD-Tree or grid hashing; we chose an R-Tree because it performs well when the data does not fit into memory, whereas KD-Trees and grid hashes are memory-based. Seeding returns zero results only when the query region is empty, in which case crawling is not required. In practice seeding is fast, fewer than 10 page reads. The worst case is an empty query region, which requires looking up all R-Tree MBRs that fall within it; in practice, however, there are very few MBRs in such regions precisely because they contain no data. Data skew is a potential problem for seeding. Seeding & Crawling Avoid the Overlap Overhead.
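A hedged sketch of the two query phases follows. The seed lookup is abstracted behind seed_index.find_one(), an assumed interface (in FLAT an R-Tree over partition MBRs plays this role), and partitions are assumed to expose .objects and .neighbors, the links created during index building.

```python
from collections import deque

def flat_range_query(seed_index, query_box, intersects):
    seed_partition = seed_index.find_one(query_box)   # 1) SEEDING: any one hit
    if seed_partition is None:                        # empty query region
        return []
    results, visited = [], {seed_partition}
    queue = deque([seed_partition])
    while queue:                                      # 2) CRAWLING: BFS over links
        part = queue.popleft()
        results.extend(o for o in part.objects if intersects(o.mbr, query_box))
        for nbr in part.neighbors:
            # Only cross into neighbours that still intersect the query range.
            if nbr not in visited and intersects(nbr.mbr, query_box):
                visited.add(nbr)
                queue.append(nbr)
    return results
```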

13 FLAT: Performance Evaluation
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Range queries: 500 uniform random queries per experiment. Spatial density increases with dataset size. 7.8x speedup: FLAT Decouples Execution Time from Density.

14 Trend is “FLAT”, Scales With Density
FLAT: Scalability. Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Range queries: 500 uniform random queries per experiment. FLAT's curve is essentially a horizontal line, with no dependence on data density. The bump at the beginning is the small seeding cost, which amortizes later on as result cardinality increases. The Trend is "FLAT": it Scales With Density.

15 Simulation Science Data Challenges
Spatial Modeling, Simulation, Observational Data, Post-Simulation Data, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration. Let me move on to interactive 3D exploration.

16 Analyzing Spatial Data
Spatial range query sequences along a guiding path: the arterial tree of the heart, the bronchial tree of the lung, a neural network. When given a neural network, neuroscientists frequently execute not just one query but a series of them. Guided Analysis is Ubiquitous in Scientific Applications.

17 Interactive Query Execution
Guiding paths are not known in advance; the query sequence is executed interactively, and the path is decided only after processing the results of each query (1st query, 2nd query, 3rd query, …). By their very nature these sequences are interactive: the disk retrieves query results while the CPU processes them, which creates a prefetching opportunity. Prediction: predict the location of the next query in the sequence and prefetch its data into a prefetch cache. Predictive Prefetching Hides the Data Retrieval Cost.

18 Predictive Prefetching
Existing techniques extrapolate past query locations: Exponentially Weighted Moving Average (EWMA), straight-line extrapolation, Hilbert prefetching. Evaluation on a neuroscience dataset with sequences of 25 queries, for both large and small query volumes. Not Efficient With Arbitrary Query Volumes!
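For concreteness, here is a minimal sketch of one of these baselines, EWMA extrapolation of query locations; the point representation and the smoothing factor alpha are assumptions for illustration, not parameters taken from the slides.

```python
def ewma_next_location(history, alpha=0.5):
    """Predict the next query centre from past displacement vectors."""
    if len(history) < 2:
        return history[-1] if history else None
    # Smooth the per-step displacement, giving recent moves more weight.
    trend = [b - a for a, b in zip(history[0], history[1])]
    for prev, curr in zip(history[1:], history[2:]):
        step = [c - p for p, c in zip(prev, curr)]
        trend = [alpha * s + (1 - alpha) * t for s, t in zip(step, trend)]
    return [c + t for c, t in zip(history[-1], trend)]

# Example: a query sequence drifting mostly along the x axis.
print(ewma_next_location([(0, 0, 0), (10, 1, 0), (21, 2, 0)]))
```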

19 SCOUT: Content Aware Prefetching
Key insight: use the content of previous queries! Approach: 1) inspect the query results, 2) identify the guiding path, 3) predict the next query using the guiding path. Our technique is fundamentally different in that it uses the results to figure out the guiding structure: inspect the results and rebuild the graph (generally applicable), identify the structure, then extrapolate the next query location and load the data incrementally. The major challenge is identifying the structure, with two goals: scalability and general applicability. We Need to Identify the Guiding Path.

20 SCOUT: Guiding Path Identification
Iterative Candidate Pruning. Key insight: the guiding path goes through all queries! Starting from a candidate set of paths, each query in the sequence (n, n+1, n+2) prunes candidates that do not pass through it, until the guiding path remains and the next query (n+3) can be predicted. A Longer Sequence → Better Prediction.
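A rough sketch of the pruning idea, assuming candidate guiding paths are modelled as ordered lists of 3D points; the real SCOUT identifies the structure from query results, so this is an abstraction for illustration only.

```python
def inside(point, box):
    lo, hi = box
    return all(lo[d] <= point[d] <= hi[d] for d in range(3))

def prune(candidates, query_box):
    """Keep only paths that pass through the newest query of the sequence."""
    return [p for p in candidates if any(inside(pt, query_box) for pt in p)]

def predict_next(candidates, last_query_box):
    """Follow the surviving path one step past the last query window."""
    if not candidates:
        return None
    path = candidates[0]                   # with enough queries this is usually unique
    hits = [i for i, pt in enumerate(path) if inside(pt, last_query_box)]
    if not hits:
        return None
    idx = hits[-1]
    return path[idx + 1] if idx + 1 < len(path) else None
```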

21 SCOUT: Prediction Accuracy
Dataset: 100K neurons, 450 million 3D cylinders, 27 GB on disk. Two visualization query sequences of length 32, with query volumes of 20K [μm³] and 80K [μm³]; the measured speedups are 2x and 14.7x. SCOUT speeds up sequences by up to 14.7x and achieves 72% - 91% prediction accuracy. Cache hit rate = (amount of data retrieved from the cache / total amount of data retrieved) × 100.

22 Simulation Science Data Challenges
Spatial Modeling, Simulation, Observational Data, Post-Simulation Data, Static 3D Exploration, Interactive 3D Exploration, Dynamic 3D Exploration. Let me move on to dynamic 3D exploration.

23 Efficient Spatial Queries on Dynamic Dataset
Dynamic Exploration: monitoring memory-resident spatial mesh models across time steps 1, 2, 3, … Dynamic exploration is needed to monitor mesh models residing in main memory during the simulation. The workload alternates between simulating a time step and monitoring it, producing massive updates but few queries. We Need Efficient Spatial Queries on a Dynamic Dataset.

24 Static Spatial Indexes
State of the art: moving-object indexes (TPR-Tree, STRIPES), static spatial indexes (R-Tree, LUR-Tree, QU-Trade) and linear scan, ranging from coarse-grained to fine-grained. Mesh movement is inherently unpredictable. Neither Scales with Size nor Detail!

25 Update Oblivious Query Execution
OCTOPUS key insight: use the mesh connectivity to retrieve query results! A range query is answered by crawling the mesh itself, regardless of the time step (1, 2, 3), which makes query execution update-oblivious. But What About Non-Convex Meshes?

26 OCTOPUS: Non-Convex Meshes
Surface Scan: in a non-convex mesh, crawling from a single seed cannot guarantee that all results are reached (no reachability!). Instead, scan the entire surface and start crawling from all surface elements inside the query range. Using the Mesh Surface Guarantees Accuracy.
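A hedged sketch of this surface-seeded crawling follows, assuming mesh elements expose .position and .neighbors attributes; it illustrates the idea rather than the actual OCTOPUS implementation.

```python
from collections import deque

def octopus_range_query(surface_elements, query_box, inside):
    # Seed from every surface element inside the range, then crawl connectivity.
    seeds = [e for e in surface_elements if inside(e.position, query_box)]
    results, visited = [], set(seeds)
    queue = deque(seeds)
    while queue:
        elem = queue.popleft()
        results.append(elem)
        for nbr in elem.neighbors:
            # Follow mesh connectivity, but never leave the query range.
            if nbr not in visited and inside(nbr.position, query_box):
                visited.add(nbr)
                queue.append(nbr)
    return results
```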

27 OCTOPUS: Mesh Deformation
As the mesh deforms across time steps 1, 2, 3, the connectivity graph changes, yet maintaining the surface comes at zero cost. OCTOPUS Scales With Massive Updates.

28 Scales with Mesh Resolution
OCTOPUS: Mesh Detail. As the mesh is refined, surface points increase quadratically while non-surface (interior) points increase cubically. Scalability: the surface grows more slowly than the volume (and therefore than the dataset size)! OCTOPUS Scales with Mesh Resolution.
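A back-of-the-envelope argument for why this scales (my own illustration, not taken from the slides):

```latex
% For a mesh refined to roughly n points per axis, interior points grow
% cubically while surface points grow only quadratically, so the fraction
% of the mesh that must be scanned shrinks as resolution increases.
N_{\mathrm{volume}} \sim n^{3}, \qquad
N_{\mathrm{surface}} \sim n^{2}, \qquad
\frac{N_{\mathrm{surface}}}{N_{\mathrm{volume}}} \sim \frac{1}{n} \to 0
```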

29 OCTOPUS: Performance Evaluation
Dataset: neural tetrahedral mesh, 33 GB, memory-resident. Queries: 15 per time step, 60 time steps (900 queries in total). Hardware: 64-bit AMD Opteron, 2700 MHz, 48 GB RAM. 8x-24x Improvement.

30 Contributions
FLAT, SCOUT, OCTOPUS, TOUCH and GIPSY cover the workflow: Spatial Modeling, Simulation, Observational Data, Post-Simulation Data, Spatial Analysis, Model Validation. Key insight per approach: FLAT, a two-phase index; TOUCH, optimizes the number of comparisons; SCOUT, uses the query results; OCTOPUS, uses the data itself to avoid dealing with massive updates.

31 Impact
Blue Brain Project: part of the toolset used every day. February 2013: the first 10-million-neuron model was built (2.5 TB), still four orders of magnitude smaller than the human brain; model sizes have grown steadily over 2006, 2008 and 2010. General applicability: material sciences, astronomy, geographical information systems.

32 Conclusions Enabling data exploration is key to scientific discovery.
Prior spatial access methods do not scale with data growth. Use spatial connectivity to achieve scalability: explicitly added (FLAT & TOUCH) or implicitly present in the dataset (OCTOPUS & SCOUT).

33 Thank You!

