Yicheng Tu, § Shaoping Chen, §¥ and Sagar Pandit § § University of South Florida, Tampa, Florida, USA ¥ Wuhan University of Technology, Wuhan, Hubei, China.


Computing Distance Histograms Efficiently in Scientific Databases
Yicheng Tu,§ Shaoping Chen,§¥ and Sagar Pandit§
§ University of South Florida, Tampa, Florida, USA
¥ Wuhan University of Technology, Wuhan, Hubei, China

Particle simulations
Major tool in computational sciences
Simulates a system of N entities behaving as classical particles
– Atoms
– Molecules
– Stars
– Galaxies
N is large (millions to billions)

Particle Simulation in
– Astronomy
– Biology/chemistry: molecular simulation (MS)

Querying MS data
Point queries are not very helpful
Analytical queries are the mainstream
– Aggregates are not enough
– Some require superlinear processing time
The m-body correlation functions need $O(N^m)$ computations if done in a brute-force way
We study spatial distance histograms (SDH)
SDH is critical for analyzing key system states
SDH is a 2-body correlation: can we beat $O(N^2)$?

Problem Statement
Given the coordinates of N points, draw a histogram of all pairwise distances
– The total number of distances counted is N(N-1)/2
We focus on the standard SDH
– Domain of distances: $[0, L_{max}]$
– Buckets all have the same width: [0, p), [p, 2p), …
– The query has a single parameter: the bucket width p of the histogram, or equivalently the total number of buckets $l = L_{max}/p$
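For reference, the brute-force baseline implied by this problem statement can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `brute_force_sdh`, the sample points, and the choice to place a distance exactly equal to the domain maximum into the last bucket are all assumptions.

```python
import math
from itertools import combinations

def brute_force_sdh(points, p, l_max):
    """Count all N(N-1)/2 pairwise distances into buckets [0,p), [p,2p), ..."""
    num_buckets = math.ceil(l_max / p)          # l = L_max / p
    hist = [0] * num_buckets
    for a, b in combinations(points, 2):        # every unordered pair once
        d = math.dist(a, b)                     # Euclidean distance (Python 3.8+)
        # Assumption: a distance equal to l_max is counted in the last bucket.
        i = min(int(d // p), num_buckets - 1)
        hist[i] += 1
    return hist

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
hist = brute_force_sdh(points, p=1.0, l_max=5.0)
print(hist)  # [0, 3, 0, 0, 3]
```

The double loop over pairs is what makes this baseline quadratic in N.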

Our approach
Main idea: avoid calculating pairwise distances
Observation:
– Two groups of points can be processed in one shot (constant time) if the range of all inter-group distances falls into a single histogram bucket
– We say these two groups are resolved into that bucket
Example from the figure: two groups of 20 and 15 points resolve into bucket i, so Histogram[i] += 20*15;
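The resolvability test can be sketched for groups bounded by axis-aligned cells. This is an illustrative sketch only: the box representation `(lows, highs)` and the names `distance_range` and `resolve` are assumptions, not taken from the slides. The example cells are chosen so that the distance range is [√10, √34], one of the ranges shown on the case-study slide.

```python
import math

def distance_range(box_a, box_b):
    """Min and max distance between any point of box_a and any point of box_b.
    Each box is (lows, highs), e.g. ((x0, y0), (x1, y1))."""
    (alo, ahi), (blo, bhi) = box_a, box_b
    min_sq = max_sq = 0.0
    for a0, a1, b0, b1 in zip(alo, ahi, blo, bhi):
        gap = max(b0 - a1, a0 - b1, 0.0)   # 0 when the projections overlap
        span = max(a1 - b0, b1 - a0)       # farthest separation on this axis
        min_sq += gap * gap
        max_sq += span * span
    return math.sqrt(min_sq), math.sqrt(max_sq)

def resolve(box_a, count_a, box_b, count_b, p, hist):
    """If every inter-group distance falls into one bucket, add
    count_a * count_b to that bucket in constant time."""
    lo, hi = distance_range(box_a, box_b)
    i = int(lo // p)
    if i == int(hi // p) and i < len(hist):
        hist[i] += count_a * count_b
        return True
    return False                           # not resolvable at this level

cell_a = ((0, 0), (1, 1))                  # holds 20 points
cell_b = ((4, 2), (5, 3))                  # holds 15 points
hist = [0, 0, 0]
resolved = resolve(cell_a, 20, cell_b, 15, 3.0, hist)
print(resolved, hist)  # True [0, 300, 0]
```

With bucket width 3, the whole range [√10, √34] lies inside [3, 6), so all 300 distances are counted without computing any of them.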

The DM-SDH Algorithm
Organize all data into a Quad-tree (2D data) or Oct-tree (3D data)
Cache the atom count of each tree node
– Density map: all counts in one tree level
The algorithm:
start from one proper density map M_0
FOR every pair of nodes A and B in M_0
    resolve A and B
    IF A and B are not resolvable THEN
        FOR each child node A' of A
            FOR each child node B' of B
                resolve A' and B'
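A compact 2D sketch of this recursion is given below. It is a simplified illustration rather than the authors' implementation, and several choices are assumptions: it starts from the tree root instead of a chosen density map M_0, it stores the point list in every node for brevity, it handles pairs inside a single node explicitly, and it assumes all points lie in the square [0, size] x [0, size]. All names (`Node`, `resolve_pair`, `resolve_within`, `dm_sdh`) are hypothetical.

```python
import math
from itertools import combinations, product

class Node:
    """Quad-tree node covering the square from lo to hi."""
    def __init__(self, lo, hi, points, depth):
        self.lo, self.hi, self.points = lo, hi, points
        self.children = []
        if depth == 0 or len(points) <= 1:
            return                                  # leaf node
        mx, my = (lo[0] + hi[0]) / 2, (lo[1] + hi[1]) / 2
        quads = [[], [], [], []]
        for q in points:                            # quadrant index 0..3
            quads[(q[0] >= mx) + 2 * (q[1] >= my)].append(q)
        xs, ys = (lo[0], mx, hi[0]), (lo[1], my, hi[1])
        for k, pts in enumerate(quads):
            if pts:                                 # skip empty cells
                i, j = k % 2, k // 2
                self.children.append(
                    Node((xs[i], ys[j]), (xs[i + 1], ys[j + 1]), pts, depth - 1))

def bucket(d, p, hist):
    return min(int(d // p), len(hist) - 1)

def dist_range(a, b):
    """Min and max possible distance between points of cells a and b."""
    mn = mx = 0.0
    for k in range(2):
        gap = max(b.lo[k] - a.hi[k], a.lo[k] - b.hi[k], 0.0)
        span = max(a.hi[k] - b.lo[k], b.hi[k] - a.lo[k])
        mn, mx = mn + gap * gap, mx + span * span
    return math.sqrt(mn), math.sqrt(mx)

def resolve_pair(a, b, p, hist):
    lo, hi = dist_range(a, b)
    i = bucket(lo, p, hist)
    if i == bucket(hi, p, hist):                    # resolvable: one shot
        hist[i] += len(a.points) * len(b.points)
    elif a.children and b.children:                 # try the next density map
        for ca, cb in product(a.children, b.children):
            resolve_pair(ca, cb, p, hist)
    else:                                           # leaf level: compute distances
        for u, v in product(a.points, b.points):
            hist[bucket(math.dist(u, v), p, hist)] += 1

def resolve_within(a, p, hist):
    """Count the pairs whose two points both lie inside node a."""
    if a.children:
        for c in a.children:
            resolve_within(c, p, hist)
        for ca, cb in combinations(a.children, 2):
            resolve_pair(ca, cb, p, hist)
    else:
        for u, v in combinations(a.points, 2):
            hist[bucket(math.dist(u, v), p, hist)] += 1

def dm_sdh(points, size, p, depth=4):
    hist = [0] * math.ceil(size * math.sqrt(2) / p)
    resolve_within(Node((0.0, 0.0), (size, size), points, depth), p, hist)
    return hist
```

Calling `dm_sdh(points, size=8.0, p=1.0)` should return the same histogram as an exhaustive pairwise count over the same points, while skipping the distance computation for every node pair that resolves.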

A Case Study
Figure: example node pairs with distance ranges [0, 2√2] and [√10, √34]
Distance calculations are needed for the irresolvable nodes on the leaf level!

Is DM-SDH efficient (asymptotically)?
Quantitative analysis of DM-SDH
Based on a geometric modeling approach
The main result is stated in terms of α(m). Notes:
1. α(m) is the percentage of pairs of nodes that are NOT resolvable on level m of the quad(oct)-tree
2. We managed to derive a closed form for α(m)
3. α(m) decreases exponentially as more density maps are visited

From this, we have
Theorem 1: Let d be the dimension of the data; then the time spent on resolving nodes in DM-SDH is $O(N^{(2d-1)/d})$
Theorem 2: The time spent on calculating distances between atoms in leaf nodes that are not resolvable is also $O(N^{(2d-1)/d})$

Therefore,
Theorem 3: the time complexity of DM-SDH is
– $O(N^{1.5})$ for 2D data
– $O(N^{5/3})$ for 3D data
– The theorem holds for all reasonably distributed data (most scientific data fall into this category)
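To give a rough sense of scale for the 2D bound, the comparison below takes an assumed system size of $N = 10^6$ and ignores the constant factors that asymptotic notation hides, so it is an order-of-magnitude illustration and not a measured speedup.

```latex
\begin{align*}
\text{brute force: } \frac{N(N-1)}{2} &\approx \frac{(10^{6})^{2}}{2} = 5 \times 10^{11} \text{ distance computations} \\
\text{2D bound: } N^{1.5} &= (10^{6})^{1.5} = 10^{9} \\
\text{ratio: } \frac{N^{2}/2}{N^{1.5}} &= \frac{\sqrt{N}}{2} = 500
\end{align*}
```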

Experimental verification (2D data)

Experimental verification (3D data)

Faster algorithms
Is the superlinear running time of DM-SDH still not good enough for large N?
Our solution: approximate algorithms based on our analytical model
– Time: stop before we reach the leaf nodes
– Approximation: for irresolvable nodes, distribute the distance counts into the overlapping buckets heuristically
– Correctness: consult the table we generate from the model

What our geometric model says …
Example: with l = 128 and a correctness guarantee of 97%, we have m = 5, i.e., we only need to visit 5 levels of the tree (starting from $M_0$)

How much time is needed?
Given an error bound ε, the model gives the number of levels to visit
No time is spent on distance calculation
The time to resolve nodes in the tree depends on I, the number of node pairs in $M_0$
This time is independent of the system size N!

Irresolvable nodes
Not resolvable = the range of distances overlaps with multiple buckets
Make a guess in distributing the distance counts into these buckets
Several heuristics:
1. Left(right)most only
2. Even distribution
3. Proportional distribution
Figure: a distance range overlapping bucket i, bucket i+1, and bucket i+2
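The three heuristics can be sketched as below. This is one possible reading, not the authors' code: in particular, "proportional" is assumed here to mean proportional to the length of the overlap between the distance range and each bucket, and the function name `distribute` and its parameters are hypothetical.

```python
def distribute(count, lo, hi, p, hist, heuristic="proportional"):
    """Spread `count` distances, known only to lie in [lo, hi],
    over the buckets of width p that the range overlaps."""
    last = len(hist) - 1
    first_b = min(int(lo // p), last)
    last_b = min(int(hi // p), last)
    n = last_b - first_b + 1                 # number of overlapped buckets
    if heuristic == "leftmost":
        hist[first_b] += count
    elif heuristic == "rightmost":
        hist[last_b] += count
    elif heuristic == "even" or n == 1:
        for b in range(first_b, last_b + 1):
            hist[b] += count / n
    else:                                    # "proportional" (assumed meaning)
        for b in range(first_b, last_b + 1):
            left = max(lo, b * p)
            right = hi if b == last_b else (b + 1) * p
            hist[b] += count * (right - left) / (hi - lo)

hist = [0.0] * 4
distribute(300, 0.5, 2.5, 1.0, hist, "proportional")
print(hist)  # [75.0, 150.0, 75.0, 0.0]
```

In this example the range [0.5, 2.5] covers half of bucket 0, all of bucket 1, and half of bucket 2, so the 300 distance counts are split 75 / 150 / 75.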

Experiments on approximate algorithms
Results are reported using a measure defined over $h_i$, the count of bucket i in the correct histogram, and $h'_i$, the count of bucket i in the obtained histogram

What's next?
The error bound given by consulting the table is loose
– Our heuristics work well → much of the 1-ε distances will be placed into the right bucket
– What's the tight bound?
Continuous SDH query processing
– SDH is often computed for a (large) number of consecutive frames
– Save time by taking advantage of redundancy among neighboring frames

Summary
The distance histogram is an important query in simulation databases
We propose an algorithm based on a simple quad-tree-based data structure
Analysis of the algorithm shows it outperforms the brute-force approach
We develop an approximate algorithm with a guaranteed error bound and very low time complexity
Further investigation of the approximate algorithm may yield a more practical fast solution