Selectivity Estimation for Optimizing Similarity Query in Multimedia Databases IDEAL 2003 Paper review.


Query optimization in traditional databases
- Query: find the employees whose age lies in a given range and who work for the Engineering Faculty
- The running time of different execution plans depends on:
  - the number of employees in the age range
  - the number of employees who work for the Engineering Faculty
- Task: estimate these numbers in advance and select the best execution plan (selectivity estimation)
- Statistics are stored in the database (metadata)

Techniques: one dimension
- Parametric – unrealistic
- Curve fitting – negative value problem
- Sampling – large overhead
- Non-parametric (histogram technique) – widely used

Problem in multimedia databases
- Example predicate: (Color = ‘red’) ^ (Shape = ‘round’)
- Color and shape are feature vectors, so the data is multi-dimensional
- The number of buckets increases exponentially with the dimension:
  - 1d – 5
  - 2d – 25
  - 3d – 125
  - 4d – 625
- The histogram technique fails
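The bucket counts on this slide (5, 25, 125, 625) correspond to 5 buckets per dimension, so a d-dimensional grid histogram needs 5^d buckets. A minimal Python sketch of that growth follows; the function name `bucket_count` is illustrative and does not come from the paper.

```python
def bucket_count(buckets_per_dim: int, dims: int) -> int:
    """Number of buckets in a grid histogram with the same
    number of buckets along every dimension."""
    return buckets_per_dim ** dims


# Reproduce the figures quoted on the slide (5 buckets per dimension).
for d in range(1, 5):
    print(f"{d}d: {bucket_count(5, d)} buckets")
```

At 10 dimensions the same grid would already need 5^10 = 9,765,625 buckets, which is why a plain histogram is impractical for feature vectors.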

Previous Work – SIGMOD 99
- Use the DCT to compress the information in the histogram
- 2D example
- Store the DCT coefficients
[Figure: histogram values → DCT → DCT coefficients]

Reconstruction of histogram values
[Figure: DCT → zone sampling → IDCT]
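The compress-and-reconstruct pipeline (DCT, zone sampling, IDCT) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the paper's implementation: the orthonormal DCT-II is used, the histogram is assumed square, the kept zone is assumed triangular (u + v <= bound), and all function names and the sample histogram are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C (n x n); its transpose is its inverse."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2(hist):
    """2D DCT of a square histogram: C * H * C^T."""
    c = dct_matrix(len(hist))
    return matmul(matmul(c, hist), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT: C^T * K * C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def zone_sample(coeffs, bound):
    """Keep low-frequency coefficients with u + v <= bound (assumed
    triangular zone) and zero out the rest."""
    return [[value if u + v <= bound else 0.0
             for v, value in enumerate(row)]
            for u, row in enumerate(coeffs)]


# Hypothetical 4 x 4 histogram of bucket counts.
hist = [[8.0, 6.0, 2.0, 0.0],
        [6.0, 5.0, 1.0, 0.0],
        [2.0, 1.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0]]

coeffs = dct2(hist)               # compress: transform ...
kept = zone_sample(coeffs, 2)     # ... and store only 6 of 16 coefficients
approx = idct2(kept)              # reconstruct approximate bucket counts

for row in approx:
    print([round(x, 2) for x in row])
```

Because the DC coefficient is always inside the zone, the reconstructed histogram keeps the same total count as the original; only the distribution across buckets is approximated. With SciPy available, `scipy.fft.dctn` and `scipy.fft.idctn` with `norm="ortho"` could replace the hand-written transforms.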

Selectivity estimation

Current Work – IDEAL 2003
- Extend the range query from a hyper-cube to a hyper-sphere
- Model the hyper-sphere as a combination of hyper-cubes
- Task:
  - Find a combination of hyper-cubes to represent the hyper-sphere
  - Find the area of overlap

Generating the combination of hyper-cubes

Overlap of a hyper-cube with a hyper-sphere
- Monte Carlo method:
  - Generate uniformly distributed random points inside the hyper-cube
  - Count the number of points that fall within the hyper-sphere
  - Use the ratio to estimate the area of overlap
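The Monte Carlo steps above can be sketched as follows. This is a minimal illustration, assuming an axis-aligned hyper-cube given by its lower and upper corners; the function name `overlap_fraction`, its parameters, and the sample count are hypothetical choices, not values from the paper.

```python
import math
import random


def overlap_fraction(cube_low, cube_high, center, radius,
                     n_samples=100_000, seed=0):
    """Estimate the fraction of an axis-aligned hyper-cube that lies
    inside a hyper-sphere, by uniform sampling inside the cube."""
    rng = random.Random(seed)
    radius_sq = radius * radius
    inside = 0
    for _ in range(n_samples):
        # One uniformly distributed point inside the hyper-cube.
        point = [rng.uniform(lo, hi) for lo, hi in zip(cube_low, cube_high)]
        # Count it if it also falls within the hyper-sphere.
        if sum((x - c) ** 2 for x, c in zip(point, center)) <= radius_sq:
            inside += 1
    return inside / n_samples


# Unit square against a unit circle centred on one of its corners:
# the exact overlap is a quarter disc, i.e. pi / 4 of the square.
estimate = overlap_fraction([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], 1.0)
print(f"estimated {estimate:.4f}, exact {math.pi / 4:.4f}")
```

Multiplying the returned fraction by the volume of the hyper-cube gives the estimated overlap volume; the error shrinks roughly with the square root of the number of samples.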

Generating uniformly distributed points inside a hyper-sphere
- Accept/reject method:
  - Generate points within the hyper-cube
  - Accept those that fall within the hyper-sphere
- Greedy method:
  - Generate θ uniformly in [0, 2π]
  - Generate r according to F^-1(U(0,1))
[Figure: a point in polar coordinates θ and r]
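Both methods can be sketched for the 2D case (a disc). The slide does not spell out F; the sketch assumes it is the CDF of the radius of a uniform point in a disc of radius R, F(r) = r²/R², which gives F^-1(u) = R·sqrt(u). Function names are illustrative only.

```python
import math
import random


def sample_disc_rejection(radius, rng):
    """Accept/reject: draw from the bounding square until the point
    falls inside the disc."""
    while True:
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            return x, y


def sample_disc_inverse_cdf(radius, rng):
    """Polar method from the slide: theta uniform in [0, 2*pi],
    r = F^-1(U(0,1)) with the assumed F(r) = r^2 / R^2."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    r = radius * math.sqrt(rng.random())
    return r * math.cos(theta), r * math.sin(theta)


rng = random.Random(42)
points = [sample_disc_inverse_cdf(1.0, rng) for _ in range(20_000)]

# For a uniform distribution, a quarter of the points should lie
# within half the radius (area ratio 0.5^2 = 0.25).
inner = sum(1 for x, y in points if x * x + y * y <= 0.25)
print(f"fraction within half radius: {inner / len(points):.3f}")
```

Drawing r uniformly instead of through the square root would crowd points near the centre. The rejection method wastes no more than about 21% of its draws in 2D, but its acceptance rate drops quickly as the dimension grows, which is the motivation for a direct method.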

Experiment