FastMap: Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets


Abstract
• Describes a fast algorithm that maps objects into points in some k-dimensional space such that the dissimilarities are preserved.

Abstract
• We can then use fine-tuned spatial access methods (SAMs) to answer queries such as "query by example" or "all pairs query".

Introduction
• It is not easy to design k feature-extraction functions that map objects to k-dimensional points
• For instance, for typed English words, what distance function should we use to capture the effort of transforming one string into the other?

Solutions
• Old: Multi-Dimensional Scaling (MDS)
  • Unsuitable for indexing
• Proposed: a fast algorithm (FastMap)
  • Much faster
  • Allows indexing

Applications
• Image and multimedia databases
• Medical databases

Applications
• String databases, e.g. OCR
• Time series, e.g. financial data

Applications
• Data mining and visualization applications

Desirable types of queries
• Query-by-example: search a collection of objects to find the ones that are within a user-defined distance ε of the query object
• All-pairs query: find the pairs of objects that are within distance ε of each other
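As a baseline for comparison, both query types can be answered by brute force with nothing but the distance function. The sketch below is illustrative only: the names `range_query` and `all_pairs_query`, the tolerance `eps`, and the inclusive `<=` comparison are assumptions, not part of the slides.

```python
from itertools import combinations


def range_query(query, objects, dist, eps):
    """Query-by-example: indices of objects within distance eps of the query."""
    return [i for i, obj in enumerate(objects) if dist(query, obj) <= eps]


def all_pairs_query(objects, dist, eps):
    """All-pairs query: index pairs (i, j), i < j, within distance eps."""
    return [
        (i, j)
        for i, j in combinations(range(len(objects)), 2)
        if dist(objects[i], objects[j]) <= eps
    ]


if __name__ == "__main__":
    # Toy data: numbers on a line, with absolute difference as the distance.
    data = [0.0, 1.0, 5.0, 6.0]
    absdiff = lambda a, b: abs(a - b)
    print(range_query(0.5, data, absdiff, 1.0))
    print(all_pairs_query(data, absdiff, 1.0))
```

The range query evaluates the distance function N times and the all-pairs query evaluates it once per pair, which is the cost that mapping objects to points and indexing them with a SAM is meant to avoid.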

Benefits of mapping objects
• Speeds up query processing by employing SAMs such as R*-trees and z-ordering
• Helps with visualization, clustering and data mining

An ideal mapping is ...
• Fast to compute: O(N) or O(N log N), but not O(N^2)
• Distance-preserving, with little discrepancy
• Very fast at mapping a new object

MDS
• Used to discover the underlying (spatial) structure of a set of data items from the (dis)similarity information
• Maps objects to a k-dimensional space so as to minimize the stress function

MDS
• Stress function: a normalized measure of the difference between the distances of the "images" and the actual distances
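One common way to make the stress function concrete, and the form used in the FastMap paper, is $\text{stress} = \sqrt{\sum_{i,j} (\hat{d}_{i,j} - d_{i,j})^2 / \sum_{i,j} d_{i,j}^2}$, where $d_{i,j}$ is the actual distance between objects $O_i$ and $O_j$ and $\hat{d}_{i,j}$ is the distance between their images. A minimal sketch, assuming both sets of distances are given as square matrices (the function name `stress` and the list-of-lists layout are illustrative choices):

```python
import math


def stress(actual, embedded):
    """Relative error between actual distances and distances of the images.

    actual, embedded: N x N matrices (lists of lists) of pairwise distances.
    """
    n = len(actual)
    num = 0.0  # sum of squared differences
    den = 0.0  # sum of squared actual distances
    for i in range(n):
        for j in range(n):
            num += (embedded[i][j] - actual[i][j]) ** 2
            den += actual[i][j] ** 2
    return math.sqrt(num / den)


if __name__ == "__main__":
    actual = [[0.0, 2.0], [2.0, 0.0]]
    embedded = [[0.0, 1.0], [1.0, 0.0]]
    print(stress(actual, embedded))
```

A stress of 0 means the mapping preserves every distance exactly. The sketch assumes at least one actual distance is non-zero.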

Drawbacks of MDS
• Requires O(N^2) time, which is impractical for large databases
• Fast retrieval is questionable, as MDS is not designed for "query-by-example" operations

Definitions
• The k-d point $P_i$ that corresponds to the object $O_i$ will be called the "image" of object $O_i$. That is, $P_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,k})$
• The k-d space containing the "images" will be called the target space

Proposed algorithm
• Assumption: a domain expert has provided us only with a distance/dissimilarity function $D(*, *)$
• For instance, the Euclidean distance between two feature vectors can serve as the distance function between the corresponding objects

Proposed algorithm
• Pretend that the objects are indeed points in some unknown n-dimensional space, and try to project these points onto k mutually orthogonal directions
• The challenge is to compute these projections from the distance matrix only

Proposed algorithm
• Project the objects onto a carefully selected "line"
• Choose two objects $O_a$ and $O_b$ as the "pivot objects" that define this line

Proposed algorithm
• Compute the position of each object along the line through the pivots using only the information we have, i.e. the distances between objects

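Proposed algorithm
[Figure: triangle $O_a O_i O_b$; the projection of $O_i$ onto the line $O_a O_b$ lies at distance $x_i$ from $O_a$]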

Proposed algorithm
• By the cosine law, in any triangle $O_a O_i O_b$: $d_{b,i}^2 = d_{a,i}^2 + d_{a,b}^2 - 2 x_i d_{a,b}$
• $d_{i,j}$ is shorthand for the distance $D(O_i, O_j)$

Proposed algorithm
• Solving for $x_i$: $x_i = \dfrac{d_{a,i}^2 + d_{a,b}^2 - d_{b,i}^2}{2 d_{a,b}}$
• We can thus map objects into points on a line, preserving some of the distance information
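The projection formula needs only three distances per object. A minimal sketch, assuming Euclidean points purely to generate test distances (the function name `project_on_line` and the sample coordinates are illustrative):

```python
import math


def project_on_line(d_ai, d_bi, d_ab):
    """Coordinate x_i of object O_i on the line through pivots O_a and O_b.

    Rearranged cosine law: x_i = (d_ai^2 + d_ab^2 - d_bi^2) / (2 * d_ab).
    Requires d_ab > 0, i.e. two distinct pivots.
    """
    return (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2.0 * d_ab)


if __name__ == "__main__":
    # Pivots on the horizontal axis, so the true projection of O_i is its
    # first coordinate, which makes the result easy to check by eye.
    o_a, o_b, o_i = (0.0, 0.0), (4.0, 0.0), (1.0, 2.0)
    x_i = project_on_line(
        math.dist(o_a, o_i), math.dist(o_b, o_i), math.dist(o_a, o_b)
    )
    print(x_i)
```

The function never sees the coordinates of the objects, only their pairwise distances, which is what lets the method work when a distance function is all that is available.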

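Proposed algorithm
• The projection onto a line solves the problem for a single axis
• Extend to higher dimensions by repeating the projection on the distance information that the previous axes have not yet captured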

Proposed algorithm
• Each of the k recursive calls determines the coordinates of the N objects on one new axis
• The "pivot objects" of each recursive call are recorded to facilitate queries
• The pivot objects are chosen by a heuristic algorithm
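The slides do not spell out the pivot heuristic. The FastMap paper describes one that looks for two objects far apart from each other: start from an arbitrary object, jump to the object farthest from it, then to the object farthest from that one, and repeat a small constant number of times. A minimal sketch under those assumptions (the function name, the `iterations=5` default, and the `start` parameter are illustrative choices):

```python
def choose_distant_objects(objects, dist, iterations=5, start=0):
    """Heuristically pick two objects that are far apart.

    Each iteration scans all objects twice, so the cost is linear in N
    for a constant number of iterations.
    Returns the indices (a, b) of the two pivot objects.
    """
    indices = range(len(objects))
    a = b = start
    for _ in range(iterations):
        # Farthest object from the current second pivot ...
        a = max(indices, key=lambda i: dist(objects[i], objects[b]))
        # ... and then the farthest object from that one.
        b = max(indices, key=lambda i: dist(objects[i], objects[a]))
    return a, b


if __name__ == "__main__":
    data = [0.0, 3.0, 10.0, 4.0]
    print(choose_distant_objects(data, lambda p, q: abs(p - q), start=1))
```

The result is not guaranteed to be the farthest pair overall, which would require examining all pairs. The heuristic trades that guarantee for linear running time.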

Proposed algorithm
• All steps are linear in N
• The overall complexity is O(Nk)
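Putting the pieces together, the whole procedure can be sketched as a loop over the k axes. This is an illustrative sketch, not the authors' code. It assumes the residual-distance rule given in the FastMap paper, $D'(O_i, O_j)^2 = D(O_i, O_j)^2 - (x_i - x_j)^2$, applied once per axis already computed. It clamps negative residuals to zero for non-Euclidean inputs and replaces the recursion with iteration. All function and parameter names are hypothetical.

```python
import math


def fastmap(objects, dist, k, pivot_iterations=5):
    """Map N objects to k-d points using only the distance function dist.

    Returns (coords, pivots): coords[i] is the image of objects[i], and
    pivots[c] = (a, b, d_ab) records the pivot indices and their residual
    distance for axis c, so that new objects can be mapped later.
    """
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]
    pivots = []

    def residual_sq(i, j, col):
        # Squared distance left over after removing the first `col` axes.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(col):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)  # clamp rounding error / non-Euclidean input

    for col in range(k):
        # Heuristic pivot choice: repeatedly jump to the farthest object.
        a = b = 0
        for _ in range(pivot_iterations):
            a = max(range(n), key=lambda i: residual_sq(i, b, col))
            b = max(range(n), key=lambda i: residual_sq(i, a, col))
        d_ab2 = residual_sq(a, b, col)
        if d_ab2 == 0.0:
            break  # nothing left to capture; remaining coordinates stay 0
        d_ab = math.sqrt(d_ab2)
        pivots.append((a, b, d_ab))
        for i in range(n):
            coords[i][col] = (
                residual_sq(i, a, col) + d_ab2 - residual_sq(i, b, col)
            ) / (2.0 * d_ab)
    return coords, pivots


def map_query(query, objects, dist, coords, pivots):
    """Map a new object using the recorded pivots (two distances per axis)."""
    q = [0.0] * len(coords[0])
    for col, (a, b, d_ab) in enumerate(pivots):
        def res_sq(p):
            d2 = dist(query, objects[p]) ** 2
            for c in range(col):
                d2 -= (q[c] - coords[p][c]) ** 2
            return max(d2, 0.0)
        q[col] = (res_sq(a) + d_ab ** 2 - res_sq(b)) / (2.0 * d_ab)
    return q


if __name__ == "__main__":
    points = [(0.0, 0.0), (4.0, 0.0), (1.0, 2.0), (3.0, 3.0)]
    images, piv = fastmap(points, math.dist, k=2)
    print(images)
    print(map_query((2.0, 1.0), points, math.dist, images, piv))
```

With a constant number of pivot iterations, the sketch calls `dist` O(N) times per axis, so O(Nk) times in total. The naive `residual_sq` adds O(k) arithmetic per call, which a tuned implementation could reduce by caching. Mapping a query object needs only 2k distance evaluations, which is why the pivots are recorded.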

Experiments
• Compare FastMap with MDS
  • Speed and quality
• Illustrate the visualization and clustering abilities
  • Real and synthetic datasets

Comparison with MDS
• Response time vs. database size N

Comparison with MDS
• Response time vs. number of dimensions k

Comparison with MDS
• Response time vs. stress

Clustering/visualization properties of FastMap

Conclusion
• A fast algorithm to map objects into points in k-d space
• Accelerates searching through highly optimized SAMs, e.g. R-trees and R*-trees
• Applications of the algorithm include multimedia databases, data mining, clustering and document retrieval

Reference
• Christos Faloutsos, King-Ip (David) Lin. FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets.
• Joseph B. Kruskal, Myron Wish. Multidimensional Scaling.