V. Megalooikonomou, Temple University Clustering and Partitioning for Spatial and Temporal Data Mining Vasilis Megalooikonomou Data Engineering Laboratory.

Presentation transcript:

Clustering and Partitioning for Spatial and Temporal Data Mining
Vasilis Megalooikonomou
Data Engineering Laboratory (DEnLab), Dept. of Computer and Information Sciences
Temple University, Philadelphia, PA

Outline
Introduction
  – Motivation
  – Problems: spatial domain, time domain
  – Challenges
Spatial data
  – Partitioning and clustering
  – Detection of discriminative patterns
  – Results
Temporal data
  – Partitioning
  – Vector quantization
  – Results
Conclusions and discussion

Introduction
– Large spatial and temporal databases
– Meta-analysis of data pooled from multiple studies
– Goal: to understand patterns and discover associations, regularities, and anomalies in spatial and temporal data

Problem
Spatial data mining: given a large collection of spatial data (e.g., 2D or 3D images) and other data, find interesting things, i.e.:
– associations among image data, or among image and non-image data
– discriminative areas among groups of images
– rules/patterns
– images similar to a query image (queries by content)

Challenges
– How to apply data mining techniques to images?
– Learning from images directly
– Heterogeneity and variability of image data
– Preprocessing (segmentation, spatial normalization, etc.)
– Exploration of high correlation between neighboring objects
– Large dimensionality
– Complexity of associations
– Efficient management of topological/distance information
– Spatial knowledge representation / Spatial Access Methods (SAMs)

Example: Association Mining – Spatial Data
Discover associations among spatial and non-spatial data:
– Images {i_1, i_2, …, i_L}
– Spatial regions {s_1, s_2, …, s_K}
– Non-spatial variables {c_1, c_2, …, c_M}
[Figure: images i_1 … i_7 labeled with non-spatial variables such as c_1, c_2, c_3, c_6, c_7, c_9]

Example: fMRI contrast maps
[Figure: fMRI contrast maps for a control and a patient]

Applications
Medical imaging, bioinformatics, geography, meteorology, etc.

Voxel-based Analysis
– No model on the image data
– Each voxel's changes are analyzed independently; a map of statistical significance is built
– Discriminatory significance is measured by statistical tests (t-test, ranksum test, F-test, etc.)
– Statistical Parametric Mapping (SPM)
– Significance of associations is measured by the chi-squared test or Fisher's exact test (a contingency table for each pair of variables)
– Cluster voxels by findings
[V. Megalooikonomou, C. Davatzikos, E. Herskovits, SIGKDD 1999]
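As a rough illustration of the voxel-wise idea (every voxel tested independently, giving one statistic per voxel), here is a minimal sketch. It assumes flattened volumes and uses Welch's t statistic as the test; the function names and the toy data are hypothetical and are not taken from the slides.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def voxelwise_t_map(controls, patients):
    """controls/patients: one flattened volume (list of voxel values) per subject.
    Returns one t statistic per voxel, each computed independently of the others."""
    n_voxels = len(controls[0])
    return [
        welch_t([subj[v] for subj in controls], [subj[v] for subj in patients])
        for v in range(n_voxels)
    ]

# Toy data (made up): 3 subjects per group, 4 voxels; only voxel 2 differs.
controls = [[1.0, 2.0, 5.0, 0.5], [1.1, 2.1, 5.2, 0.4], [0.9, 1.9, 4.8, 0.6]]
patients = [[1.0, 2.1, 8.0, 0.5], [1.2, 1.9, 8.3, 0.6], [0.8, 2.0, 7.9, 0.4]]

t_map = voxelwise_t_map(controls, patients)
most_discriminative = max(range(len(t_map)), key=lambda v: abs(t_map[v]))
```

Note that the number of tests equals the number of voxels, which is the multiple comparison burden that the region-based methods in the following slides try to reduce.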

Analysis by grouping of voxels
– Grouping of voxels (atlas-based)
– Prior knowledge increases sensitivity
– Data reduction: 10^7 voxels → R regions (structures)
– Map a ROI onto at least one region
– As good as the atlas being used
– M non-spatial variables, R regions
Analysis:
– Categorical structural variables: M × R contingency tables, chi-square/Fisher exact test; multiple comparison problem; log-linear analysis, multivariate Bayesian
– Continuous structural variables: logistic regression, Mann-Whitney

Dynamic Recursive Partitioning
– Adaptive partitioning of a 3D volume
– Partitioning criterion: discriminative power of the feature(s) of a hyper-rectangle, and the size of the hyper-rectangle
– Extract features from discriminative regions
– Reduce the multiple comparison problem (# tests = # partitions < # voxels)
– Tests downward closed
[V. Megalooikonomou, D. Pokrajac, A. Lazarevic, and Z. Obradovic, SPIE Conference on Visualization and Data Analysis, 2002]
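The adaptive partitioning described here can be sketched as follows. This is only a sketch under stated assumptions, not the authors' implementation: the feature of a hyper-rectangle is taken to be its mean voxel value, discriminative power is judged by a threshold on |t| (a stand-in for a p-value threshold), and a box that is not discriminative is split at the midpoint of each axis until a maximum depth or single-voxel size is reached. All names and the toy volumes are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def region_mean(vol, box):
    """Mean of vol[z][y][x] inside box = ((z0, z1), (y0, y1), (x0, x1)), half-open."""
    (z0, z1), (y0, y1), (x0, x1) = box
    return mean(vol[z][y][x]
                for z in range(z0, z1) for y in range(y0, y1) for x in range(x0, x1))

def split(box):
    """Halve every axis that is longer than one voxel (up to 8 children)."""
    halves = []
    for lo, hi in box:
        mid = (lo + hi) // 2
        halves.append([(lo, mid), (mid, hi)] if hi - lo > 1 else [(lo, hi)])
    return [(z, y, x) for z in halves[0] for y in halves[1] for x in halves[2]]

def drp(controls, patients, box, depth=0, max_depth=3, t_thresh=4.0):
    """Return (list of discriminative boxes, number of tests performed)."""
    t = welch_t([region_mean(v, box) for v in controls],
                [region_mean(v, box) for v in patients])
    if abs(t) >= t_thresh:                 # discriminative: keep box, stop splitting
        return [box], 1
    children = split(box)
    if depth == max_depth or len(children) == 1:   # too deep or too small
        return [], 1
    found, tests = [], 1
    for child in children:
        f, n = drp(controls, patients, child, depth + 1, max_depth, t_thresh)
        found += f
        tests += n
    return found, tests

def make_volume(offset, lesion):
    """Toy 4x4x4 volume; 'lesion' raises one octant and lowers the opposite one."""
    vol = [[[offset for _ in range(4)] for _ in range(4)] for _ in range(4)]
    if lesion:
        for z in range(2):
            for y in range(2):
                for x in range(2):
                    vol[z][y][x] += 3.0
                    vol[z + 2][y + 2][x + 2] -= 3.0
    return vol

controls = [make_volume(o, False) for o in (0.0, 0.1, -0.1)]
patients = [make_volume(o, True) for o in (0.05, -0.05, 0.0)]
whole = ((0, 4), (0, 4), (0, 4))
found, tests = drp(controls, patients, whole, max_depth=1)
```

In this toy case the whole-volume mean hides the difference (the two altered octants cancel out), so the root is split once and the two altered octants are reported after 9 tests, compared with 64 tests for a voxel-wise analysis of the same volume.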

Other Methods for Spatial Data Classification
Distinguishing among distributions:
– Distributional distances: Mahalanobis distance; Kullback-Leibler divergence (parametric, non-parametric)
– Maximum likelihood: estimate probability densities and compute likelihood
– EM (Expectation-Maximization) method to model spatial regions using some base function (Gaussian)
Static partitioning:
– Reduction of the number of attributes as compared to voxel-wise analysis
– Space partitioned into 3D hyper-rectangles (variables: properties of voxels inside hyper-rectangles); incrementally increase discretization
[D. Pokrajac, V. Megalooikonomou, A. Lazarevic, D. Kontos, Z. Obradovic, Artificial Intelligence in Medicine, Vol. 33, No. 3, pp , Mar]

Experimental Results
[Figure: Areas discovered by DRP with the t-test: significance threshold = 0.05, maximum tree depth = 3. The colorbar shows significance.]
[D. Kontos, V. Megalooikonomou, D. Pokrajac, A. Lazarevic, Z. Obradovic, O. B. Boyko, J. Ford, F. Makedon, A. J. Saykin, MICCAI 2004]
[Table: Comparison of number of tests performed. Columns: Thresh., Depth, number of tests (DRP, voxel-wise).]
[Table: Classification accuracy (%). Columns: Method, Criterion, Threshold, Tree depth, Controls, Patients, Total. Rows: DRP (correlation, t-test, ranksum), Maximum Likelihood / EM, Maximum Likelihood / k-means, Kullback-Leibler / EM, Kullback-Leibler / k-means. Surviving values: 776671, at the end of the Kullback-Leibler / k-means row.]

Experimental Results
Impact:
– Assist in interpretation of images (e.g., facilitating diagnosis)
– Enable researchers to integrate, manipulate, and analyze large volumes of image data
[Figure: Discriminative sub-regions detected when applying (a) DRP and (b) voxel-wise analysis with the ranksum test and significance threshold 0.05 to the real fMRI volume data]

Time Sequence Analysis
Time sequence: a sequence (ordered collection) of real values: X = x_1, x_2, …, x_n
Time series data abound in many applications …
Challenges:
– High dimensionality
– Large number of sequences
– Similarity metric definition
Similarity analysis (e.g., find stocks similar to that of IBM)
Goals: high accuracy (and high speed) in similarity searches among time series and in discovering interesting patterns
Applications: clustering, classification, similarity searches, summarization

Dimensionality Reduction Techniques
– DFT: Discrete Fourier Transform
– DWT: Discrete Wavelet Transform
– SVD: Singular Value Decomposition
– APCA: Adaptive Piecewise Constant Approximation
– PAA: Piecewise Aggregate Approximation
– SAX: Symbolic Aggregate approXimation
– …

Similarity distances for time series
– Euclidean distance: most common, sensitive to shifts
– Dynamic Time Warping (DTW): improves accuracy but is slow: O(n^2)
– Envelope-based DTW: faster: O(n)
– A more intuitive idea: two series should be considered similar if they have enough non-overlapping time-ordered pairs of subsequences that are similar (Agrawal et al., VLDB 1995)
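To make the contrast between the first two distances concrete, the sketch below implements the Euclidean distance and the textbook quadratic-time dynamic-programming form of DTW. The squared point cost and the final square root are choices made for this illustration, and the envelope-based variant is not shown.

```python
import math

def euclidean(x, y):
    """Point-by-point distance; both series must have the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dtw(x, y):
    """Dynamic Time Warping via dynamic programming, O(len(x) * len(y)) time."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # extend the cheapest of the three admissible warping steps
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]   # same bump, shifted one step earlier

d_euclid = euclidean(a, b)
d_dtw = dtw(a, b)
```

For this pair the Euclidean distance is 2.0 while the DTW distance is 0.0, which illustrates the sensitivity to shifts noted on the slide.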

Partitioning – Piecewise Constant Approximations
– Original time series (n points)
– Piecewise constant approximation (PCA) or Piecewise Aggregate Approximation (PAA) [Yi and Faloutsos ’00, Keogh et al. ’00] (n' segments)
– Adaptive Piecewise Constant Approximation (APCA) [Keogh et al. ’01] (n" segments)
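A minimal sketch of PAA, the equal-width variant: each of the n' segments is replaced by its mean. To keep the example short it assumes the series length is divisible by the number of segments; the function name is illustrative.

```python
def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width frame."""
    n = len(series)
    if n % n_segments != 0:
        raise ValueError("this sketch assumes len(series) is divisible by n_segments")
    w = n // n_segments
    return [sum(series[i:i + w]) / w for i in range(0, n, w)]

approx = paa([1, 3, 2, 4, 10, 12, 7, 9], 4)   # 8 points reduced to 4 segment means
```

Here `approx` is `[2.0, 3.0, 11.0, 8.0]`. APCA differs in that segment boundaries are chosen adaptively, so segments may have different lengths.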

Multiresolution Vector Quantized approximation (MVQ)
Partitions a sequence into equal-length segments and uses VQ to represent each sequence by the appearance frequencies of key subsequences:
1) Uses a ‘vocabulary’ of subsequences (codebook); training is involved
2) Takes multiple resolutions into account; keeps both local and global information
3) Unlike wavelets, partially ignores the ordering of ‘codewords’
4) Can exploit prior knowledge about the data
5) Employs a new distance metric
[V. Megalooikonomou, Q. Wang, G. Li, C. Faloutsos, ICDE 2005]

Methodology
[Figure: MVQ pipeline: codebook generation (codebook size s = 16), series transformation, and series encoding, in which each series becomes a string of codeword symbols (e.g., c m d b c a i f a j b b m i n j …)]

Methodology
Creating a ‘vocabulary’:
– Frequently appearing patterns in subsequences
– Output: a codebook with s codewords
– Q: How to create it? A: Use Vector Quantization, in particular the Generalized Lloyd Algorithm (GLA)
Representing time series:
– X = x_1, x_2, …, x_n is encoded with a new representation f = (f_1, f_2, …, f_s), where f_i is the frequency of the i-th codeword in X
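The two steps on this slide (train a codebook, then encode a series as codeword frequencies) can be sketched as below. This is a simplified illustration, not the authors' code: the Lloyd iteration is the basic assign-then-recenter loop, the initial codewords are supplied by the caller (practical GLA implementations use more careful initialization), and all names and data are hypothetical.

```python
def segments(series, length):
    """Consecutive, non-overlapping segments of the given length."""
    return [tuple(series[i:i + length])
            for i in range(0, len(series) - length + 1, length)]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))

def lloyd(vectors, init, iters=20):
    """Basic Lloyd iteration: assign each vector to its nearest codeword,
    then move every codeword to the centroid of its cell."""
    codebook = [tuple(c) for c in init]
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in vectors:
            cells[nearest(v, codebook)].append(v)
        codebook = [
            tuple(sum(col) / len(cell) for col in zip(*cell)) if cell else old
            for cell, old in zip(cells, codebook)
        ]
    return codebook

def encode(series, codebook, length):
    """f = (f_1, ..., f_s): how often each codeword is the nearest one."""
    freq = [0] * len(codebook)
    for seg in segments(series, length):
        freq[nearest(seg, codebook)] += 1
    return freq

# Toy training data (made up): rising and falling two-point patterns.
training = [0, 1, 0, 1, 1, 0, 0.1, 0.9, 1, 0.1]
codebook = lloyd(segments(training, 2), init=[(0, 1), (1, 0)])
f = encode([0, 1, 1, 0, 0, 1, 0, 1], codebook, 2)
```

The encoded series `f` is `[3, 1]`: three rising segments and one falling segment. The order in which they occurred is not recorded, which is the sense in which the representation partially ignores codeword ordering.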

Methodology
New distance metric: the histogram model is used to calculate similarity at each resolution level.
[Equation: similarity between a series t and a query q, expressed with the codeword frequencies $f_{i,t}$ and $f_{i,q}$ for $i = 1, \dots, s$]
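The exact formula is not legible in this transcript, so the sketch below only illustrates the general shape of a histogram-based similarity between two frequency vectors. The specific form used here, a normalized absolute difference per codeword turned into a similarity in (0, 1], is an assumption made for the example and should be checked against the ICDE 2005 paper.

```python
def histogram_similarity(f_t, f_q):
    """Similarity of two codeword-frequency vectors of equal length s.
    ASSUMED form: dis = sum_i |f_it - f_iq| / (1 + f_it + f_iq), S = 1 / (1 + dis)."""
    if len(f_t) != len(f_q):
        raise ValueError("frequency vectors must use the same codebook")
    dis = sum(abs(a - b) / (1 + a + b) for a, b in zip(f_t, f_q))
    return 1.0 / (1.0 + dis)

same = histogram_similarity([3, 1], [3, 1])        # identical histograms
different = histogram_similarity([3, 1], [1, 3])   # swapped frequencies
```

With this assumed form, identical histograms score 1.0 and the score decreases as the frequency vectors diverge.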

Methodology
Time series summarization:
– High-level information (frequently appearing patterns) is more useful
– The new representation can provide this kind of information
[Figure: example series in which codewords (patterns) 3 and 5 each show up 2 times]

Methodology
Problems of frequency-based encoding:
– It is hard to define an appropriate resolution (codeword length)
– It may lose global information

Methodology
Solution: use multiple resolutions. This addresses both problems of frequency-based encoding: the difficulty of defining an appropriate resolution (codeword length) and the possible loss of global information.

Methodology
Proposed distance metric: a weighted sum of the similarities at all resolution levels,

$$S(q, t) = \sum_{i=1}^{c} w_i \, S_i(q, t)$$

where $S_i$ is the similarity at resolution level i, $w_i$ is the weight of that level, and c is the number of resolution levels. Lacking any prior knowledge, giving equal weights to all resolution levels works well most of the time.
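The weighted combination across resolution levels can be sketched in a few lines. The per-level similarities are taken as given inputs; normalizing the equal weights to 1/c is an assumption of this sketch (the slide only says the weights are equal), and the function name is illustrative.

```python
def multiresolution_similarity(level_similarities, weights=None):
    """Weighted sum of per-level similarities; equal weights when none are given."""
    c = len(level_similarities)
    if weights is None:
        weights = [1.0 / c] * c          # assumed normalization: weights sum to 1
    if len(weights) != c:
        raise ValueError("need exactly one weight per resolution level")
    return sum(w * s for w, s in zip(weights, level_similarities))

overall = multiresolution_similarity([0.5, 0.8, 1.0])                 # equal weights
finest_only = multiresolution_similarity([0.5, 0.8, 1.0], [0, 0, 1])  # one level only
```

A weight vector with a single nonzero entry reduces this to a single-level comparison, which is presumably how the "Single level VQ" rows in the results differ from the MVQ row.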

MVQ: Example of Codebooks
[Figure: codebook for the first level; codebook for the second level (more codewords since there are more details)]

Experiments
Datasets:
– SYNDATA (control chart data): synthetic
– CAMMOUSE: 3 *5 sequences obtained using the Camera Mouse Program
– RTT: RTT measurements from UCR to CMU with a sending rate of 50 msec for a day

Experiments
Best Match Searching:
Matching accuracy: the percentage of the k nearest neighbors (found by the different approaches) that are in the same class as the query

Experiments
Best Match Searching

SYNDATA
Method            Weight Vector   Accuracy
Single level VQ   [ ]             0.55
                  [ ]             0.70
                  [ ]             0.65
                  [ ]             0.48
                  [ ]             0.46
MVQ               [ ]             0.83
Euclidean                         0.51

CAMMOUSE
Method            Weight Vector   Accuracy
Single level VQ   [ ]             0.56
                  [ ]             0.60
                  [ ]             0.44
                  [ ]             0.56
                  [ ]             0.60
MVQ               [ ]             0.83
Euclidean                         0.58

Experiments
Best Match Searching
[Figure: Precision-recall for different methods, including MVQ, (a) on the SYNDATA dataset and (b) on the CAMMOUSE dataset]

Experiments
Clustering experiments
Given two clusterings, G = G_1, G_2, …, G_K (the true clusters) and A = A_1, A_2, …, A_k (the clustering result of a certain method), the clustering accuracy is evaluated with the cluster similarity defined as

$$\mathrm{Sim}(G, A) = \frac{1}{K} \sum_{i=1}^{K} \max_{j} \mathrm{Sim}(G_i, A_j), \quad \text{with} \quad \mathrm{Sim}(G_i, A_j) = \frac{2\,|G_i \cap A_j|}{|G_i| + |A_j|}$$

[Gavrilov, M., Anguelov, D., Indyk, P. and Motwani, R., KDD 2000]
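For illustration, here is a small sketch of the cluster similarity measure as it is usually stated for Gavrilov et al.: each true cluster is matched to its best-overlapping found cluster using 2|G_i ∩ A_j| / (|G_i| + |A_j|), and the best matches are averaged over the true clusters. The function name and the toy clusterings are hypothetical.

```python
def cluster_similarity(true_clusters, found_clusters):
    """Average, over the true clusters, of the best pairwise overlap score."""
    def sim(g, a):
        return 2 * len(g & a) / (len(g) + len(a))

    G = [set(g) for g in true_clusters]
    A = [set(a) for a in found_clusters]
    return sum(max(sim(g, a) for a in A) for g in G) / len(G)

truth = [{1, 2, 3}, {4, 5, 6}]
result = [{1, 2}, {3, 4, 5, 6}]       # item 3 was placed in the wrong cluster

perfect = cluster_similarity(truth, truth)
imperfect = cluster_similarity(truth, result)
```

A perfect clustering scores 1.0. Note that the measure is not symmetric in its two arguments, because the maximum is taken over found clusters for each true cluster.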

Experiments
Clustering experiments

SYNDATA
Method            Weight Vector   Accuracy
Single level VQ   [ ]             0.69
                  [ ]             0.71
                  [ ]             0.63
                  [ ]             0.51
                  [ ]             0.49
MVQ               [ ]             0.82
DFT                               0.67
SAX                               0.65
DTW                               0.80
Euclidean                         0.55

RTT
Method            Weight Vector   Accuracy
Single level VQ   [ ]             0.55
                  [ ]             0.52
                  [ ]             0.57
                  [ ]             0.80
                  [ ]             0.79
MVQ               [ ]             0.81
DFT                               0.54
SAX                               0.54
DTW                               0.62
Euclidean                         0.50

Experiments
Summarization (SYNDATA)
[Figure: typical series]

Experiments
[Figure panels: First Level; Second Level]

MVQ: Example: Two Time Series
Given two time series t1 and t2 as follows:
– In the first level, they are encoded with the same codeword (3), so they are not distinguishable
– In the second level, more details are recorded. The two series have different encoded forms: the first series is encoded with codewords 1 and 4, the second with codewords 9 and 12

Analysis of images by projection to 1D
– Hilbert space-filling curve
– Binning
– Statistical tests of significance on groups of points
– Identification of discriminative areas by back-projection
[Figure: (a) linear mapping of a 3D fMRI scan, (b) effect of binning by representing each bin with its V_mean measurement, (c) the discriminative voxels after applying the t-test with θ = 0.05]
[D. Kontos, V. Megalooikonomou, N. Ghubade, and C. Faloutsos, IEEE Engineering in Medicine and Biology Society (EMBS), 2003]
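The projection-and-binning idea can be sketched as follows. To stay short, the example uses the standard 2D Hilbert index computation on a square image whose side is a power of two, whereas the slides work with 3D scans; the helper names are hypothetical and the binning simply averages consecutive runs of the linearized values.

```python
def xy2d(n, x, y):
    """Hilbert-curve index of cell (x, y) in an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                       # rotate/flip the quadrant
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_linearize(image):
    """Read a square image (list of rows) in Hilbert order as a 1D sequence."""
    n = len(image)
    order = sorted((xy2d(n, x, y), image[y][x]) for y in range(n) for x in range(n))
    return [value for _, value in order]

def bin_means(seq, bin_size):
    """Represent each run of bin_size consecutive values by its mean."""
    out = []
    for i in range(0, len(seq), bin_size):
        chunk = seq[i:i + bin_size]
        out.append(sum(chunk) / len(chunk))
    return out

# Toy image whose pixel value equals its own Hilbert index.
image = [[xy2d(4, x, y) for x in range(4)] for y in range(4)]
sequence = hilbert_linearize(image)
binned = bin_means(sequence, 4)
```

Because the Hilbert curve keeps consecutive indices spatially adjacent, each bin corresponds to a compact group of neighboring cells, which is what allows a statistically significant bin to be projected back to an area of the image.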

Applying time series techniques
[Figure: Areas discovered: (a) θ = 0.05, (b) θ = 0.01. The colorbar shows significance.]
Variation: concatenate the values of statistically significant areas → spatial sequences
– Pattern analysis using the similarity between spatial sequences and time sequences
– SVD, DFT, DWT, PCA (clustering accuracy: %)
Results: 87%-98% classification accuracy (t-test, CATX)
[Q. Wang, D. Kontos, G. Li and V. Megalooikonomou, ICASSP 2004]

Conclusions
‘Find patterns/interesting things’ efficiently and robustly in spatial and temporal data:
– Use of partitioning and clustering
– Analysis at multiple resolutions
– Reduction of the number of tests performed
– Intelligent exploration of the space to find discriminative areas
– Reduction of dimensionality
– Symbolic representation
– Nice summarization

Collaborators
Faculty: Zoran Obradovic, Orest Boyko, James Gee, Andrew Saykin, Christos Faloutsos, Christos Davatzikos, Edward Herskovits, Fillia Makedon, Dragoljub Pokrajac
Students: Despina Kontos, Qiang Wang, Guo Li
Others: James Ford, Alexandar Lazarevic

Thank you!
Acknowledgements
This research has been funded by:
– National Science Foundation CAREER award
– National Science Foundation Grant
– National Institutes of Health Grant R01 MH68066, funded by NIMH, NINDS, and NIA