Oak Ridge National Laboratory, U.S. Department of Energy: Multi-agent based High-Dimensional Cluster Analysis, SciDAC SDM-ISIC Kickoff Meeting July.


1 Oak Ridge National Laboratory, U.S. Department of Energy
Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
Nagiza Samatova & George Ostrouchov
Computer Science and Mathematics Division, Oak Ridge National Laboratory

2 Science-driven Bottlenecks
1) Data management and data mining algorithms: not scalable to petabytes of scientific data
2) Retrieving data subsets from storage systems: too slow, especially for tertiary storage
3) Transferring large datasets between sites: inefficient
4) Navigating between heterogeneous, distributed data sources: very user-intensive
5) I/O techniques: access rates too low
Approaches: to improve the transfer of large datasets
Major focus:
- To implement effective high-bandwidth transfers (Randy Burris)
- To minimize the amount of data transferred

3 Minimizing the amount of scientific simulation data transfer: State of the Art
- Data compression utilities (zip, compress, etc.):
  - large overheads
  - modest compression rates
- Post-processing data analysis tools (like PCMDI):
  - scientists must wait for the simulation to complete
  - can use lots of CPU cycles on long-running simulations
  - can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
- Simulation monitoring tools:
  - interference with simulations
  - lack of flexibility

4 Improvements through multi-level data minimization mechanisms
I. Simulation level
Data stream (not simulation) monitoring tools for:
- “Any-time” feedback to decide whether to terminate a simulation, restart with new parameters, or continue
- Filtering runs to decide whether to transfer to a central archive, keep locally, or delete
II. Comparative analysis level
Application-specific search engines for:
- Simulation data comparison, especially against archived databases
- Distributed simulation data query, search, and retrieval
III. In-depth analysis level
Application-specific inference engines for:
- Inferring rules relating fragments in two or more simulation outputs
- New scientific discoveries

5 How we will address these needs
- Our approach: Develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit) that includes:
  - Dynamic first-look multivariate time series miner (Level I)
  - Distributed time-series query, search, and retrieval engine (Level II)
  - Time-series-based rules inference engine (Level III)
- Our strategy:
  - Leverage existing work
  - Expand our prior work
  - Integrate with other SDM tasks
  - Work closely with application scientists
  - Develop ASPECT in an iterative fashion

6 Our work will be leveraging
- Distributed Scientific Data Mining Research (Probe/MICS) [SOA+01a, SOA+01b]
- Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL+96, DFL+00, DFL+00]
- Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]

7 Distributed Scientific Data Mining Research (funded under Probe/MICS)
- Motivation
- Big picture
- SDM-ETC related effort
- Relevance to our task: Levels II and III
- Limitations w.r.t. our task:
  - Enabling technology research, not application-specific

8 Motivation for Scientific Data Mining Research under Probe
Existing data mining tools have limited applicability to the emerging scientific data sets, which are:
- High-dimensional
  - The usual assumptions about homogeneity or ergodicity cannot be made.
  - Need segmented dimension reduction methods.
- Massive (terabytes to petabytes)
  - Existing methods do not scale in terms of time, storage, or number of dimensions.
  - Need scalable data analysis algorithms.
- Distributed (e.g., across computational grids, multiple files, disks, tapes)
  - Existing methods work on a single, centralized dataset. Data transfer is prohibitive (high bandwidth, security/privacy concerns).
  - Need distributed data analysis algorithms.
- Dynamic
  - Existing methods work with static datasets. Any change requires complete re-computation.
  - Need dynamic (updating & downdating) techniques.
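
To make "updating & downdating" concrete, here is a minimal sketch of a summary that absorbs or forgets a single observation in constant time instead of being recomputed from scratch. It uses the standard running mean and variance recurrences; the class name `RunningStats` and its methods are illustrative and do not come from the slides.

```python
class RunningStats:
    """Mean and sum of squared deviations with O(1) update and downdate."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        """Add one observation."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def downdate(self, x):
        """Remove one previously added observation."""
        if self.n == 0:
            raise ValueError("nothing to remove")
        if self.n == 1:
            self.n, self.mean, self.m2 = 0, 0.0, 0.0
            return
        self.n -= 1
        delta = x - self.mean
        self.mean -= delta / self.n
        self.m2 -= delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the observations currently summarized."""
        return self.m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:
    stats.update(value)
stats.downdate(4.0)  # the summary now describes [1.0, 2.0, 3.0]
```

The same idea, applied to cluster summaries rather than a scalar mean, is what lets a clustering structure follow a changing dataset without a full re-run.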

9 Our Approach: Distributed agents and peer-to-peer negotiation
- Strategy
  - to perform data mining in a distributed and recursive fashion
  - with reasonable data transfer overheads
- Key idea
  - Generate local components using distributed agents
  - Merge these components into a global system via peer-to-peer agents’ collaboration and negotiation
- Requirements for the resulting system
  - Qualitative comparability
  - Computational complexity reduction
  - Scalability
  - Communication acceptability
  - Flexibility (in the choice of a local algorithm)
  - Visual representation sufficiency

10 Background: Hierarchical Clustering
[Figure: five data items A, B, C, D, E shown three ways: as a dendrogram, as a distance matrix, and as a spanning tree with dissimilarity measures. The numeric values are not recoverable from the transcript.]
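
As a reminder of how a distance matrix turns into a dendrogram, the following is a minimal single-linkage sketch. The dissimilarity values are made up for illustration (the slide's own numbers did not survive extraction), and single linkage is an assumed choice; the slide does not say which linkage was used.

```python
from itertools import combinations


def single_linkage(labels, dist):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, height)."""
    clusters = [frozenset([x]) for x in labels]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist[frozenset((x, y))] for x in a for y in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((sorted(a), sorted(b), linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical pairwise dissimilarities for five items (not the slide's values).
pairs = {
    ("A", "B"): 0.25, ("A", "C"): 0.60, ("A", "D"): 0.70, ("A", "E"): 0.80,
    ("B", "C"): 0.50, ("B", "D"): 0.75, ("B", "E"): 0.85,
    ("C", "D"): 0.65, ("C", "E"): 0.90, ("D", "E"): 0.40,
}
dist = {frozenset(k): v for k, v in pairs.items()}
merges = single_linkage("ABCDE", dist)
for left, right, height in merges:
    print(left, "+", right, "at", height)
```

Each merge is one internal node of the dendrogram, and its height is the dissimilarity at which the two branches join.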

11 SDM-ETC Tie-in: Distributed Hierarchical Clustering
Problem description:
- Given: A data set with N d-dimensional data items distributed across multiple data sites
- Task: Determine a hierarchical decomposition of this dataset
- Applications of clustering:
  - Database management
  - Multi-dimensional indexing
  - Data mining
  - and…

12 RACHET: Distributed Clustering Algorithm
[Figure: control flow of RACHET]
- Generate local dendrograms (distributed dendrograms)
- Transmit local dendrograms (centralized dendrograms)
- Merge local dendrograms into a global dendrogram
- Check: comparable quality? To improve, increase k
- Reconstruct geometry for visualization (optional)

13 Centroid Descriptive Statistics
Question: How many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
The summarized cluster representation consists of the following (most symbols were lost in extraction):
- the cluster centroid of N_c points
- N_c: number of data points in the cluster
- squared norm of the centroid
- radius of the cluster
- sum of centroid components
- minimum centroid component
- maximum centroid component
Features:
- Space cost comparison (the two costs compared on the slide were lost in extraction)
- Sufficient for efficiently calculating all measurements involved in making clustering decisions
- Sufficient for visualization
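
A minimal sketch of computing such a summary from raw points follows. The field names are illustrative, and the radius is assumed to be the root-mean-square distance from the points to the centroid; the slide lists a radius but its definition did not survive extraction.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterStats:
    n: int            # number of data points in the cluster
    norm_sq: float    # squared norm of the centroid
    radius: float     # ASSUMPTION: RMS distance of points to the centroid
    comp_sum: float   # sum of centroid components
    comp_min: float   # minimum centroid component
    comp_max: float   # maximum centroid component


def describe(points):
    """Summarize a non-empty list of equal-length points."""
    n = len(points)
    centroid = [sum(col) / n for col in zip(*points)]
    mean_sq_dist = sum(
        sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in points
    ) / n
    return ClusterStats(
        n=n,
        norm_sq=sum(m * m for m in centroid),
        radius=math.sqrt(mean_sq_dist),
        comp_sum=sum(centroid),
        comp_min=min(centroid),
        comp_max=max(centroid),
    )


summary = describe([(0.0, 0.0), (4.0, 0.0), (0.0, 2.0), (4.0, 2.0)])
```

Note that the summary has a fixed size regardless of the dimension d or the number of points, which is the property that keeps storage and transmission small.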

14 Updating Descriptive Statistics
Merging Theorem: given the descriptive statistics of two clusters, a set of statements holds for the descriptive statistics of the cluster formed by merging them.
[The theorem's symbols and formulas were lost in extraction. The accompanying figure carries the labels S1, S2, C1, C2, and O.]
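
The slide's formulas are missing, so the following is not the Merging Theorem itself. It is a sketch of the simplest exact case, under two assumptions: both centroids are available in full, and the radius is the root-mean-square distance of a cluster's points to its centroid. Under those assumptions the merged count, centroid, and radius follow from the two summaries alone, without revisiting the raw points.

```python
import math


def merge(n1, c1, r1, n2, c2, r2):
    """Merge two cluster summaries (count, centroid, RMS radius)."""
    n = n1 + n2
    # Merged centroid is the count-weighted average of the two centroids.
    c = [(n1 * a + n2 * b) / n for a, b in zip(c1, c2)]

    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Each cluster's spread about the new centroid is its own spread plus
    # the squared shift of its centroid (parallel-axis identity).
    r_sq = (n1 * (r1 ** 2 + sq_dist(c1, c)) + n2 * (r2 ** 2 + sq_dist(c2, c))) / n
    return n, c, math.sqrt(r_sq)


# Cluster 1 = {(0,0), (2,0)}, cluster 2 = {(0,4), (2,4)}; each has RMS radius 1.
n, c, r = merge(2, [1.0, 0.0], 1.0, 2, [1.0, 4.0], 1.0)
```

When only scalar summaries of the centroids are transmitted rather than the centroids themselves, some merged quantities can only be bounded, which is where distance approximation comes in.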

15 Euclidean Distance Approximation
- Squared Euclidean distance: [formula and transmission cost lost in extraction]
- Lower and upper bounds: [formulas lost in extraction]
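
Since the slide's bounds are missing, here is one elementary pair of bounds as an illustration; it is not necessarily the pair the authors used. Expanding the squared distance between centroids gives ||c1||^2 + ||c2||^2 - 2 c1·c2, and the Cauchy-Schwarz inequality bounds the dot product by the product of the norms. The distance can therefore be bracketed from the two squared norms alone, which are scalars and cheap to transmit.

```python
import math


def squared_distance(c1, c2):
    """Exact squared Euclidean distance; needs both full centroids."""
    return sum((a - b) ** 2 for a, b in zip(c1, c2))


def norm_bounds(norm_sq1, norm_sq2):
    """Lower and upper bounds on the squared distance from squared norms only."""
    a, b = math.sqrt(norm_sq1), math.sqrt(norm_sq2)
    return (a - b) ** 2, (a + b) ** 2


c1, c2 = (3.0, 4.0), (1.0, 0.0)
exact = squared_distance(c1, c2)
lower, upper = norm_bounds(25.0, 1.0)  # squared norms of c1 and c2
```

Tighter bounds are possible when more summary statistics (such as the component sum, minimum, and maximum) are also available.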

16 Performance Analysis: RACHET is linear in time, space, and transmission
O(N), with |S| << N and k << N

17 Analysis of Large Scientific Datasets
- Focus: Univariate time series data
- Applications: ARM, EEG
- Relevance to our task: Level III
- Limitations w.r.t. our task:
  - No support for dynamic & distributed time series
  - No support for multivariate time series

18 Local Models for Global Analysis and Comparison of Data Series
- Strategy
  - Segment series
  - Model the usual to find the unusual
- Key ideas
  - Fit simple local models to segments
  - Use parameters for global analysis and monitoring
- Resulting system
  - Detects specific events (targeted or unusual)
  - Provides a global description of one or several data series
  - Provides data reduction to parameters of the local model

19 From Local Models to Annotated Time Series
1. Segment series (100 obs)
2. Fit simple local model (c_0, c_1, c_2, ||e||, ||e||_2)
3. Select extreme (10%)
4. Cluster extreme (4)
5. Map back to series
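
The pipeline above can be sketched in a few dozen lines. Several choices here are assumptions, not taken from the slides: the local model is a least-squares quadratic, the two residual norms are the maximum norm and the 2-norm, and "extreme" means farthest from the average segment in standardized parameter space. The clustering of the extreme segments is left out to keep the sketch short.

```python
import math


def solve(A, b):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for k in range(i, n + 1):
                M[r][k] -= f * M[i][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def fit_segment(y):
    """Quadratic fit on t in [-1, 1]; returns (c0, c1, c2, max|e|, ||e||_2)."""
    m = len(y)
    t = [2 * i / (m - 1) - 1 for i in range(m)]
    # Normal equations for the basis 1, t, t^2.
    A = [[sum(ti ** (j + k) for ti in t) for k in range(3)] for j in range(3)]
    b = [sum(yi * ti ** j for yi, ti in zip(y, t)) for j in range(3)]
    c0, c1, c2 = solve(A, b)
    e = [yi - (c0 + c1 * ti + c2 * ti * ti) for yi, ti in zip(y, t)]
    return (c0, c1, c2, max(abs(r) for r in e), math.sqrt(sum(r * r for r in e)))


def extreme_segments(series, seg_len=100, fraction=0.10):
    """Return (start, end) index ranges of the most unusual segments."""
    starts = range(0, len(series) - seg_len + 1, seg_len)
    feats = [fit_segment(series[s:s + seg_len]) for s in starts]
    cols = list(zip(*feats))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    # Score = squared distance from the mean segment in standardized units.
    scores = [sum(((v - m) / s) ** 2 for v, m, s in zip(f, means, stds))
              for f in feats]
    keep = max(1, round(fraction * len(feats)))
    top = sorted(range(len(feats)), key=lambda i: -scores[i])[:keep]
    return [(starts[i], starts[i] + seg_len) for i in sorted(top)]


# Synthetic example: a linear ramp with one spike inside the fourth segment.
series = [0.01 * i for i in range(1000)]
series[350] += 5.0
flagged = extreme_segments(series)
```

Each segment of 100 observations is reduced to five numbers, and the flagged index ranges are what gets mapped back onto the original series as annotations.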

20 Statistical Downscaling for Climate
- Focus: Image time series
- Application: Climate
- Relevance to our task: Levels I and II
- Limitation w.r.t. our task: Works as a post-processing tool

21 Climate Downscaling Contains Several Post-Processing Tools

22 Trend and Periodic Components Provide a Concise Description of Model Run
- Filter periodic and trend components
- Compute EOFs
- Monitor model run
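
A minimal sketch of the first step, filtering trend and periodic components from a single series, is given below. It assumes a linear trend and a known period, and it estimates the periodic part as the per-phase mean of the detrended series; none of these choices are stated on the slide. The EOF computation is omitted because it needs a matrix decomposition beyond the scope of a short example.

```python
import math


def linear_trend(y):
    """Least-squares straight line through y, evaluated at each index."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    slope = (sum((i - t_mean) * (v - y_mean) for i, v in enumerate(y))
             / sum((i - t_mean) ** 2 for i in range(n)))
    return [y_mean + slope * (i - t_mean) for i in range(n)]


def periodic_component(y, period):
    """Mean value at each phase of the cycle, repeated along the series."""
    buckets = [[] for _ in range(period)]
    for i, v in enumerate(y):
        buckets[i % period].append(v)
    means = [sum(b) / len(b) for b in buckets]
    return [means[i % period] for i in range(len(y))]


def decompose(y, period):
    """Split y into trend + periodic + residual."""
    trend = linear_trend(y)
    detrended = [v - t for v, t in zip(y, trend)]
    periodic = periodic_component(detrended, period)
    residual = [d - p for d, p in zip(detrended, periodic)]
    return trend, periodic, residual


# Synthetic monthly-style series: slope 0.5 plus a cycle of period 12.
y = [0.5 * i + 2.0 * math.sin(2 * math.pi * i / 12) for i in range(48)]
trend, periodic, residual = decompose(y, period=12)
slope = trend[1] - trend[0]
```

The trend slope and the handful of per-phase means summarize most of the series, and what remains in the residual is the part worth monitoring.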

23 Summary of where efforts are needed
- Research:
  - Multivariate time series datasets
  - Dynamic versions of time series processing & analysis tools
  - Application-specific distributed & dynamic clustering
  - Application-specific rules inference algorithms
- Implementations:
  - ASPECT’s framework
  - Simulation data monitoring engine:
    - with pluggable user-driven data analysis modules
    - with “any-time”, “real-time” operation, not post-processing
    - with no or very little interference with the simulation
  - Simulation data query, search, & retrieval engine
  - Simulation data rules inference engine
- A lot of integration work…

24 Integration with other SDM-ETC tasks

