
1 Scalable Exploratory Data Mining of Distributed Geoscientific Data
Authors: E.C. Shek, R.R. Muntz, E. Mesrobian and K. Ng
by Sona Srinivasan



2 Outline
- Introduction
- Geoscientific Data Modeling
- Geoscientific Algebraic Operators
- Physical Data Model
- Parallel Query Execution
- Automatic Query Execution
- Heterogeneous Distributed Data Access
- Implementations and Experiences
- Conclusion
- References

3 Introduction
- Geoscience studies produce a tremendous amount of raw data.
- Analysis involves extracting interesting geoscientific phenomena that are not directly observed in the raw datasets.
- Example: cyclone tracks, the trajectories traveled by low-pressure areas over time, can be extracted from a sea-level pressure dataset.
- Data mining in business applications and geoscientific feature extraction both involve sifting through large volumes of isolated events and data to locate salient patterns.
- Feature extraction is treated as a database query processing problem in order to take advantage of automatic query optimization and parallelization techniques.
- Conquest: an extensible parallel geoscientific query processing system.

4 Geoscientific Data Model
[Figure: Example Geographic Data Field]

5 Geoscientific Data Model
- A field associates parameter values with cells in a multidimensional coordinate space.
- Cells can be of various geometric object types.
- The type of cells and the coordinate space they lie in is determined by the coordinate space.
- Values for the cells lie in a multidimensional variable space.
- Variable attributes: the type of values associated with a cell in the coordinate space.
- Cell record: a cell and the variable value associated with it.
- Cell coverage: the set of distinct cells in the coordinate space for which variable values are recorded.
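The terms on this slide can be made concrete with a small sketch. The slides do not show how Conquest represents fields, so every class and method name below is hypothetical, and cells are simplified to points even though the model allows other geometric object types.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical types for illustration only; not Conquest's actual API.
Cell = Tuple[float, ...]    # a point cell in the coordinate space, e.g. (lat, lon, time)
Value = Tuple[float, ...]   # a point in the variable space, e.g. (pressure,)


@dataclass(frozen=True)
class CellRecord:
    """A cell together with the variable value recorded for it."""
    cell: Cell
    value: Value


class Field:
    """Associates variable values with cells in a coordinate space."""

    def __init__(self, coordinate_names, variable_names):
        self.coordinate_names = tuple(coordinate_names)
        self.variable_names = tuple(variable_names)
        self._values: Dict[Cell, Value] = {}

    def put(self, cell, value):
        # Reject records that do not fit the declared spaces.
        if len(cell) != len(self.coordinate_names):
            raise ValueError("cell does not match the coordinate space")
        if len(value) != len(self.variable_names):
            raise ValueError("value does not match the variable space")
        self._values[tuple(cell)] = tuple(value)

    def records(self) -> List[CellRecord]:
        return [CellRecord(c, v) for c, v in self._values.items()]

    def coverage(self) -> Set[Cell]:
        """Cell coverage: the distinct cells that have a recorded value."""
        return set(self._values)


# A toy sea-level pressure field over (lat, lon, time).
slp = Field(["lat", "lon", "time"], ["pressure_hpa"])
slp.put((34.0, -118.0, 0.0), (1012.5,))
slp.put((34.0, -118.0, 1.0), (1009.8,))
slp.put((34.0, -118.0, 1.0), (1009.1,))  # same cell: replaces the earlier value
```

Keying the store by cell keeps the coverage a set of distinct cells, which is why the third `put` replaces a value instead of adding a record.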

6 Geoscientific Algebraic Operators
- A base set of general-purpose logical operators for manipulating field data.
- Users may introduce operators based on application-specific algorithms.
- Set-oriented relational operators: Selection, Projection, Cartesian Product, Union, Intersection, Set Difference, Join
- Sequence-oriented operators
- Grouping operators: Nest and Unnest
- Space conversion operators
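A few of the operators named above can be sketched over a toy representation. This is an illustration under assumptions, not Conquest's operator interface: a field is modeled as a list of dictionaries (one per cell record), and the function names `select`, `project`, `nest` and `unnest` are chosen here for readability.

```python
from collections import defaultdict


def select(records, predicate):
    """Selection: keep the cell records that satisfy the predicate."""
    return [r for r in records if predicate(r)]


def project(records, attrs):
    """Projection: keep only the named attributes of each record."""
    return [{a: r[a] for a in attrs} for r in records]


def nest(records, group_attrs, nested_name):
    """Nest: group records that agree on group_attrs into one record
    whose nested_name attribute holds the remaining attributes."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r[a] for a in group_attrs)
        rest = {k: v for k, v in r.items() if k not in group_attrs}
        groups[key].append(rest)
    nested = []
    for key, members in groups.items():
        row = dict(zip(group_attrs, key))
        row[nested_name] = members
        nested.append(row)
    return nested


def unnest(records, nested_name):
    """Unnest: flatten each nested member back into its own record."""
    flat = []
    for r in records:
        outer = {k: v for k, v in r.items() if k != nested_name}
        for member in r[nested_name]:
            flat.append({**outer, **member})
    return flat


records = [
    {"lat": 30, "lon": 10, "time": 0, "pressure": 1012.0},
    {"lat": 30, "lon": 10, "time": 1, "pressure": 998.0},
    {"lat": 35, "lon": 10, "time": 0, "pressure": 995.0},
]

low = select(records, lambda r: r["pressure"] < 1000.0)
times = project(low, ["time"])
by_cell = nest(records, ["lat", "lon"], "series")
round_trip = unnest(by_cell, "series")
```

Here `nest` turns the flat field into one record per (lat, lon) location with a time series attached, and `unnest` reverses it.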

7 Physical Data Model
[Figure: Nesting of a Data Field]

8 Parallel Query Execution
- Parallelization techniques are used to remove bottlenecks in I/O and computation and to improve query performance:
  - Pipelined processing, or dataflow parallelism
  - Partitioning, or intra-operator parallelism
  - Multicasting
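The three techniques can be shown in miniature with Python's standard library. This is a structural sketch only: the operator names are invented for the example, generators stand in for pipelined operators, a thread pool stands in for partitioned execution, and `itertools.tee` stands in for multicasting one stream to several consumers. It shows the shape of each technique, not Conquest's execution engine or any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import tee


def scan(values):
    """Source operator: emits one value at a time."""
    for v in values:
        yield v


def smooth(stream):
    """Pipelined operator: emits the mean of each consecutive pair
    as soon as the second value of the pair arrives."""
    prev = None
    for v in stream:
        if prev is not None:
            yield (prev + v) / 2
        prev = v


def partition(items, n):
    """Round-robin split of the input into n partitions."""
    return [items[i::n] for i in range(n)]


def parallel_min(items, n_workers=2):
    """Partitioned operator: each worker finds the minimum of its
    partition, then the partial results are combined."""
    parts = [p for p in partition(items, n_workers) if p]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return min(pool.map(min, parts))


pressures = [1012.0, 1008.0, 996.0, 1001.0]

# Pipelining: smooth consumes values while scan is still producing them.
smoothed = list(smooth(scan(pressures)))

# Partitioning: the same operator runs on disjoint parts of the data.
lowest = parallel_min(pressures)

# Multicasting: one producer feeds two independent consumers.
for_sum, for_max = tee(scan(pressures))
total, peak = sum(for_sum), max(for_max)
```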

9 Query Parallelization
- Window of relevance: the maximum length of time between the arrival of an object and the time it ceases to have an effect on the execution state of the operator.
  - Instantaneous
  - Known
  - Random but bounded
  - Fixed windows
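One way to picture a window of relevance is an operator that keeps only recent objects in its execution state. The slide does not define the four categories, so the sketch below assumes the simplest reading of a fixed window: an object stops affecting the operator once it is `window` time units old. The function name and the moving-minimum computation are hypothetical.

```python
from collections import deque


def windowed_min(stream, window):
    """For each (time, value) in a time-ordered stream, yield the minimum
    value among objects with timestamps in (time - window, time]."""
    state = deque()  # execution state: objects still inside the window
    for t, value in stream:
        # Evict objects whose window of relevance has expired.
        while state and state[0][0] <= t - window:
            state.popleft()
        state.append((t, value))
        yield t, min(v for _, v in state)


readings = [(0, 1010.0), (1, 1004.0), (2, 1007.0), (3, 1009.0)]
result = list(windowed_min(readings, window=2))
```

Under this assumption the state never holds more than `window` time units of input, so the memory an operator needs is bounded regardless of the stream length.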

10 Heterogeneous Distributed Data Access
- Only a small percentage of the data is analyzed, because of unavailable storage and bandwidth and the difficulty of integrating distributed datasets.
- Conquest supports access to datasets both through a distributed object interface and through a repository-specific scanner operator, because accessing data only through distributed objects eliminates the opportunity to use the query capability of the data repositories to optimize query evaluation.
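The difference between the two access paths can be illustrated with a stand-in repository. The sketch assumes an in-memory SQLite database plays the role of a data repository with its own query capability; the table layout and both function names are hypothetical and do not come from Conquest.

```python
import sqlite3

# Stand-in repository: an in-memory SQLite table of pressure readings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slp (lat REAL, lon REAL, t INTEGER, pressure REAL)")
conn.executemany(
    "INSERT INTO slp VALUES (?, ?, ?, ?)",
    [
        (30.0, 10.0, 0, 1012.0),
        (30.0, 10.0, 1, 998.0),
        (35.0, 10.0, 0, 995.0),
    ],
)


def object_interface_scan(conn):
    """Generic access path: every record crosses the interface and
    any filtering has to happen on the client side."""
    yield from conn.execute(
        "SELECT lat, lon, t, pressure FROM slp ORDER BY rowid"
    )


def repository_scan(conn, max_pressure):
    """Repository-specific scanner: the selection is handed to the
    repository, so only matching records are returned."""
    yield from conn.execute(
        "SELECT lat, lon, t, pressure FROM slp WHERE pressure < ? ORDER BY rowid",
        (max_pressure,),
    )


client_side = [r for r in object_interface_scan(conn) if r[3] < 1000.0]
pushed_down = list(repository_scan(conn, 1000.0))
```

Both paths return the same records, but the second transfers only the rows that pass the selection, which matters when storage and bandwidth are the limiting factors.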

11 Implementations and Experiences
- Ported to run on IBM SP1, SP2 and Intel Paragon.
- Has been used for the past five years for exploratory data analysis and data mining of spatio-temporal phenomena produced at UCLA, and also for extraction and analysis of cyclonic activity, blocking features, and oceanic warm pools.
[Figure: Number of upward wave propagation trajectories between 500mb and 50mb levels extracted per year]

12 Implementations … (Contd.)
[Figure: Number of upward wave propagation trajectories between 500mb and 50mb at different latitudes]

13 Conclusion
- Conquest: a geoscientific data model that applies distributed and parallel database query processing to handle computationally expensive data mining queries on massive datasets.
- Helps analyze large volumes of data to extract the necessary information.
- Query optimization emphasizes parallelization and optimal data access.
- Future work: the system is currently being integrated as part of a larger environment.

14 References
- E.C. Shek, R.R. Muntz, E. Mesrobian, and K. Ng, "Scalable Exploratory Data Mining of Distributed Geoscientific Data", KDD, 1996
- E.C. Shek, E. Mesrobian, and R.R. Muntz, "On Heterogeneous Distributed Geoscientific Query Processing", Feb. 1996
- F. Fabbrocino, E.C. Shek, R.R. Muntz, "The Design and Implementation of the Conquest Query Execution Environment", July 1997
- E. Mesrobian, et al…, "Exploratory Data Mining and Analysis Using Conquest", May 1995

