Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
  – remote sensors on a satellite
      NASA EOSDIS archives over 1 petabyte of Earth Science data per year
  – telescopes scanning the skies
      Sky survey data
  – gene expression data
  – scientific simulations
      terabytes of data generated in a few hours
• Traditional techniques infeasible for raw data
• Data mining may help scientists
  – in automated analysis of massive data sets
  – in hypothesis formation
Predictive Modeling: Applications
• Targeted Marketing
• Customer Attrition/Churn
• Classifying Galaxies
  – Class: stages of formation (Early, Intermediate, Late)
  – Attributes: image features, characteristics of light waves received, etc.
  – Sky Survey data size: 72 million stars, 20 million galaxies
  – Object catalog: 9 GB
  – Image database: 150 GB
Courtesy:
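To make the galaxy example concrete, here is a minimal sketch of predictive modeling as classification: learn one centroid per class from labeled feature vectors, then assign a new object to the nearest centroid. The nearest-centroid method, the function names `fit_centroids` and `predict`, and the two-number feature vectors are all assumptions chosen for illustration; the slide does not say which classifier or which image features were used.

```python
import math
from collections import defaultdict


def fit_centroids(features, labels):
    """Return the mean feature vector (centroid) for each class label."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))


# Made-up toy feature vectors, NOT real sky survey measurements.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]
train_y = ["early", "early", "intermediate", "intermediate", "late", "late"]

centroids = fit_centroids(train_x, train_y)
print(predict(centroids, [0.2, 0.9]))
```

Real catalogs would have many more attributes per object, and a stronger classifier could be swapped in without changing the overall train-then-predict shape.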
Discovery of Patterns in the Earth Science Data
• Global snapshots of values for a number of variables on land surfaces or water
• Data sources:
  – weather observation stations
  – earth-orbiting satellites (since 1981)
  – model-based data
NASA ESE questions:
• How is the global Earth system changing?
• What are the primary forcings?
• How does the Earth system respond to natural and human-induced changes?
• What are the consequences of changes in the Earth system?
• How well can we predict future changes?
SST Cluster Moderately Correlated to Known Indices
Ref: Steinbach et al., 2002/2003 (KDD 2003)
Correlation of Known Indices with SST Cluster Centroids and SVD Components
Table columns:
  Climate Indices
  Cluster Centroids: Best-shifted Correlation | Best Centroid
  SVD Components: Best SVD Correlation | Best Component
Table rows (climate indices): SOI (G0), NAO (G2), AO (G1), PDO (G1), QBO (G1), CTI (G0), WP (G0), NINO (G0), NINO (G0), NINO (G0), NINO (G0)
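One plausible reading of "best-shifted correlation" is the Pearson correlation between a climate index and a cluster centroid time series, maximized in absolute value over a small range of time shifts. The sketch below follows that reading. The function names, the `max_lag` parameter, and its default of 6 time steps are assumptions for illustration, not details taken from the cited work.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def best_shifted_correlation(index, centroid, max_lag=6):
    """Return (r, lag) with the largest |r| over lags in [-max_lag, max_lag].

    A positive lag pairs index[t + lag] with centroid[t].
    """
    best_r, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = index[lag:], centroid
        else:
            a, b = index, centroid[-lag:]
        n = min(len(a), len(b))  # compare only the overlapping portion
        if n < 3:
            continue
        r = pearson(a[:n], b[:n])
        if abs(r) > abs(best_r):
            best_r, best_lag = r, lag
    return best_r, best_lag
```

Calling `best_shifted_correlation(index_series, centroid_series)` on two monthly series would return the strongest correlation and the shift at which it occurs; the same routine could be applied to an SVD component in place of the centroid.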
High Resolution EOS Data
• EOS satellites provide high resolution measurements
  – Finer spatial grids
      8 km × 8 km grid produces 10,848,672 data points
      1 km × 1 km grid produces 694,315,008 data points
  – More frequent measurements
  – Multiple instruments
      Generates terabytes of data per day
• High resolution data allows us to answer more detailed questions:
  – Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  – Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  – Finding relationships between fire frequency and elevation as well as topographic position
Earth Observing System (e.g., Terra and Aqua satellites)
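The two grid sizes quoted above are consistent with simple area scaling: refining each cell edge by a factor of 8 multiplies the number of cells by 8² = 64. A short check of that arithmetic, where the helper name `points_at_resolution` is hypothetical and the cell sizes are assumed to divide evenly:

```python
def points_at_resolution(points_coarse, coarse_km, fine_km):
    """Scale a grid-point count from a coarse to a finer square grid.

    Assumes coarse_km is an integer multiple of fine_km.
    """
    factor = (coarse_km // fine_km) ** 2  # each coarse cell splits into factor cells
    return points_coarse * factor


points_8km = 10_848_672
points_1km = points_at_resolution(points_8km, 8, 1)
print(points_1km)  # 694315008
```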