Discovery of Patterns in the Global Climate System using Data Mining
© Vipin Kumar, August 20, 2003

Presentation transcript:

Discovery of Patterns in the Global Climate System using Data Mining
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science, University of Minnesota
Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF

What is Data Mining?
• Many definitions
  – Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is (not) Data Mining?
• What is not data mining?
  – Looking up a phone number in a phone directory
  – Querying a Web search engine for information about “Amazon”
• What is data mining?
  – Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, … in the Boston area)
  – Grouping together similar documents returned by a search engine according to their context (Amazon rainforest, Amazon.com, etc.)

Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
  – Web data
      ▪ Yahoo! collects 10 GB/hour
  – Purchases at department/grocery stores
      ▪ Walmart records 20 million transactions per day
  – Bank/credit card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
  – Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
  – Remote sensors on a satellite
      ▪ NASA EOSDIS archives over 1 petabyte of Earth science data per year
  – Telescopes scanning the skies
      ▪ Sky survey data
  – Gene expression data
  – Scientific simulations
      ▪ Terabytes of data generated in a few hours
• Traditional techniques are infeasible for raw data
• Data mining may help scientists
  – in automated analysis of massive data sets
  – in hypothesis formation

Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take too long to discover useful information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts.]
Ref: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications

Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
[Figure: Data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems.]

Role of Parallel & Distributed Computing
• High Performance Computing (HPC) is often critical for scalability to large data sets
  – Many algorithms use more than O(n) computation time
  – Sequential computers have limited memory, thus requiring multiple, expensive I/O passes over data
• Distributed computing is needed because data is distributed
  – due to privacy reasons
  – physically dispersed over many different geographic locations
[Figure: Data mining at the intersection of machine learning/pattern recognition, statistics/AI, high performance computing, and database systems.]

Data Mining Tasks...
• Predictive Modeling
• Clustering
• Association Rules
• Anomaly Detection
[Figure: diagram of the four tasks applied to a data set.]

Predictive Modeling
• Find a model for the class attribute as a function of the values of the other attributes
[Figure: a training table with categorical, continuous, and class attributes is used to learn a classifier. The resulting model for predicting tax evasion is a decision tree that tests Married and Income (thresholds at 100K and 80K) and ends in YES/NO leaves.]

Predictive Modeling: Applications
• Targeted Marketing
• Customer Attrition/Churn
• Classifying Galaxies
  – Class: stages of formation (early, intermediate, late)
  – Attributes: image features, characteristics of light waves received, etc.
  – Sky survey data size: 72 million stars, 20 million galaxies
  – Object catalog: 9 GB
  – Image database: 150 GB
Courtesy:

Clustering
• Given a set of data points, find groupings such that
  – Data points in one cluster are more similar to one another
  – Data points in separate clusters are less similar to one another
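The grouping criterion above can be illustrated with a minimal k-means sketch. K-means is used here only as a familiar stand-in; the slide does not name a specific algorithm, and the sample points, the starting centroids, and the function name `kmeans` are all made up for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Hypothetical data: two visually separate groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels, centers)
```

Points that end up with the same label are closer to their own centroid than to the other one, which is one concrete reading of "more similar to one another".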

Clustering: Applications
• Market Segmentation
• Gene expression clustering
• Document Clustering

Association Rule Discovery
• Given a set of records, find dependency rules which will predict the occurrence of an item based on occurrences of other items in the record
• Applications
  – Marketing and Sales Promotion
  – Supermarket shelf management
  – Inventory Management
Rules Discovered:
  {Milk} --> {Coke}
  {Diaper, Milk} --> {Beer}
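Rules such as {Milk} --> {Coke} are usually scored by support and confidence. The sketch below shows one way to compute both; the five transactions are invented for illustration and are not the data behind the slide, and the helper names are hypothetical.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)

def support(transactions, lhs, rhs):
    """Fraction of all transactions containing both sides of the rule."""
    return count(transactions, set(lhs) | set(rhs)) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Among transactions containing lhs, the fraction that also contain rhs."""
    return count(transactions, set(lhs) | set(rhs)) / count(transactions, lhs)

# Illustrative market-basket records (assumed, not from the source).
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

milk_coke_support = support(transactions, {"Milk"}, {"Coke"})
milk_coke_confidence = confidence(transactions, {"Milk"}, {"Coke"})
print(milk_coke_support, milk_coke_confidence)
```

A rule miner would enumerate candidate itemsets and keep only rules whose support and confidence exceed chosen thresholds; this sketch only scores a rule that is already given.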

Deviation/Anomaly Detection
• Detect significant deviations from normal behavior
• Applications:
  – Credit Card Fraud Detection
  – Network Intrusion Detection
      ▪ Typical network traffic at the university level may reach over 100 million connections per day
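One very simple way to flag "significant deviations from normal behavior" is a z-score test, sketched below. This is an assumed baseline for illustration, not the technique used in the fraud or intrusion detection systems the slide mentions; the traffic counts and the threshold are made up.

```python
import statistics

def zscore_anomalies(values, threshold=2.5):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Hypothetical daily connection counts (in millions) with one unusual day.
daily_connections = [100, 102, 98, 101, 99, 103, 97, 100, 500, 101]
anomalous_days = zscore_anomalies(daily_connections)
print(anomalous_days)
```

With only ten samples, a single point can be at most three population standard deviations from the mean, which is why the assumed threshold sits below 3.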

Discovery of Patterns in the Earth Science Data
• Global snapshots of values for a number of variables on land surfaces or water
• Data sources:
  – weather observation stations
  – earth orbiting satellites (since 1981)
  – model-based data
NASA ESE questions:
• How is the global Earth system changing?
• What are the primary forcings?
• How does the Earth system respond to natural and human-induced changes?
• What are the consequences of changes in the Earth system?
• How well can we predict future changes?

Climate Indices: Connecting the Ocean/Atmosphere and the Land
• A climate index is a time series of sea surface temperature or sea level pressure
• Climate indices capture teleconnections
  – The simultaneous variation in climate and related processes over widely separated points on the Earth
[Figure: El Nino events and the Nino 1+2 index.]

Discovery of Climate Indices Using Clustering
A novel clustering technique was developed to identify regions of uniform behavior in spatio-temporal data. The use of clustering for discovering climate indices is driven by the intuition that a climate phenomenon is expected to involve a significant region of the ocean or atmosphere where the behavior is relatively uniform over the entire area.
A cluster-based approach for discovering climate indices provides better physical interpretation than approaches based on the SVD/EOF paradigm, and provides candidate indices with better predictive power than known indices for some land areas.
Some SST clusters reproduce well-known climate indices. In particular, we were able to replicate the four El Nino SST-based indices: cluster 94 corresponds to NINO 1+2, 67 to NINO 3, 78 to NINO 3.4, and 75 to NINO 4. The correlations of these clusters to their corresponding indices are higher than 0.9.
Some SST clusters, e.g., cluster 29, are significantly different from known indices, but provide better correlation with land climate variables than known indices for many parts of the globe. The bottom figure shows the difference in correlation to land temperature between cluster 29 and the El Nino indices. Areas in yellow indicate where cluster 29 has higher correlation.
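The comparison between a cluster and a known index can be sketched as follows: average the time series of the cluster's grid points into a centroid, then take the Pearson correlation of that centroid with the index. The numbers below are invented, and treating the centroid as a plain point-wise mean is an assumption; the clustering technique itself is not reproduced here.

```python
import math

def centroid(series_list):
    """Point-wise mean of equally long time series (assumed cluster centroid)."""
    return [sum(vals) / len(vals) for vals in zip(*series_list)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical SST anomaly series for three grid points in one cluster.
cluster_members = [
    [0.1, 0.5, 0.9, 0.4, -0.2, -0.6],
    [0.0, 0.6, 1.0, 0.3, -0.1, -0.7],
    [0.2, 0.4, 0.8, 0.5, -0.3, -0.5],
]
# Hypothetical reference index covering the same months.
reference_index = [0.2, 0.6, 1.1, 0.3, -0.3, -0.8]

cluster_centroid = centroid(cluster_members)
r = pearson(cluster_centroid, reference_index)
print(round(r, 3))
```

A correlation above 0.9, as reported for the four El Nino clusters, would mean the centroid tracks the index almost month for month.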

Mining the Climate Data: Clustering
[Figure panels: clusters of SST that have high impact on land temperature; El Nino regions defined by Earth scientists.]
• Number of grid points: 67K land, 40K ocean
• Current data size range: 20 – 400 MB
• Monthly data over a range of 17 to 50 years

SST Cluster Moderately Correlated to Known Indices
Ref: Steinbach et al 2002/2003 (KDD 2003)

Correlation of Known Indices with SST Cluster Centroids and SVD Components
[Table, values not recovered. Columns: Climate Index; Cluster Centroids (Best-shifted Correlation, Best Centroid); SVD Components (Best SVD Correlation, Best Component). Rows: SOI (G0), NAO (G2), AO (G1), PDO (G1), QBO (G1), CTI (G0), WP (G0), and four NINO indices (G0).]

SLP Clusters
[Figure: SLP clusters, with labels for DMI, SOI, NAO, and AO.]

Pair of SLP Clusters that Correspond to SOI
[Figure panels: centroids of SLP clusters 13 and 20; cluster centroid 20 – 13 versus SOI.]
Correlation = 0.75

Finding New Patterns: Indian Monsoon Dipole Mode Index
• Recently a new index, the Indian Ocean Dipole Mode Index (DMI), has been discovered.
• DMI is defined as the difference in SST anomaly between the region 5S-5N, 55E-75E and the region 0-10S, 85E-95E.
• DMI is an indicator of a weak monsoon over the Indian subcontinent and heavy rainfall over East Africa.
• We can reproduce this index as a difference of pressure indices of clusters 16 and 22.
[Figure: plot of cluster 16 – cluster 22 versus the Indian Ocean Dipole Mode Index. Indices smoothed using a 12-month moving average.]
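The two operations behind the plot, a difference of two monthly series and a 12-month moving average, can be sketched as below. The input values are placeholders, and whether the slide's smoothing window is trailing or centered is not stated; a trailing window is assumed here.

```python
def difference_index(series_a, series_b):
    """Month-by-month difference of two equally long series (a minus b)."""
    return [a - b for a, b in zip(series_a, series_b)]

def moving_average(series, window=12):
    """Trailing moving average; returns len(series) - window + 1 values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Placeholder monthly centroid values for two clusters over two years.
cluster_a = [float(m) for m in range(24)]
cluster_b = [0.5 * m for m in range(24)]

raw_index = difference_index(cluster_a, cluster_b)
smoothed_index = moving_average(raw_index, window=12)
print(len(raw_index), len(smoothed_index), smoothed_index[0])
```

The smoothed series is shorter than the raw one by `window - 1` months, so the index being compared against needs the same smoothing and alignment before the two are plotted or correlated.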

Mining the Climate Data: Associations
FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)
[Figure: map highlighting grassland/shrubland areas.]
• The association rule is interesting because it appears mainly in regions with grassland/shrubland vegetation type
Ref: Tan et al 2001

Release: 03-51AR
NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS
NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.
Detection of Ecosystem Disturbances
Detection of sudden changes in greenness over extensive areas from these large global satellite data sets required development of automated techniques that take into account the timing, location, and magnitude of such changes. An algorithm was designed to identify any significant and sustained declines in FPAR during an 18 year time period. This algorithm transforms a non-stationary time series to a sequence of disturbance events. Techniques were also developed to discover associations between ecosystem disturbance regimes and historical climate anomalies. These algorithms and techniques have allowed Earth Science researchers to gain a deeper insight into the interplay among natural disasters, human activities and the rise of carbon dioxide in Earth's atmosphere during two recent decades.

Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes
Correlation analysis revealed a strong connection between the rainfall patterns generated by the South American monsoon system and terrestrial greenness over a large section of the southern Amazon region. This is the first direct evidence of large-scale effects of the Atlantic Ocean rainfall systems on yearly greenness changes in the Amazon region, and the finding has important implications for the impacts of "slash and burn" deforestation on this crucial ecosystem of the world.

High Resolution EOS Data
• EOS satellites (Earth Observing System, e.g., the Terra and Aqua satellites) provide high resolution measurements
  – Finer spatial grids
      ▪ An 8 km × 8 km grid produces 10,848,672 data points
      ▪ A 1 km × 1 km grid produces 694,315,008 data points
  – More frequent measurements
  – Multiple instruments
      ▪ Generates terabytes of data per day
• High resolution data allows us to answer more detailed questions:
  – Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties
  – Finding relationships between leaf area index (LAI) and topography of a river drainage basin
  – Finding relationships between fire frequency and elevation as well as topographic position
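The two grid sizes quoted above are consistent with each other: shrinking the cell edge from 8 km to 1 km multiplies the number of cells by 8 × 8 = 64. The short check below confirms this; the storage estimate at the end assumes 4 bytes per value, which is an illustrative choice and not a figure from the slides.

```python
POINTS_8KM = 10_848_672    # data points on the 8 km x 8 km grid (from the slide)
POINTS_1KM = 694_315_008   # data points on the 1 km x 1 km grid (from the slide)

def refined_point_count(points, linear_factor):
    """Points after dividing each cell edge by linear_factor in both directions."""
    return points * linear_factor ** 2

predicted_1km = refined_point_count(POINTS_8KM, linear_factor=8)

# Assumed 4 bytes per value: size of one global snapshot of one variable.
BYTES_PER_VALUE = 4
snapshot_gb = POINTS_1KM * BYTES_PER_VALUE / 1e9

print(predicted_1km == POINTS_1KM, round(snapshot_gb, 2))
```

Under that assumption a single 1 km snapshot of one variable is a few gigabytes, so frequent measurements from multiple instruments add up quickly.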

Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining: Journal Publications
• Potter, C., Tan, P., Steinbach, M., Klooster, S., Kumar, V., Myneni, R., Genovese, V., Major disturbance events in terrestrial ecosystems detected using global satellite data sets. Global Change Biology, July,
• Potter, C., Klooster, S. A., Myneni, R., Genovese, V., Tan, P., Kumar, V., Continental scale comparisons of terrestrial carbon sinks estimated from satellite data and ecosystem modeling. Global and Planetary Change (in press)
• Potter, C., Klooster, S. A., Steinbach, M., Tan, P., Kumar, V., Shekhar, S., Nemani, R., Myneni, R., Global teleconnections of climate to terrestrial carbon flux. Geophys J. Res.-Atmospheres (in press)
• Potter, C., Klooster, S., Steinbach, M., Tan, P., Kumar, V., Myneni, R., Genovese, V., Variability in Terrestrial Carbon Sinks Over Two Decades: Part 1 – North America. Geophysical Research Letters (in press)
• Potter, C., Klooster, S., Steinbach, M., Tan, P., Kumar, V., Shekhar, S., and Carvalho, C., Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes. Global Change Biology (in press)
• Potter, C., Zhang, P., Shekhar, S., Kumar, V., Klooster, S., and Genovese, V., Understanding the Controls of Historical River Discharge Data on Largest River Basins. (in preparation)

Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining: Conference/Workshop Publications
• Steinbach, M., Tan, P., Kumar, V., Potter, C., and Klooster, S., Discovery of Climate Indices Using Clustering, KDD 2003, Washington, D.C., August 24-27,
• Zhang, P., Huang, Y., Shekhar, S., and Kumar, V., Exploiting Spatial Autocorrelation to Efficiently Process Correlation-Based Similarity Queries, Proc. of the 8th Intl. Symp. on Spatial and Temporal Databases (SSTD '03)
• Zhang, P., Huang, Y., Shekhar, S., and Kumar, V., Correlation Analysis of Spatial Time Series Datasets: A Filter-And-Refine Approach, Proc. of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '03)
• Ertoz, L., Steinbach, M., and Kumar, V., Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data, Proc. of Third SIAM International Conference on Data Mining.
• Tan, P., Steinbach, M., Kumar, V., Potter, C., Klooster, S., and Torregrosa, A., Finding Spatio-Temporal Patterns in Earth Science Data, KDD 2001 Workshop on Temporal Data Mining, San Francisco
• Kumar, V., Steinbach, M., Tan, P., Klooster, S., Potter, C., and Torregrosa, A., Mining Scientific Data: Discovery of Patterns in the Global Climate System, Proc. of the 2001 Joint Statistical Meeting, Atlanta