

Extraction of high-level features from scientific data sets
Eui-Hong (Sam) Han
Department of Computer Science and Engineering, University of Minnesota
Research supported by NSF, DOE, Army Research Office, AHPCRC/ARL
Joint work with George Karypis, Ravi Janardan, Vipin Kumar, M. Pino Martin, Ivan Marusic, and Graham Candler

Scientific Data Sets
- Large amounts of raw data are available from scientific domains:
  - direct numerical simulations
  - NASA satellite observations/climate data
  - genomics
  - astronomy
- How do we apply existing data mining techniques to these data sets?

Direct Numerical Simulation

El Nino Effects on the Biosphere C Potter and S. Klooster, NASA Ames Research Center

C4.5 Decision Trees
[Figure: example decision tree. Splitting attributes: Refund (Yes / No), MarSt (Married / Single, Divorced), TaxInc (< 80K / > 80K). Leaves: YES, NO. Column labels on the training data: categorical, continuous, class.]
- The splitting attribute is determined based on the Gini index or entropy gain.
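The two impurity measures named on this slide can be sketched as follows. This is a minimal illustration, not the C4.5 implementation used in the study; the function names and the toy class labels are assumptions made for the example.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class fractions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def split_gain(parent, children, impurity):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Toy labels (assumed): 7 NO and 3 YES records, split into two child nodes.
parent = ["NO"] * 7 + ["YES"] * 3
left = ["NO"] * 3
right = ["NO"] * 4 + ["YES"] * 3

gini_gain = split_gain(parent, [left, right], gini)
entropy_gain = split_gain(parent, [left, right], entropy)
```

A tree builder would evaluate `split_gain` for every candidate attribute and split on the one with the largest gain.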

Associations in Transaction Data Sets
- Dependency relations among collections of items appearing in transactions.
- Frequent item sets: sets of items that frequently appear together in transactions
  - |{Diaper, Milk}| = 3
  - |{Diaper, Milk, Beer}| = 2
- Association rules
- Application areas
  - Inventory/shelf planning
  - Marketing and promotion
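The support counts on this slide can be reproduced with a short sketch. The transaction table is not part of the transcript, so the five baskets below are an assumed example chosen to be consistent with the two counts shown; the function names are likewise hypothetical.

```python
from itertools import combinations

# Assumed market-basket transactions (not given in the transcript).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the item set."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def frequent_itemsets(transactions, min_count, size):
    """All item sets of the given size whose support count is at least min_count."""
    all_items = sorted(set().union(*transactions))
    result = {}
    for candidate in combinations(all_items, size):
        count = support_count(candidate, transactions)
        if count >= min_count:
            result[candidate] = count
    return result


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs => rhs: support(lhs and rhs) / support(lhs)."""
    both = support_count(set(lhs) | set(rhs), transactions)
    return both / support_count(lhs, transactions)


pairs = frequent_itemsets(transactions, min_count=3, size=2)
rule_conf = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
```

This enumerates all candidates of one size, which is fine for a toy table; practical miners prune candidates level by level.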

Challenges of Applying Data Mining Techniques
- How do we construct transactions?
  - in the presence of spatial attributes
  - in the presence of temporal attributes
- What are "interesting" events in the transactions?
  - high-level objects (e.g., a vortex in a simulation)
  - high-level features (e.g., an El Nino event in weather data)
- How do we find knowledge from the transactions and interesting events?

Feature extraction from simulation data using decision trees
[Figure: 3-D isosurface of "swirl strength"]
[Figure: velocity normal to the wall on the XY plane (at z=30)]
- Which features are important for high upward velocity on the XY plane?

Transaction construction
- Given: 3D swirl strength data and the corresponding velocity data on the XY plane at each simulation time step.
  - swirl_strength(x,y,z) = 1 iff swirl strength at (x,y,z) > swirl threshold
  - velocity(x,y) = 1 iff upward velocity at (x,y) > velocity threshold
  - velocity(x,y) = -1 iff downward velocity at (x,y) > velocity threshold
- A transaction corresponds to a grid point on the XY plane at one time step.
  - The class is the velocity of the grid point.
  - The attributes correspond to swirl_strength(x,y,z) of the neighbors of the point.
[Figure: grid point with x, y, and z axes and a neighbor region labeled ss(-1:1,2:3,4:7)]
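The thresholding and transaction layout described on this slide can be sketched for a single time step. This is an illustration under several assumptions not stated in the transcript: velocity is stored as a signed value with positive meaning upward, neighbors outside the grid count as 0, attributes are single-cell offsets rather than the range-style regions such as ss(-1:1,2:3,4:7), and all function and parameter names are hypothetical.

```python
def binarize_swirl(swirl, threshold):
    """swirl[x][y][z] -> 1 if the swirl strength exceeds the threshold, else 0."""
    return [[[1 if v > threshold else 0 for v in column] for column in plane]
            for plane in swirl]


def velocity_class(v, threshold):
    """1 for strong upward velocity, -1 for strong downward velocity, else 0."""
    if v > threshold:
        return 1
    if -v > threshold:
        return -1
    return 0


def build_transactions(swirl, velocity, swirl_thr, vel_thr, offsets, z0=0):
    """One transaction per (x, y) grid point of the plane at z index z0.

    Each transaction is (attributes, class): attributes map a neighbor offset
    (dx, dy, dz) to the binarized swirl strength at that neighbor.
    """
    ss = binarize_swirl(swirl, swirl_thr)
    nx, ny, nz = len(ss), len(ss[0]), len(ss[0][0])
    rows = []
    for x in range(nx):
        for y in range(ny):
            attrs = {}
            for dx, dy, dz in offsets:
                i, j, k = x + dx, y + dy, z0 + dz
                inside = 0 <= i < nx and 0 <= j < ny and 0 <= k < nz
                # Assumption: neighbors outside the grid are treated as 0.
                attrs[(dx, dy, dz)] = ss[i][j][k] if inside else 0
            rows.append((attrs, velocity_class(velocity[x][y], vel_thr)))
    return rows


# Toy 2x2x2 swirl field and 2x2 velocity plane (made-up values).
swirl = [[[0.1, 0.9], [0.2, 0.3]], [[0.8, 0.1], [0.0, 0.7]]]
velocity = [[0.5, -0.6], [0.0, 0.2]]
offsets = [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
rows = build_transactions(swirl, velocity, 0.5, 0.4, offsets)
```

Running the function once per time step and concatenating the rows gives the transaction set that is passed to the classifier.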

C4.5 results on the simulation data
- Given simulation data of 1000 time points:
  - the first 500 time points were used as the training set
  - the second 500 time points were used as the testing set
  - 10% sample of class 0 transactions
- 95% classification accuracy
- Recall/precision of 0.83/0.95 for class -1 and 0.67/0.93 for class 1
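For reference, per-class recall and precision of the kind reported on this slide can be computed as in the sketch below. The label lists are made up for illustration and do not reproduce the reported numbers; the function names are assumptions.

```python
def recall_precision(y_true, y_pred, cls):
    """Return (recall, precision) for one class, in the order used on the slide."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def accuracy(y_true, y_pred):
    """Fraction of transactions whose predicted class equals the true class."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up true and predicted classes for eight test transactions.
y_true = [1, 1, 1, 0, 0, -1, -1, 0]
y_pred = [1, 1, 0, 0, 0, -1, 0, -1]

up = recall_precision(y_true, y_pred, 1)
down = recall_precision(y_true, y_pred, -1)
acc = accuracy(y_true, y_pred)
```

Reporting recall and precision per class matters here because class 0 dominates the data, so overall accuracy alone says little about the two velocity classes of interest.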

Discovered Rules & Features
- (F1: ss(0,1,0) = 0 & ss(-1,-2:-3,-4:-7) = 1 & ss(-1:1,-2:-3,8:15) = 1 & ss(1,0,2:3) = 1) => class 1
- (F2: ss(0,1,0) = 0 & ss(-1:1,-2:-3,-4:-7) = 0 & ss(1,-1,-2:-3) = 0 & ss(2:3,2:3,-16:-31) = 0 & ss(1:0:-1) = 0) => class 0
- (F3: ss(0,1,0) = 0 & …. & ss(-2:-3,2:3,8:15) = 1) => class -1
[Figure: F1 => class 1]

How to use the discovered features?
- Finding association rules
  - (F1, Vortex Type A) => (high energy, F5)
- Finding sequential patterns
  - (F2, Vortex Type A) => (F3, Vortex Type B) => (class 1)
- Finding clusters of upward velocity points based on discovered features, vortex types, and other variables.

Finding functional relationships
- Regression techniques find global and/or contiguous relationships
- Association rules find local relationships with sufficient support
- Need to find global relationships that have sufficient support

Finding functional relationships using duality transformation
- Duality transformation in 2D space
  - Point p = (a,b) => line p': y = ax - b
  - Line l: y = Ax - B => point l' = (A,B)
  - p on l => l' on p'
  - l = line between p and q => l' = intersection of p' and q'
[Figure: points a, b, c, d in the original space; their dual lines in the transformed space intersect at (1,-1); the solution in the original space is the line y = x + 1]
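The duality properties listed above can be checked numerically. In this sketch a line y = Ax - B is stored as the pair (A, B), so the dual of a point (a, b) is the line stored as (a, b) and the dual of a line (A, B) is the point (A, B). The sample points are assumed values on y = x + 1 (the figure's coordinates are not in the transcript), and the function names are hypothetical.

```python
from fractions import Fraction


def on_line(point, line):
    """True if point (x, y) lies on the line y = A*x - B stored as (A, B)."""
    x, y = point
    A, B = line
    return y == A * x - B


def line_through(p, q):
    """Line y = A*x - B through two points with different x coordinates."""
    A = Fraction(q[1] - p[1], q[0] - p[0])
    B = A * p[0] - p[1]
    return (A, B)


def intersection(l1, l2):
    """Intersection point of two non-parallel lines stored as (A, B)."""
    a1, b1 = l1
    a2, b2 = l2
    x = Fraction(b1 - b2, a1 - a2)
    return (x, a1 * x - b1)


# Two assumed points on y = x + 1, i.e. A = 1 and B = -1.
p, q = (0, 1), (1, 2)
l = line_through(p, q)          # line in the original space
dual_point = intersection(p, q)  # p and q reinterpreted as dual lines p', q'
```

Here `l` and `dual_point` are both (1, -1), matching the intersection shown in the transformed space for the line y = x + 1.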

Finding functional relationships using duality transformation
- Given n points in d dimensions, find all hyperplanes that contain at least k data points.
- In the transformed space: given n hyperplanes in d dimensions, find all intersection points through which at least k hyperplanes pass.
- Efficient algorithms to find intersections exist.
- These intersections correspond to the hyperplanes in the original space.
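For the 2D case, the problem statement above can be illustrated with a naive pairwise search. This is a brute-force sketch over all pairs of points, not one of the efficient intersection algorithms the slide refers to. It assumes integer coordinates so that exact fractions can be used as dictionary keys, it skips vertical lines (which y = Ax - B cannot represent), and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def lines_with_support(points, k):
    """Map each line (A, B), meaning y = A*x - B, to the indices of the points
    on it, keeping only lines that contain at least k points."""
    support = {}
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if p[0] == q[0]:
            continue  # dual lines are parallel: no intersection point
        # Intersection of the dual lines p' and q' in the transformed space.
        A = Fraction(q[1] - p[1], q[0] - p[0])
        B = A * p[0] - p[1]
        support.setdefault((A, B), set()).update((i, j))
    return {line: sorted(idx) for line, idx in support.items() if len(idx) >= k}


# Four assumed points on y = x + 1 plus three noise points.
points = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 5), (2, 0), (5, 1)]
found = lines_with_support(points, k=4)
```

With noisy real-valued data, intersections rarely coincide exactly, so they would have to be binned or clustered rather than matched exactly as done here.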

Functional relationships in synthetic data sets
- 1054 data points and 2000 noise points
- Found the intersections, in the transformed space, of the dual lines of every pair of points
- Drew a slope-sensitive grid on the transformed space
- Selected the grid cells whose number of intersection points is above a threshold
- Plotted the average corresponding line of each selected grid cell in the original point space

Functional relationships in Ozone study
- Case Studies in Environmental Statistics, by D. Nychka, W. Piegorsch, and L. Cox ( b.book/index.html)
- Daily maximum ozone measurement in parts per million (ppm), temperature, wind speed, etc., from 04/01/81 to 10/31/91 over the Chicago area
- Found the most dominant functional relationship
  - wspd = 0.09*ozone *temp + 2.9

Functional relationships in Ozone study
- Found a less dominant functional relationship
  - wspd = 0.5*ozone - 0.4*temp
- This functional relationship covers only a subset of the data points, at the lower levels of ozone measurement
- Potential follow-up studies
  - What is unique about this functional relationship?
  - Are there any unique characteristics of the supporting set?

How to use discovered functional relationships?
- Discover decision rules using both functional relationships and original variables.
  - (supporting R1) and (Humidity > 80%) => class high-ozone-level
- Discover association rules and sequential patterns with these functional relationships
  - ((supporting R2), Vortex Type A) => (high upward velocity)
- Comparative analysis of the supporting sets of R1 and R2.

Research Issues in Finding Functional Relationships
- Non-linear relationships can be found by introducing extra variables like x^2, sin(x), exp(x) for every variable x.
- Spatial relationships can be found by introducing variables of neighbors.
- Temporal relationships can also be found by associating a time stamp with variables.
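The first point can be illustrated with a small sketch: after adding derived variables, a relationship that is non-linear in x becomes linear in the expanded variable set, so a linear search can find it. The function name, the variable naming scheme, and the toy relationship y = 2x^2 + 1 are assumptions made for the example.

```python
import math


def expand_features(row):
    """Add x^2, sin(x), and exp(x) for every variable x in the row (a dict)."""
    out = dict(row)
    for name, x in row.items():
        out[f"{name}^2"] = x * x
        out[f"sin({name})"] = math.sin(x)
        out[f"exp({name})"] = math.exp(x)
    return out


# Toy data following y = 2*x^2 + 1, which is not linear in x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2 * x * x + 1 for x in xs]
expanded = [expand_features({"x": x}) for x in xs]

# In the expanded space, the points (x^2, y) are collinear with slope 2.
slopes = [
    (ys[i + 1] - ys[i]) / (expanded[i + 1]["x^2"] - expanded[i]["x^2"])
    for i in range(len(xs) - 1)
]
```

The trade-off is that every added variable increases d, the number of variables a relationship search has to consider.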

Research Issues in Finding Functional Relationships
- High computational cost of O(n^d), where n is the number of data points and d is the number of variables in the relationships.
- Approximation algorithms are needed:
  - Clustering data points to reduce n
  - Focusing methods, where inexact solutions are found using faster algorithms and more accurate relationships are then found by focusing on these inexact solutions
  - Iterative methods, where the most dominant relationship is found first and less dominant relationships are found in later iterations