LSST-VAO discussion: Distributed Data Mining (DDM). Kirk Borne, George Mason University, March 24, 2011.

Presentation transcript:

LSST-VAO discussion: Distributed Data Mining (DDM) Kirk Borne George Mason University March 24, 2011

The LSST Data Challenges

The LSST Data Mining Challenges
1. Massive data stream: ~2 terabytes of image data per hour that must be mined in real time (for 10 years).
2. Massive 20-petabyte database: more than 50 billion objects need to be classified, and most will be monitored for important variations in real time.
3. Massive event stream: knowledge extraction in real time for 100,000 events each night.
Challenge #1 includes both the static data mining aspects of #2 and the dynamic data mining aspects of #3.
Look at #2 and #3 in more detail...

LSST data mining challenge #2
Accurately characterize and classify 50 billion objects and 20 trillion source observations.
– Requires VO-accessible multi-wavelength data
– Szalay’s Law: astrophysical discovery potential grows as (number of data sources)^2
– Benefits of very large datasets:
   – best statistical analysis of “typical” events
   – automated search for “rare” events

LSST data mining challenge #3
Approximately 100,000 times each night for 10 years, LSST will obtain the following data on a new sky event, and we will be challenged with classifying these data:
[Figure: light curve of flux versus time; more data points help!]
Characterize first, then classify.
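The "characterize first" step can be sketched as reducing a (time, flux) series to a handful of descriptors that a later classifier could use. This is a minimal illustration only: the function name `characterize` and the particular features chosen are assumptions, not the LSST pipeline's actual feature set.

```python
from statistics import mean, pstdev


def characterize(times, fluxes):
    """Reduce a light curve to a few descriptive features (illustrative choice)."""
    if len(times) != len(fluxes) or len(times) < 2:
        raise ValueError("need at least two matching (time, flux) points")
    baseline = min(fluxes)
    peak = max(fluxes)
    peak_index = fluxes.index(peak)  # first occurrence of the maximum
    return {
        "n_points": len(fluxes),
        "mean_flux": mean(fluxes),
        "scatter": pstdev(fluxes),            # population standard deviation
        "amplitude": peak - baseline,
        "time_of_peak": times[peak_index],
        "duration": times[-1] - times[0],     # assumes times are sorted
    }


# Hypothetical event: a brief brightening around t = 2
times = [0.0, 1.0, 2.0, 3.0, 4.0]
fluxes = [1.0, 1.0, 5.0, 2.0, 1.0]
features = characterize(times, fluxes)
print(features)
```

With more data points the same descriptors become better constrained, which is the sense in which more points help; assigning a class to the feature vector is a separate step.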

Characterization Use Case #1
Feature detection and extraction:
– Automated pipelines’ tasks: Characterize!
   Identify and describe features in the data
   Extract feature descriptors from the data
   Curate these features for scientific re-use
– Human experts’ tasks: Categorize and Classify!
   Associate features with astrophysical processes
   Find boundaries between feature sets and label them
– Example: Star-Galaxy Separation

Characterization Use Case #2
The clustering problem:
– Finding clusters of objects within a data set
– Pipeline: apply an optimal algorithm for finding friends-of-friends or nearest neighbors
   N is > 10^10, so what is the most efficient way to sort?
   Number of dimensions is ~1000; therefore, we have an enormous subspace search problem
– Scientist: determine the significance of the clusters (statistically and scientifically): categorize!
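A friends-of-friends grouping can be sketched with a union-find structure plus a grid hash, so that each point is compared only with points in adjacent cells. This is a toy version under stated assumptions: the function name and the `linking_length` parameter are illustrative, and the grid search visits 3^d neighboring cells, so it is practical only in low dimensions, which is exactly why the ~1000-dimensional case above is a hard subspace search problem.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def friends_of_friends(points, linking_length):
    """Group points so that any two points within linking_length share a group."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Hash every point into a grid cell whose side equals the linking length,
    # so friends can only live in the same cell or an adjacent one.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(c / linking_length) for c in p)].append(idx)

    ndim = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=ndim))
    for cell, members in cells.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and dist(points[i], points[j]) <= linking_length:
                        union(i, j)

    groups = defaultdict(list)
    for idx in range(len(points)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Hypothetical 2-D catalog: a chain of three points and a separate pair
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0)]
print(friends_of_friends(points, 0.6))
```

Points 0 and 2 are farther apart than the linking length but end up in the same group through point 1, which is the defining "friend of a friend" behavior.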

Characterization Use Case #3
Outlier detection: (unknown unknowns)
– Finding the objects and events that are outside the bounds of our expectations (outside known clusters)
– These may be real scientific discoveries or garbage
– Outlier detection is therefore useful for:
   Novelty Discovery: is my Nobel prize waiting?
   Anomaly Detection: is the detector system working?
   Data Quality Assurance: is the data pipeline working?
– How does one optimally find outliers in a D-dimensional parameter space? Or in interesting subspaces (in lower dimensions)?
– How do we measure their “interestingness”?
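One simple way to put a number on "interestingness" is the distance from each object to its k-th nearest neighbor: objects far from everything else score high. The sketch below assumes that scoring rule and a brute-force O(N^2) search; both are illustrative choices rather than the method proposed in the slides, and the function name is hypothetical.

```python
from math import dist


def knn_outlier_scores(points, k=2):
    """Score each point by its distance to its k-th nearest neighbor."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, ascending
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Hypothetical data: a tight clump of four points plus one stray object
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
scores = knn_outlier_scores(points, k=2)
most_unusual = max(range(len(points)), key=scores.__getitem__)
print(most_unusual, scores)
```

A high score says only that an object is isolated; deciding whether it is a discovery, a detector fault, or a pipeline error still requires follow-up.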

Characterization Use Case #4
The dimension reduction problem:
– Finding correlations and “fundamental planes” of parameters
– Number of attributes can be hundreds or thousands: The Curse of High Dimensionality!
– Are there combinations (linear or non-linear functions) of observational parameters that correlate strongly with one another?
– Are there eigenvectors or condensed representations (e.g., basis sets) that represent the full set of properties?
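The eigenvector question can be illustrated with the first principal component of a small table of measurements, found here by power iteration on the sample covariance matrix. This is a sketch under assumptions: it handles only linear combinations, the function name is hypothetical, and the fixed starting vector could fail in the contrived case where it is exactly orthogonal to the leading eigenvector.

```python
from math import sqrt


def leading_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component."""
    n, d = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two rows to estimate a covariance")
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d)
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]

    # Deterministic, non-symmetric starting vector
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along the current direction
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v: variance captured along the direction v
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                   for a in range(d))
    return v, variance


# Hypothetical two-parameter catalog in which the second parameter is twice the first
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, variance = leading_component(rows)
print(direction, variance)
```

For this perfectly correlated example the direction comes out proportional to (1, 2) and carries all of the variance, so the two parameters condense to one.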

The LSST Data Mining Challenges:
What’s the common theme?
– Need multi-wavelength data in all use cases!
– VO-accessible ancillary information is essential.
Requirements for success:
– Discovery of distributed data sources
– Access to distributed data sources
– Applying characterization and clustering (data mining) algorithms on distributed data: Unsupervised and Supervised Machine Learning

Data Bottleneck Mismatch:
– Data volumes increase 1000x in 10 years
– I/O bandwidth improves ~3x in 10 years
Therefore... Distributed Data Mining
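The size of the mismatch follows from simple arithmetic, assuming the time to read a whole dataset scales as volume divided by bandwidth (a simplified single-channel model; the function name is hypothetical).

```python
def scan_time_growth(volume_growth, bandwidth_growth):
    """Factor by which one full read of the data slows down over the same period."""
    return volume_growth / bandwidth_growth


# Growth factors over 10 years, as quoted on the slide
slowdown = scan_time_growth(volume_growth=1000, bandwidth_growth=3)
print(f"A full pass over the data takes about {slowdown:.0f}x longer after 10 years")
```

Under this model, moving all the data to one place becomes steadily less practical, which motivates moving the computation instead.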

Distributed Data Mining (DDM)
DDM comes in 2 types:
1. Mining of Distributed Data (MDD)
2. Distributed Mining of Data (DMD)
Type 1 takes many forms, with data being centralized (in whole or in partitions).
Type 2 requires sophisticated algorithms that operate with data in situ: Ship the Code to the Data.
– The computations are done on the data locally, with partial results shipped around to the different data nodes, and the DDM algorithm iterates until a solution is converged upon.
– This can be pipeline-initiated or scientist end-user-initiated.
Ultimate goal: Knowledge Discovery through Data Discovery
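The Type 2 pattern (local computation, small partial results shipped between nodes, iteration to convergence) can be sketched with a one-dimensional k-means in which each data node reports only per-centroid sums and counts. This is an illustration under assumptions: k-means is chosen here as a convenient example rather than taken from the slides, the "nodes" are simulated as in-process lists, and all function names are hypothetical.

```python
def local_stats(partition, centroids):
    """Runs on a data node: per-centroid [sum, count] for that node's own points."""
    stats = [[0.0, 0] for _ in centroids]
    for x in partition:
        nearest = min(range(len(centroids)), key=lambda c: abs(x - centroids[c]))
        stats[nearest][0] += x
        stats[nearest][1] += 1
    return stats


def distributed_kmeans(partitions, centroids, tol=1e-9, max_iter=100):
    """Iterate until the centroids stop moving; raw data never leaves its node."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Each node computes its partial result; only these summaries are shipped
        partials = [local_stats(p, centroids) for p in partitions]
        updated = []
        for c, old in enumerate(centroids):
            total = sum(part[c][0] for part in partials)
            count = sum(part[c][1] for part in partials)
            updated.append(total / count if count else old)
        shift = max(abs(a - b) for a, b in zip(updated, centroids))
        centroids = updated
        if shift <= tol:  # converged
            break
    return centroids


# Two hypothetical data nodes, each holding part of the same catalog
partitions = [[1.0, 1.2, 9.8], [0.8, 10.0, 10.2]]
result = distributed_kmeans(partitions, centroids=[0.0, 5.0])
print(result)
```

Each round ships only k sums and k counts per node, regardless of how many objects the node holds, which is the point of sending the code to the data.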