System Support for High Performance Data Mining Ruoming Jin Leo Glimcher Xuan Zhang Gagan Agrawal Department of Computer and Information Sciences Ohio.

System Support for High Performance Data Mining
Ruoming Jin, Leo Glimcher, Xuan Zhang, Gagan Agrawal
Department of Computer and Information Sciences
Ohio State University

Motivation: Data Mining Problem
- Datasets available for mining are often large
- Our understanding of which algorithms and parameters will give the desired insights is limited
- The time required to implement different algorithms and run them with different parameters on large datasets slows down the data mining process

Project Overview
- FREERIDE (Framework for Rapid Implementation of Datamining Engines) is the base system
- Already demonstrated for a variety of standard mining algorithms
- Currently being applied to end applications such as feature analysis and mining of simulation data

FREERIDE offers:
- The ability to rapidly prototype a high-performance mining implementation
- Distributed memory parallelization
- Shared memory parallelization
- Ability to process large and disk-resident datasets
- Only modest modifications to a sequential implementation for the above three

Key Observation from Mining Algorithms
- Popular algorithms have a common canonical loop
- This loop can be used as the basis for supporting a common middleware

    While( ) {
        forall( data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
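As an illustration of the canonical loop, here is a minimal Python sketch that uses one-dimensional k-means as the example algorithm. It is not FREERIDE's interface: the names `reduction_pass` and `kmeans_1d`, the choice of a dictionary as the reduction object R, and the fixed iteration count standing in for the elided While condition are all assumptions made for this example.

```python
def reduction_pass(data, process, op, reduction):
    """One pass of the canonical loop over all data instances."""
    for d in data:
        i = process(d)                       # I = process(d)
        reduction[i] = op(reduction[i], d)   # R(I) = R(I) op d
    return reduction


def kmeans_1d(points, centers, iterations=10):
    """1-D k-means written on top of reduction_pass (illustrative only)."""
    centers = list(centers)
    for _ in range(iterations):              # stands in for the outer While loop
        # Reduction object: one (sum, count) pair per cluster.
        r = {k: (0.0, 0) for k in range(len(centers))}

        def nearest(x):
            # process(d): index of the closest current center
            return min(range(len(centers)), key=lambda k: abs(x - centers[k]))

        def add(acc, x):
            # op: associative and commutative accumulation
            return (acc[0] + x, acc[1] + 1)

        reduction_pass(points, nearest, add, r)
        # Post-processing of the reduction object yields the new centers.
        centers = [s / n if n else centers[k] for k, (s, n) in r.items()]
    return centers


if __name__ == "__main__":
    print(kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0]))
```

Because `op` is associative and commutative, the order in which instances are processed does not affect the final reduction object. That property is what allows the same loop body to be run over partitions of the data and the partial results combined afterwards.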

Performance of Shared Memory Parallelization
- K-means clustering

Performance on Cluster of SMPs
- Apriori association mining
- Experiments on a cluster of Sun SMPs, purchased under an NSF RI grant to the University of Delaware in 1997
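To show how association mining can fit the reduction pattern on a cluster, the following sketch counts candidate itemsets separately on each data partition and then sums the partial counts. It is only an assumption-laden illustration: the partitions are processed sequentially in one process rather than on separate nodes, only 2-itemsets are considered instead of Apriori's full level-wise candidate generation, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, candidates):
    """Local reduction: count candidate itemsets within one partition."""
    counts = Counter()
    for t in transactions:
        items = set(t)
        for c in candidates:
            if items.issuperset(c):
                counts[c] += 1
    return counts


def global_combine(partials):
    """Global reduction: add up the per-partition counts."""
    total = Counter()
    for p in partials:
        total.update(p)   # Counter.update adds counts rather than replacing them
    return total


def frequent_pairs(partitions, min_support):
    """Return the 2-itemsets whose total count reaches min_support."""
    items = sorted({i for part in partitions for t in part for i in t})
    candidates = [frozenset(p) for p in combinations(items, 2)]
    partials = [local_counts(part, candidates) for part in partitions]
    totals = global_combine(partials)
    return {c: n for c, n in totals.items() if n >= min_support}


if __name__ == "__main__":
    node_a = [["a", "b", "c"], ["a", "b"]]
    node_b = [["a", "b"], ["b", "c"]]
    print(frequent_pairs([node_a, node_b], min_support=2))
```

Only the small per-partition count tables need to be exchanged between nodes, not the transactions themselves.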

SPIES On (a) FREERIDE
- Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
- Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
- Does not require sorting of data, or partitioning and writing-back of records
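The slide gives no algorithmic detail, so the sketch below is not SPIES itself. Under stated assumptions, it only illustrates the underlying idea of summarizing a numerical attribute by class histograms over fixed-width intervals, built in a single unsorted pass, and then scoring candidate splits at the interval boundaries with the Gini index. The statistical pruning step is omitted entirely, and the function names, the equal-width intervals, and the use of Gini are choices made for this example.

```python
def interval_histograms(records, lo, hi, n_intervals, n_classes):
    """One unsorted pass: class counts per interval of a numerical attribute."""
    width = (hi - lo) / n_intervals
    hist = [[0] * n_classes for _ in range(n_intervals)]
    for value, label in records:
        idx = min(int((value - lo) / width), n_intervals - 1)
        hist[idx][label] += 1          # a reduction: hist[idx] = hist[idx] + record
    return hist


def gini(counts):
    """Gini impurity of a class-count vector (0.0 for an empty vector)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def best_boundary_split(hist, lo, hi):
    """Score splits at interval boundaries only; return (weighted_gini, threshold)."""
    n_intervals, n_classes = len(hist), len(hist[0])
    width = (hi - lo) / n_intervals
    total = [sum(col) for col in zip(*hist)]
    n = sum(total)
    left = [0] * n_classes
    best = (float("inf"), None)
    for i in range(n_intervals - 1):
        left = [l + h for l, h in zip(left, hist[i])]
        right = [t - l for t, l in zip(total, left)]
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        if score < best[0]:
            best = (score, lo + (i + 1) * width)
    return best


if __name__ == "__main__":
    data = [(1.0, 0), (2.0, 0), (3.0, 0), (7.0, 1), (8.0, 1), (9.0, 1)]
    h = interval_histograms(data, 0.0, 10.0, n_intervals=5, n_classes=2)
    print(h, best_boundary_split(h, 0.0, 10.0))
```

The histogram's size depends on the number of intervals and classes, not on the number of records, which is why interval-based summaries can reduce both memory use and the volume of data exchanged between nodes.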

Results from EM Clustering Algorithm
- EM is a popular data mining algorithm
- Can we parallelize it using the same support that worked for another clustering algorithm (k-means) and for algorithms for other mining tasks?
- Research supported by an NSF REU supplement (Leo Glimcher, computer science senior)
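One way to see why EM might reuse the same support is that its E-step can be written as a reduction over the data that accumulates sufficient statistics, followed by a small M-step on the accumulated values. The sketch below does this for a one-dimensional Gaussian mixture. It is an assumed formulation for illustration, not the implementation behind the reported results; the function names, the unit initial variances, and the variance floor `min_var` are all choices made here.

```python
import math


def e_step_reduce(points, weights, means, variances):
    """Reduction over the data: per component, accumulate sum of
    responsibilities, sum of resp * x, and sum of resp * x^2."""
    k = len(means)
    r = [[0.0, 0.0, 0.0] for _ in range(k)]
    for x in points:
        dens = [
            w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        ]
        total = sum(dens)  # assumed non-zero: each point lies near some component
        for j in range(k):
            resp = dens[j] / total
            r[j][0] += resp
            r[j][1] += resp * x
            r[j][2] += resp * x * x
    return r


def m_step(r, n, min_var=1e-6):
    """Turn accumulated statistics into new parameters.
    Assumes every component received some responsibility (s0 > 0)."""
    weights = [s0 / n for s0, _, _ in r]
    means = [s1 / s0 for s0, s1, _ in r]
    variances = [max(s2 / s0 - (s1 / s0) ** 2, min_var) for s0, s1, s2 in r]
    return weights, means, variances


def em_1d(points, means, iterations=20):
    """Run a fixed number of EM iterations from the given initial means."""
    k = len(means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    for _ in range(iterations):
        r = e_step_reduce(points, weights, means, variances)
        weights, means, variances = m_step(r, len(points))
    return weights, means, variances


if __name__ == "__main__":
    print(em_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], [0.0, 6.0]))
```

Since the accumulated statistics are plain sums, partial results computed on different data partitions can be added together before the M-step, in the same way as the k-means sums and counts.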

Broader Research Agenda

Applying FREERIDE for Scientific Data Mining
- Joint work with Machiraju and Parthasarathy
- Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
- A feature is a region of interest in a dataset
- A suite of algorithms for extracting and tracking features

Future Work
- Support for multiple and/or heterogeneous clusters
- Combine with DataCutter's filter-streaming programming model
- Still allow a high-level interface for mining algorithms
- Use FREERIDE support for reductions on SMPs
- More scalable shared memory parallelization
- Impact of modern memory hierarchies

Publications and Funding

Refereed publications
- Ruoming Jin and Gagan Agrawal, "A Middleware for Scalable Data Mining", Proceedings of the First SIAM Conference on Data Mining, April 2001
- Ruoming Jin and Gagan Agrawal, "Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance", Proceedings of the Second SIAM Conference on Data Mining, April 2002
- Ruoming Jin and Gagan Agrawal, "Performance Prediction of Random Write Reductions", Proceedings of the ACM SIGMETRICS Conference, June 2002
- Ruoming Jin and Gagan Agrawal, "Memory and Communication Efficient Parallel Decision Tree Construction", Proceedings of the Third SIAM Conference on Data Mining, May 2003

Funding
- Research funding provided by NSF through CAREER award ACI , ACI , ACI , ACI , and ACI
- Equipment provided under NSF grants EIA (RI grant at Delaware) and EIA (Instrumentation grant at OSU)
- Participation of Leo Glimcher facilitated by an NSF REU supplement