
Loretta Auvil Automated Learning Group National Center for Supercomputing Applications University of Illinois 217.265.8021 lauvil@ncsa.uiuc.edu D2K – Data To Knowledge March 19, 2004 Duke University

alg | Automated Learning Group Outline Overview of Data Mining Overview of D2K Functionality D2K Toolkit MAIDS – Mining Streaming Data D2K Driven Application ThemeWeaver – Mining Text Data MAEViz – Visualizing Earthquake Damage Analysis D2K Streamline (SL) EMO – Finding Optimal Decisions D2K Web Service Phylomat – Finding Motifs in Sequences

alg | Automated Learning Group ALG Mission The specific mission of the Automated Learning Group is: To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making To work closely with industrial, government, and academic partners to explore new application areas for such methods, and To transfer the resulting software technology into real world applications

alg | Automated Learning Group ALG Research, Development, & Technology Transfer Model

alg | Automated Learning Group What Is It? Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data The understandable patterns are used to: Make predictions about or classifications of new data Explain existing data Summarize the contents of a large database to support decision making Create graphical data visualizations to aid humans in discovering complex patterns Overview of Knowledge Discovery

alg | Automated Learning Group Why Do We Need Data Mining? Data volumes are too large for classical analysis approaches: Large number of records (10^8 – 10^12 bytes) High dimensional data (10^2 – 10^4 attributes) How do you explore millions of records, tens or hundreds or thousands of fields, and find patterns? As databases grow, using traditional query languages for the decision support process becomes infeasible Many queries of interest are difficult to state in a query language (query formulation problem) “Find all cases of fraud” “Find all individuals likely to buy a Ford Explorer” “Find all documents that are similar to this customer's problem” Overview of Knowledge Discovery

alg | Automated Learning Group Knowledge Discovery Process Overview of Knowledge Discovery

alg | Automated Learning Group Required Effort for each KDD Step Arrows indicate the direction we want the effort to go Overview of Knowledge Discovery

alg | Automated Learning Group Three Primary Paradigms Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired Classification is the prediction of predefined classes – Naïve Bayesian, Decision Trees, and Neural Networks Regression is the prediction of continuous data – Neural Networks and Decision (Regression) Trees Discovery – unsupervised learning approach for exploratory data analysis Association Rules and Link Analysis Clustering and Self Organizing Maps Deviation Detection – identifying outliers in the data Visualization Overview of Knowledge Discovery
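
As a rough illustration of the classification paradigm named on this slide, the sketch below trains a small categorical Naïve Bayesian classifier with add-one smoothing. It is not taken from D2K; the function names (train, predict) and the toy weather data are assumptions made only for this example.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    values_seen = defaultdict(set)       # attribute index -> distinct values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
            values_seen[i].add(value)
    return class_counts, value_counts, values_seen

def predict(model, row):
    """Return the class with the highest smoothed log posterior."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = value_counts[(label, i)][value] + 1
            denominator = count + len(values_seen[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data (hypothetical): (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train(rows, labels)
print(predict(model, ("rainy", "cool")))  # yes
```

The predefined classes here are "yes" and "no"; a regression method would instead predict a continuous value.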

alg | Automated Learning Group Importance of Data Mining Framework Provides capability to build custom applications Provides access to data management tools Contains data mining algorithms for prediction and discovery Provides data transformations for standard operations Supports an extensible interface for creating one’s own algorithms Provides means for building and applying models Provides integrated visualization components Provides access to distributed computing capabilities

alg | Automated Learning Group D2K - Data To Knowledge D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization D2K Overview

alg | Automated Learning Group D2K and Its Many Components D2K Infrastructure D2K API, data flow environment, distributed computing framework and runtime system D2K Modules Computational units written in Java that follow the D2K API D2K Itineraries Modules that are connected to form an application D2K Toolkit User interface for specification of itineraries and execution that provides the rapid application development environment D2K-Driven Applications Applications that use D2K modules with a custom user interface D2K Streamline (SL) Task driven system that uses D2K modules D2K Web/Grid Services Enables web deployment D2K Overview
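
The relationship between modules and itineraries can be sketched as a small data flow graph. The code below is only an illustration of the data flow idea, written in Python rather than Java; the Module and Itinerary classes and their methods are hypothetical and are not the D2K API.

```python
class Module:
    """A computational unit: a function plus the number of inputs it expects."""
    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.n_inputs = n_inputs

class Itinerary:
    """Modules connected by edges; a module fires once all its inputs arrive."""
    def __init__(self):
        self.modules = {}
        self.edges = []  # (source name, target name, target input port)

    def add(self, module):
        self.modules[module.name] = module

    def connect(self, source, target, port=0):
        self.edges.append((source, target, port))

    def run(self):
        inputs = {n: [None] * m.n_inputs for n, m in self.modules.items()}
        filled = {n: 0 for n in self.modules}
        results = {}
        # Modules without inputs (for example, data loaders) can fire at once.
        ready = [n for n, m in self.modules.items() if m.n_inputs == 0]
        while ready:
            name = ready.pop(0)
            results[name] = self.modules[name].func(*inputs[name])
            for source, target, port in self.edges:
                if source == name:
                    inputs[target][port] = results[name]
                    filled[target] += 1
                    if filled[target] == self.modules[target].n_inputs:
                        ready.append(target)
        return results

itinerary = Itinerary()
itinerary.add(Module("load", lambda: [4, 8, 12], 0))               # input module
itinerary.add(Module("halve", lambda xs: [x // 2 for x in xs], 1)) # data prep module
itinerary.add(Module("total", lambda xs: sum(xs), 1))              # compute module
itinerary.connect("load", "halve")
itinerary.connect("halve", "total")
results = itinerary.run()
print(results["total"])  # 12
```

Each module runs only when every one of its input ports has received data, which is the essence of a data flow environment.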

alg | Automated Learning Group D2K Toolkit Major features that D2K provides to an application developer include: Visual programming system employing a data flow paradigm Scalable distributed computing capabilities Flexible and extensible software development environment Multi-layered learning strategies Integrated environment for models and visualization Web service capabilities for deployment D2K Overview

alg | Automated Learning Group D2K Modules Input Module: Loads data from the outside world Flat files, database, etc. Data Prep Module: Performs functions to select, clean, or transform the data Binning, Normalizing, Feature Selection, etc. Compute Module: Performs main algorithmic computations Naïve Bayesian, Decision Tree, Apriori, etc. User Input Module: Requires interaction with the user Data Selection, Input and Output selection, etc. Output Module: Saves data to the outside world Flat files, databases, etc. Visualization Module: Provides visual feedback to the user Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot D2K Overview
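
Two of the data preparation operations listed above, normalizing and binning, can be sketched as follows. This is a minimal illustration assuming min-max normalization and equal-width bins; the slide does not say which variants the D2K modules implement, and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def equal_width_bins(values, n_bins):
    """Assign each value the index of one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [20, 30, 40, 60]
print(min_max_normalize(ages))    # [0.0, 0.25, 0.5, 1.0]
print(equal_width_bins(ages, 4))  # [0, 1, 2, 3]
```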

alg | Automated Learning Group D2K Module Icon Description Module Progress Bar Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not Input Port Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent Properties Symbol If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution Output Port Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent D2K Overview

alg | Automated Learning Group MAIDS: Mining Alarming Incidents in Data Streams Stream Characteristics Huge volumes of continuous data, possibly infinite Fast changing, requiring fast, real-time response The data stream model captures many of today's data processing needs Random access is expensive, so algorithms use a single linear scan (only one look at the data) Store only a summary of the data seen thus far Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing Current ALG Projects
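
The single-scan, summary-only constraint can be illustrated with a running mean and variance (Welford's method) that flags values far from what has been seen so far. This is a generic sketch, not the MAIDS algorithm; the class name, the threshold of three standard deviations, and the warm-up length are assumptions chosen for the example.

```python
import math

class RunningSummary:
    """Constant-size summary of a stream: count, mean, and sum of squared deviations."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        # Welford's update: one pass, no stored history.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def alarms(stream, threshold=3.0, warmup=5):
    """Return positions whose value deviates strongly from the summary so far."""
    summary = RunningSummary()
    flagged = []
    for i, x in enumerate(stream):
        if summary.n >= warmup:
            spread = summary.std()
            if spread > 0 and abs(x - summary.mean) > threshold * spread:
                flagged.append(i)
        summary.update(x)  # each value is looked at exactly once
    return flagged

print(alarms([10, 11, 9, 10, 12, 10, 50, 11]))  # [6]
```

Memory use stays constant regardless of stream length, because only three numbers are kept.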

alg | Automated Learning Group MAIDS Using D2K Toolkit

alg | Automated Learning Group Text Mining Information Retrieval Indexing and retrieval of textual documents Finding a set of (ranked) documents that are relevant to the query Information Extraction Extraction of partial knowledge in the text Web Mining Indexing and retrieval of textual documents and extraction of partial knowledge using the web Classification Predict a class for each text document Clustering Generating collections of similar text documents Current ALG Projects
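
The information retrieval task above, finding a ranked set of documents relevant to a query, can be sketched with TF-IDF weights and cosine similarity. This is one common approach and is not claimed to be what T2K or ThemeWeaver uses; the tokenizer, the smoothed IDF formula, and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    df = Counter()  # number of documents containing each term
    for tokens in doc_tokens:
        df.update(set(tokens))

    def vector(tokens):
        tf = Counter(tokens)
        # Smoothed inverse document frequency keeps weights positive.
        return {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}

    def cosine(a, b):
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vector(tokenize(query))
    scores = [cosine(q, vector(tokens)) for tokens in doc_tokens]
    return sorted(range(n), key=lambda i: -scores[i])

documents = [
    "earthquake damage to bridges",
    "mining text documents for themes",
    "genetic algorithm optimization",
]
print(rank("text mining", documents)[0])  # 1
```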

alg | Automated Learning Group Text Mining: Views from T2K and ThemeWeaver Using D2K Driven Application

alg | Automated Learning Group MAEViz: Damage Synthesis Visualization Using D2K Driven Application Displays terrain map Loads hazard, inventory, and fragility data Shows contour map of ground acceleration (hazard) Displays cones/bars to indicate level of damage Overlays shapefiles of different information Uses VTK for 3D Uses CUBE at BI

alg | Automated Learning Group D2K Streamline (D2K SL) Provides step by step interface to guide user in data analysis Supports return to earlier steps to run with different parameters Uses the D2K infrastructure transparently Uses same D2K modules Provides way to capture different experiments D2K SL

alg | Automated Learning Group EMO – Evolutionary Multiobjective Optimization Using D2K SL Identify tradeoffs among complex objectives Apply a genetic algorithm (GA) optimization in a general framework Guide the user through discrete steps for defining decision variables, fitness functions, and constraints, and for setting up GA parameters
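
Identifying tradeoffs among objectives usually means finding the non-dominated (Pareto-optimal) candidates. The sketch below shows only that filtering step, assuming every objective is minimized; it is not a genetic algorithm and is not EMO's implementation, and the function names and sample points are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidates scored on two objectives, e.g. (cost, risk).
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
print(pareto_front(candidates))  # [(1, 5), (2, 3), (4, 1)]
```

A multiobjective GA would typically apply a test like this to each generation so that the population moves toward the tradeoff curve rather than a single optimum.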

alg | Automated Learning Group D2K Web Service Architecture Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP. Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers. Job results are also stored in the web service tier. Results are returned to clients upon request. A relational database is used by the web service to look up accounts, itineraries, servers, and jobs. Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
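
To give a sense of what a SOAP message sent by a client might look like, the sketch below builds a SOAP 1.1 envelope with the Python standard library. The SOAP envelope namespace is the standard one, but the operation name (runItinerary), its parameters, and the service namespace are hypothetical; the slides do not describe the actual D2K Web Service interface, and nothing is sent over the network here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 namespace
SERVICE_NS = "urn:example:d2k"  # hypothetical namespace, for illustration only

def build_request(operation, params):
    """Return a SOAP envelope whose body holds one operation element."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Hypothetical request asking the service to run a stored itinerary.
message = build_request("runItinerary", {"itinerary": "naive-bayes", "account": "demo"})
print(message)
```

A client would post a message like this over HTTP to the web service, which would then hand the job to a remote D2K Server.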

alg | Automated Learning Group Phylomat (Motif Analysis Tool for Phylogenomics) Using D2K Web Service

alg | Automated Learning Group The ALG Team Staff Loretta Auvil Peter Bajcsy Colleen Bushell Dora Cai David Clutter Lisa Gatzke Vered Goren Chris Navarro Greg Pape Tom Redman Duane Searsmith Andrew Shirk Anca Suvaiala David Tcheng Michael Welge Students Ritesh Agrawal Tyler Alumbaugh John Cassel Sang-Chul Lee Xiaolei Li Jeff Ng Scott Ramon Martin Urban Bei Yu Hwanjo Yu

alg | Automated Learning Group Licensing D2K Faculty, staff and students at US academic institutions will be able to license and use D2K for free by downloading from alg.ncsa.uiuc.edu Private Sector Partners who have provided funding for projects related to D2K will be able to license and use D2K for free Private Sector Partners who have not provided funding will be able to license and use D2K for a discounted fee Contact John McEntire Office of Technology Management 308 Ceramics Building, MC-243 105 South Goodwin Avenue Urbana, Illinois 61801-2901 (217) 333-3715 jmcentir@uiuc.edu
