From D2K to SEASR Overview September 27, 2007 Loretta Auvil Automated Learning Group, NCSA University of Illinois, Urbana-Champaign
ALG Mission The specific mission of the Automated Learning Group is: To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making To work closely with industrial, government, and academic partners to explore new application areas for such methods, and To transfer the resulting software technology into real world applications
Required Effort for each KDD Step Arrows indicate the direction we want the effort to go.
Three Primary Paradigms Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired. –Classification is the prediction of predefined classes e.g. Naive Bayesian, Decision Trees, and Neural Networks –Regression is the prediction of continuous data e.g. Neural Networks, and Decision (Regression) Trees Discovery – unsupervised learning approach for exploratory data analysis. –e.g. Association Rules, Link Analysis, Clustering, and Self Organizing Maps Deviation Detection – identifying outliers in the data. –e.g. Visualization
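To make the three paradigms concrete, here is a minimal sketch in Python using only the standard library. It is not D2K code. The function names, the toy data, and the choice of a one-attribute decision stump, a two-cluster 1-D k-means, and a z-score test are all assumptions chosen for brevity. In particular, the z-score test stands in for the visual inspection the slide mentions for deviation detection.

```python
from collections import Counter
from statistics import mean, pstdev


def fit_stump(values, labels):
    """Predictive modeling: learn one threshold on a numeric attribute
    (a one-level decision tree) that minimizes training errors."""
    best = None
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda v: left_label if v <= t else right_label


def two_means(values, iterations=10):
    """Discovery: group unlabeled 1-D values around two cluster centers."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        near_lo = [v for v in values if abs(v - lo) <= abs(v - hi)]
        near_hi = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo = mean(near_lo)
        hi = mean(near_hi) if near_hi else hi
    return lo, hi


def zscore_outliers(values, threshold=2.0):
    """Deviation detection: flag values far from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Toy data, invented for illustration only.
predict = fit_stump([22, 25, 27, 45, 48, 52],
                    ["no", "no", "no", "yes", "yes", "yes"])
centers = two_means([1.0, 1.2, 0.8, 9.0, 9.5, 10.5])
outliers = zscore_outliers([10, 11, 9, 10, 12, 10, 11, 50])
```

The supervised function needs labels, the clustering function does not, and the outlier function looks only at how far each value sits from the rest. That difference in inputs is the distinction the slide draws.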
D2K - Framework for Data Analysis Provides a scalable environment from the Desktop to Web Services Employs a visual programming system for the data/work flow paradigm Provides capability to build custom applications Provides capability to access data management tools Contains data mining algorithms for prediction and discovery Provides data transformations for standard operations Integrated environment for models and visualization Supports an extensible interface for creating one's own algorithms Provides access to distributed computing capabilities
D2K Components D2K Infrastructure: Itinerary Execution engine. D2K-Driven Applications: Applications that make use of the D2K Infrastructure; the Toolkit is a D2K-Driven app. D2K Server: Special kind of D2K-Driven app; wraps the infrastructure to provide remote itinerary and module execution; used by the Toolkit to distribute module execution. D2K Web Service: Provides a generic programmatic interface for executing itineraries; communicates with D2K Servers over socket connections using D2K-specific protocols.
D2K Streamline (D2K SL) Provides step by step interface to guide user in data analysis Supports return to earlier steps to run different parameters Uses the D2K infrastructure transparently Uses same D2K modules Provides way to capture different experiments Define templates that can be reused in different experiments
D2K Web Service Architecture Any web-enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP. Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers. Job results are also stored in the web service tier and are returned to clients upon request. A relational database is used by the web service to look up accounts, itineraries, servers, and jobs. Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
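As an illustration of what "sending SOAP messages over HTTP" involves on the client side, the following Python sketch builds a SOAP 1.1 envelope. Only the SOAP envelope namespace is standard. The `urn:example:d2k-web-service` namespace, the `runItinerary` operation, and the element, account, and itinerary names are hypothetical placeholders, since the slides do not show the actual D2K Web Service interface.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the real D2K one is not given in the slides.
D2K_NS = "urn:example:d2k-web-service"


def build_run_request(account, itinerary, parameters):
    """Build a SOAP request asking the service to run a stored itinerary.

    The operation and element names are assumptions for illustration.
    """
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    request = ET.SubElement(body, f"{{{D2K_NS}}}runItinerary")
    ET.SubElement(request, f"{{{D2K_NS}}}account").text = account
    ET.SubElement(request, f"{{{D2K_NS}}}itinerary").text = itinerary
    for name, value in parameters.items():
        param = ET.SubElement(request, f"{{{D2K_NS}}}parameter", name=name)
        param.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Placeholder values; a client would POST this string over HTTP.
message = build_run_request("demo-user", "naive-bayes.itn", {"bins": 10})
```

The request names an itinerary instead of carrying one. This follows the slide's description that itineraries and modules live on the web service machine and are loaded from there by the D2K Servers.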
Creating Customer Value Prediction: Industrial manufacturer. Computed customer buying propensities. Achieved 25% conquest customer sales lift by executing directed cross/upsell, resulting in $65 million in incremental revenue. Discovery: Automotive manufacturer. Identified patterns of inappropriate warranty work in dealer channel. Targeted $200M+ of potentially unnecessary annual expense. Monitoring: Department store retailer. Watched POS transaction flow for unusual variations. Deterred inappropriate behavior and fraudulent transactions. Resulted in savings of over $125 million.
Applications Examples Comparative Genomics: Harris A. Lewin explains that Evolution Highway allows one to look "... at the whole genome at once - multiple chromosomes across multiple species. The insights wouldn't have come so quickly if we couldn't throw the data at this framework from NCSA." Science, Vol. 309, Issue 5734, Pages 613-617, 22 July 2005. Astronomy: Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, and David Tcheng, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees, The Astrophysical Journal, Vol. 650, Part 1, Pages 497-509, 2006. Music Analysis: J. Stephen Downie, The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future, Computer Music Journal, Vol. 28, No. 2, Pages 12-23, Summer 2004.
NCSA D2K Lineage (timeline figure, 1996 to 2009). Products: D2K / Data to Knowledge; D2K Streamline; T2K / ThemeWeaver; I2K / Image to Knowledge; M2K / Music to Knowledge; MAIDS / Mining Alarming Incidents from Data Streams; RiverGlass Detect; RiverGlass Recon. Research, Technology, Applications: Data Mining, Text Mining, Image Mining, Audio Mining, Stream Mining, Motion Mining, Visualization, Interface, Fed. Query, Inference Eng., Web Acquire, Full Multi-language, Multimedia, Sensors/RFID, Music Analysis, GeoSpatial, Future. Engagements: F100 Insurance, F100 EquipMfg, F100 CommMfg, F100 Retailer, F100 AutoMfg, F100 AircraftMfg, F100 Oil Co, F100 AgResearch, F100 EmergPlan, F500 Insurance, State Agcy, Fedl Agcy, Fedl SI, GovTech, Law Enforcement, EmergMgmt, Higher Educ. Companies: RiverGlass, Inc.; One Llama Media.
D2K ToolKit 1. Workspace 2. Resource Panel 3. Modules 4. Models 5. Itineraries 6. Visualizations 7. Generated Visualizations 8. Generated Models 9. Component Information 10. Toolbar 11. Console
D2K Basic Set of D2K Modules to perform data mining techniques –Prediction Decision Trees –C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree Naïve Bayesian Classification and SQL Naïve Bayesian Classification Neural Networks –Discovery Rule Association –Apriori, FP Growth, Htree Clustering –Hierarchical Agglomerative, Kmeans, Coverage, etc. Includes visualizations for many of the modeling approaches Includes a set of data transformations –Attribute selection, binning, filtering, attribute construction Includes optimization strategy for searching parameter space
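Of the discovery algorithms listed, Apriori is compact enough to sketch. The Python below is a textbook-style illustration of level-wise frequent-itemset mining, not the D2K module. The function name, the absolute `min_support` count, and the basket data are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least
    min_support transactions, built level by level."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into size-k candidates, pruning any
        # candidate that has an infrequent (k-1)-subset (the Apriori property).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count support of surviving candidates against the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Invented market-basket data for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

Association rules would then be derived from these frequent itemsets by comparing the support of an itemset with that of its subsets. That step is omitted here.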
D2K Modules Input Module: Loads data from the outside world. –Flat files, database, etc. Data Prep Module: Performs functions to select, clean, or transform the data –Binning, Normalizing, Feature Selection, etc. Compute Module: Performs main algorithmic computations. –Naïve Bayesian, Decision Tree, Apriori, FP Growth, etc. User Input Module: Requires interaction with the user. –Data Selection, Input and Output selection, etc. Output Module: Saves data to the outside world. –Flat files, databases, etc. Visualization Module: Provides visual feedback to the user. –Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
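The module roles above compose into a data flow: input, then data prep, then compute, then output. The Python sketch below imitates that flow with plain functions. It is an analogy, not the D2K module API (D2K modules are wired together visually in an itinerary). All function names, the CSV columns, and the bin edges are assumptions.

```python
import csv
import io
from functools import partial


def input_module(text):
    """Input module: load rows from CSV text (stands in for a flat file)."""
    return list(csv.DictReader(io.StringIO(text)))


def binning_module(rows, column, edges):
    """Data prep module: replace a numeric column with a bin label."""
    binned = []
    for row in rows:
        value = float(row[column])
        bin_index = sum(value >= edge for edge in edges)
        binned.append({**row, column: f"bin{bin_index}"})
    return binned


def count_module(rows, attribute, target):
    """Compute module: count class labels per attribute value, the basic
    statistic a naive Bayesian classifier is built from."""
    counts = {}
    for row in rows:
        key = (row[attribute], row[target])
        counts[key] = counts.get(key, 0) + 1
    return counts


def output_module(counts):
    """Output module: serialize the counts back to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["value", "class", "count"])
    for (value, label), n in sorted(counts.items()):
        writer.writerow([value, label, n])
    return buffer.getvalue()


def run_itinerary(data, modules):
    """Pass each module's output to the next module's input."""
    for module in modules:
        data = module(data)
    return data


# Module "properties" are fixed before execution with partial().
itinerary = [
    input_module,
    partial(binning_module, column="age", edges=[30, 50]),
    partial(count_module, attribute="age", target="buys"),
    output_module,
]
report = run_itinerary("age,buys\n23,no\n31,yes\n45,yes\n52,no\n", itinerary)
```

User input and visualization modules are left out because they need an interactive front end. In this analogy they would be further stages in the same chain.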
D2K Module Icon Description Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not. Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent. Properties Symbol: If a P is shown in the lower left corner of the module, then the module has properties that can be set before execution. Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
SEASR: Research, Development, & Technology Transfer Model
SEASR: The Data Problem Structured vs. Unstructured Timeline figure (axis labels: 1999, GIGABYTES): Cave paintings, Bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) Late 1960s; The Web 1993. 20% Structured Data, 80% Unstructured Data. Today, 80% of business is conducted on unstructured information – Gartner Group. 80% of the information needed is in the Open Source – NIA. Workers spend 80% of the time gathering information – STIC, EMF. www.fastsearch.com
SEASR Software Environment for the Advancement of Scholarly Research (SEASR) –addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world. –aims to make collections more useful by integrating two well-known research and development frameworks, NCSA's Data-To-Knowledge (D2K) and IBM's Unstructured Information Management Architecture (UIMA), into a usable environment that researchers in any discipline can easily learn and adapt for their own unstructured data analysis.
SEASR: Architecture SEASR's advanced informatics tools will expand the technical capabilities of what is now available in the field by: connecting data sources that are currently incompatible, whether due to different formats or protocols; offering all project components as open source, to enable users to modify and add to tools; allowing users to write analytic engines in their programming language of choice; installing on all hardware footprints, so that the tools can be brought to data sets where they are housed; creating a repository for components that will support sharing and publishing among users; enabling scalability so that components may run on a large variety of hardware footprints, including shared memory processors and clusters
FeatureLens: n-gram patterns. Created by Anthony Don at http://www.cs.umd.edu/hcil/textvis/featurelens/.
Getting the Band Together June 2007 – Band formation –Project start date –More use ideas and framework discussions December – First gig –Framework and data app demonstration Vocals - Research Technology –John Unsworth, Stephen Downie, Tim Wentling –Dan Roth, Jiawei Han, Kevin Chang, Cheng Xiang Zhai Percussion & Bass - SEASR Development –Loretta Auvil, Tara Bazler, Duane Searsmith, Andrew Shirk, Students Lead – Designers/Developers/Application Areas –Humanities – M2K, Nora/Monk and Others (we heard about yesterday/today) Need Groupies! (Advisors, Researchers, Developers, and Application Drivers) – Loretta Auvil
SEASR: How can I participate? Collaborate on application development or ontology creation Contribute to component development for analytics or data access Participate in visualization and UI design Serve as an advisor Contact Loretta Auvil (email@example.com)
SEASR Engineering Knowledge for the Humanities Thank You