Kansas State University
Department of Computing and Information Sciences
Kansas State University KDD Lab

Collaborative Filtering Intelligent Information Retrieval and the Grid
Friday 11 October 2002
William H. Hsu
Laboratory for Knowledge Discovery in Databases
Department of Computing and Information Sciences
Kansas State University
This presentation is:

Acknowledgements
Kansas State University Lab for Knowledge Discovery in Databases
  –Graduate research assistants: Haipeng Guo, Roby Joehanes
  –Other grad students: Prashanth Boddhireddy, Siddharth Chandak, Ben B. Perry, Rengakrishnan Subramanian
  –Undergraduate programmers: James W. Plummer, Julie A. Thornton
Joint Work with
  –KSU Bioinformatics and Medical Informatics (BMI) group: Sanjoy Das (EECE), Judith L. Roe (Biology), Stephen M. Welch (Agronomy)
  –KSU Microarray group: Scot Hulbert (Plant Pathology), J. Clare Nelson (Plant Pathology), Jan Leach (Plant Pathology)
  –Kansas Geological Survey, Kansas Biological Survey, KU EECS
Other Research Partners
  –NCSA Automated Learning Group (Michael Welge, Tom Redman)
  –University of Manchester (Carole Goble, Robert Stevens)
  –The Institute for Genomic Research (John Quackenbush, Alex Saeed)
  –International Rice Research Institute (Richard Bruskiewich)

Overview
Filtering
  –Collaborative filtering (CF) and relatives
  –Application to intelligent information retrieval (IR)
Computational Grids
  –High-Performance Computing (HPC) services
      Scientific data, metadata (ontologies, specifications), documentation
      Software tools (source codes, application servers)
      Experimental results
  –Grid initiatives: TeraGrid (USA), eScience (UK, EBI)
Challenge: Personalization of Services
Application: Bioinformatics
Methodology: Learning Relational Probabilistic Models
  –User modeling and collaborative filtering (CF)
  –DESCRIBER system: integrative CF for computational genomics
Current Research and Open Problems

Collaborative Filtering in Action: Amazon.com [1]
Figure (© 2002 Amazon.com, Inc.), labeled:
  –Cross-Selling (based upon Market Basket Analysis)
  –Collaborative Recommendation

Collaborative Filtering in Action: Amazon.com [2]
Figure (© 2002 Amazon.com, Inc.), labeled:
  –Classification and Regression based upon Historical Customer Data
  –Explanation from Recommender (Decision Support) System

Filtering and Recommendation Approaches
Collaborative
  –Collect: recorded decisions (actions) of user(s)
  –Infer: preferences of user(s)
  –Model: associational relationships among entities (e.g., purchases)
  –Use to: recommend similar decisions to users in similar context
Structural
  –Collect: recorded decisions (actions) of user(s)
  –Infer: preferences of user(s)
  –Model: causal relationships among entities (e.g., use cases)
  –Use to: make recommendation and explain
Content-Based: Driven by Key Word / Phrase
Collective: Driven by Consensus, Stochastic Mixture Model (e.g., “Swarm Intelligence”, Ant Colony Optimization)

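The collaborative approach above (collect decisions, model associations, recommend similar decisions to users in a similar context) can be sketched in a few lines. This is a minimal illustration only, not the method used by any system named in these slides: the purchase records, the `cosine` and `recommend` names, and the choice of cosine similarity over binary purchase vectors are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical purchase records: user -> set of items (illustrative data only).
PURCHASES = {
    "ann": {"book_a", "book_b", "book_c"},
    "bob": {"book_a", "book_b", "book_d"},
    "cat": {"book_e"},
}


def cosine(a, b):
    """Cosine similarity between two item sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(user, purchases, k=3):
    """Rank items the user lacks by similarity-weighted votes of other users."""
    mine = purchases[user]
    scores = {}
    for other, items in purchases.items():
        if other == user:
            continue
        sim = cosine(mine, items)
        if sim == 0.0:
            continue  # users with no overlap contribute no evidence
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + sim
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


if __name__ == "__main__":
    print(recommend("ann", PURCHASES))
```

A user with no overlapping history (here "cat") receives no recommendation, which is the usual cold-start limitation of purely associational models.
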
A Filtering Problem: Text Mining for Information Retrieval (IR)
Figure: ThemeScapes, news stories from the WWW in 1997 (© 1999 SPIRIX software)

Another Filtering Application: Commercial Fraud Monitoring

Stages of Data Mining and Knowledge Discovery in Databases
Adapted from Fayyad, Piatetsky-Shapiro, and Smyth (1996)

NCSA D2K: Visual Programming System for Rapid Application Development in KDD
Figure: Data to Knowledge (D2K), © 2002 NCSA

NCSA D2K Workflow: Decision Support in Insurance Pricing
Hsu, Welge, Redman, Clutter (2002) Data Mining and Knowledge Discovery, 6(4):

Computational Grids [1]: High-Performance Distributed Computing
What is The Grid?
  –Infrastructure: Distributed Processing, Networks, Software
  –Paradigm for Very Large-Scale Scientific Computing
End Users of The Grid – Adapted from Goble (2002)
  –Providers
      Tool builders
      Systems/network administrators, service providers, etc.
  –Researchers
      Scientific discipline – e.g., Biology
      Computational Science and Engineering (CSE) – e.g., Bioinformatics
      Patent Intelligence!
  –“End users”
      Developers: e.g., pharmaceutical
      Medical doctors, patients

Computational Grids [2]: Personalization of Services
What Services?
  –High-Performance Computing (HPC) facilities
      Compute clusters (Beowulf, NT, etc.)
      Massively distributed networks
  –Software
  –Scientific data servers
Metadata
  –Ontologies: Definitional Data Models (cf. Semantic Web)
  –Service Type Directory
Dynamic Design of Workflows – myGrid, Goble et al. (2002)
Challenge: Personalization
  –Intelligent Filtering Approach: User Modeling
  –“Users Who Used (Your) Specified Resources Also Used…”

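The “Users Who Used (Your) Specified Resources Also Used…” idea can be illustrated with simple co-usage counting over session logs. The sketch below is hypothetical throughout: the resource names, the `SESSIONS` log format, and the `also_used` function are invented for illustration and do not describe myGrid or any deployed Grid service.

```python
from collections import Counter

# Hypothetical usage logs: each session is the set of Grid resources one user touched.
SESSIONS = [
    {"blast_server", "yeast_microarray", "cluster_a"},
    {"blast_server", "yeast_microarray"},
    {"blast_server", "ontology_browser"},
    {"cluster_a", "ontology_browser"},
]


def also_used(selected, sessions, top=3):
    """Count resources that co-occur with every selected resource in a session."""
    selected = set(selected)
    counts = Counter()
    for used in sessions:
        if selected <= used:  # session includes all of the user's selections
            counts.update(used - selected)
    # Highest co-usage first; ties broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]


if __name__ == "__main__":
    for resource, n in also_used({"blast_server"}, SESSIONS):
        print(f"{resource}: used in {n} matching session(s)")
```

Raw counts favor globally popular resources; a user model of the kind proposed in this talk would condition on more context than co-occurrence alone.
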
DESCRIBER: An Experimental Intelligent Filter
Figure labels:
  –Domain-Specific Repositories
  –Experimental Data
  –Source Codes and Specifications
  –Data Models
  –Ontologies
  –Models
  –Data Entity and Source Code Repository Index for Bioinformatics Experimental Research
  –Personalized Interface
  –Domain-Specific Collaborative Filtering
  –New Queries
  –Learning and Inference Components
  –Historical Use Case & Query Data
  –Decision Support Models
  –Users of Scientific Document Repository
  –Interface(s) to Distributed Repository
Example Queries:
  –What experiments have found cell cycle-regulated metabolic pathways in Saccharomyces?
  –What codes and microarray data were used, and why?

DESCRIBER [1]: Overview
Figure labels:
  –Module 2: Learning & Validation of Bayesian Network Models for Use Cases
  –Module 4: Learning & Validation of Bayesian Network Models for MAGE Data & Codes
  –Relational Models of MAGE Data
  –Module 1: Intelligent Collaborative Filtering Front-End
  –Data
  –Historical Use Case & Query Data
  –Personalized Interface
  –Module 5
  –MAGE Data Model
  –User
  –Estimation of Constraint Parameters
  –Graphical Models of Use Cases
  –Module 3
  –Constrained Models of Use Cases
  –New Queries

DESCRIBER [2]: Collaborative Filtering Module
Figure labels:
  –Module 1
  –Intelligent Collaborative Filtering Front-End
  –Personalized Interface
  –Relational Models of (Domain-Specific) Data
  –Constrained Models of Use Cases
  –Relational Probabilistic Model
  –Constraint Selector
  –Integrated Reasoning Component: XML Validator and Constraint Checker
  –Constraints on Repository Content
  –Response to User
  –New Query from User

Computational Genomics and Microarray Data Mining
Figure labels:
  –Treatment 1 (Control)
  –Treatment 2 (Pathogen)
  –Messenger RNA (mRNA) Extract 1
  –Messenger RNA (mRNA) Extract 2
  –cDNA
  –DNA Hybridization Microarray (under LASER)

Components of A Microarray Experiment: Hybridization
Figure labels:
  –Publication (e.g., PubMed)
  –Source (e.g., Taxonomy)
  –Gene (e.g., GenBank)
  –Experiment
  –Sample
  –Hybridization
  –Array
  –Normalization/Discretization
  –Data

Components of A Microarray Experiment: Computational Gene Expression Modeling
Figure labels:
  –Computational Workflows (e.g., myGrid)
  –Experimental Services & Metadata (MAGE-ML XML)
  –Gene Expression Model
  –Pathway & Network Learning Specification
  –Data Preprocessing Specification
  –Parameter Learning Specification
  –Model Analysis Specification
  –Discretization Use Case
  –Data Mining Use Case
  –Feature Selection Specification
  –Validation (e.g., Bootstrap) Use Case

Graphical Models of Probability for Collaborative Filtering (CF)
Goal: Estimate P(X(t) | y(1), ..., y(r))
Filtering: r = t
  –Intuition: infer current state from observations
  –Applications: signal identification
  –Variation: Viterbi algorithm
Prediction: r < t
  –Intuition: infer future state
  –Applications: prognostics
Smoothing: r > t
  –Intuition: infer past hidden state
  –Applications: signal enhancement
CF Tasks
  –Plan recognition by smoothing
  –Prediction cf. WebCANVAS – Cadez et al. (2000)
Murphy (2002)

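The three inference tasks can be made concrete on a small hidden Markov model. The sketch below assumes the estimation target is P(X(t) | y(1), ..., y(r)), where t indexes the hidden state and r is the last observation, which is the reading consistent with the r = t, r < t, and r > t cases on this slide. The two-state "browse"/"buy" model, its probabilities, and all function names are hypothetical and chosen only for illustration.

```python
# Hypothetical two-state model: hidden user intent, observed click type.
STATES = ("browse", "buy")
PRIOR = {"browse": 0.5, "buy": 0.5}  # distribution over the first hidden state
TRANS = {"browse": {"browse": 0.7, "buy": 0.3},
         "buy": {"browse": 0.3, "buy": 0.7}}
EMIT = {"browse": {"view": 0.9, "cart": 0.1},
        "buy": {"view": 0.2, "cart": 0.8}}


def normalize(d):
    total = sum(d.values())
    return {k: v / total for k, v in d.items()}


def step(belief):
    """Push a belief one time step through the transition model."""
    return {s: sum(belief[p] * TRANS[p][s] for p in STATES) for s in STATES}


def filter_states(obs):
    """Filtering (r = t): forward pass giving P(X(t) | y(1..t)) for each t."""
    beliefs, belief = [], PRIOR
    for t, y in enumerate(obs):
        predicted = belief if t == 0 else step(belief)
        belief = normalize({s: predicted[s] * EMIT[s][y] for s in STATES})
        beliefs.append(belief)
    return beliefs


def predict(obs, horizon):
    """Prediction (r < t): filter up to r, then roll forward with no evidence."""
    belief = filter_states(obs)[-1]
    for _ in range(horizon):
        belief = step(belief)
    return belief


def smooth(obs):
    """Smoothing (r > t): forward-backward estimate of P(X(t) | y(1..r))."""
    forward = filter_states(obs)
    backward = {s: 1.0 for s in STATES}  # message from observations after t
    smoothed = [None] * len(obs)
    for t in range(len(obs) - 1, -1, -1):
        smoothed[t] = normalize({s: forward[t][s] * backward[s] for s in STATES})
        # Fold observation y(t) into the backward message for time t - 1.
        backward = {p: sum(TRANS[p][s] * EMIT[s][obs[t]] * backward[s]
                           for s in STATES) for p in STATES}
    return smoothed


if __name__ == "__main__":
    clicks = ["view", "cart", "cart"]
    print("filtered:", filter_states(clicks)[0])
    print("smoothed:", smooth(clicks)[0])
    print("predicted:", predict(clicks, horizon=2))
```

In this toy run, smoothing revises the estimate for the first step once the later "cart" clicks are seen, which is the sense in which plan recognition can be treated as smoothing.
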
Tools for Building Graphical Models
Commercial Tools: Ergo, Netica, TETRAD, Hugin
Bayes Net Toolbox (BNT) – Murphy (1997-present)
  –Distribution page
  –Development group
Bayesian Network tools in Java (BNJ) – Hsu et al. (1999-present)
  –Distribution page
  –Development group
  –Current (re)implementation projects for KSU KDD Lab
      Continuous state: Minka (2002) – Hsu, Guo, Perry, Boddhireddy
      Formats: XML BNIF (MSBN), Netica – Guo, Hsu
      Space-efficient DBN inference – Joehanes
      Bounded cutset conditioning – Chandak

Figure labels:
  –Learning Environment
  –Specification Fitness (Inferential Loss)
  –[A] Structure Learning
  –[B] Parameter Estimation
  –G = (V, E): Graph Component of BN
  –D: Data (User, Microarray)
  –B = (V, E, Θ): BN with Probabilities
  –D_val (Model Validation by Inference)
  –Graph nodes G1, G2, G3, G4, G5 (drawn twice)
  –Experimenters’ Workbench

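Step [B] in the diagram (parameter estimation for a fixed graph G) and validation against held-out data can be sketched as follows. This is an illustration under stated assumptions, not the BNJ implementation: the three-node structure, binary variables, Laplace smoothing constant `alpha`, and the use of held-out log-likelihood as a stand-in for the inferential loss are all choices made for the example. Structure learning (step [A]) is not shown.

```python
from collections import Counter
from itertools import product
from math import log

# Hypothetical structure G = (V, E): each variable maps to its tuple of parents.
PARENTS = {"G1": (), "G2": ("G1",), "G3": ("G1",)}
VALUES = (0, 1)  # assumed binary (discretized) variables


def estimate_parameters(data, parents, alpha=1.0):
    """Estimate CPTs as {var: {(parent_values, value): probability}}.

    Uses relative frequencies with Laplace smoothing (alpha pseudo-counts).
    """
    cpts = {}
    for var, pa in parents.items():
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        margin = Counter(tuple(row[p] for p in pa) for row in data)
        table = {}
        for ctx in product(VALUES, repeat=len(pa)):
            denom = margin[ctx] + alpha * len(VALUES)
            for val in VALUES:
                table[(ctx, val)] = (joint[(ctx, val)] + alpha) / denom
        cpts[var] = table
    return cpts


def log_likelihood(data, parents, cpts):
    """Score data under the network; higher (closer to 0) means better fit."""
    total = 0.0
    for row in data:
        for var, pa in parents.items():
            ctx = tuple(row[p] for p in pa)
            total += log(cpts[var][(ctx, row[var])])
    return total


if __name__ == "__main__":
    # Toy training set D and validation set D_val (illustrative values only).
    D = [{"G1": 1, "G2": 1, "G3": 0}, {"G1": 1, "G2": 1, "G3": 1},
         {"G1": 0, "G2": 0, "G3": 0}, {"G1": 0, "G2": 0, "G3": 1}]
    D_val = [{"G1": 1, "G2": 1, "G3": 1}]
    theta = estimate_parameters(D, PARENTS)
    print("P(G2=1 | G1=1) =", theta["G2"][((1,), 1)])
    print("validation log-likelihood:", log_likelihood(D_val, PARENTS, theta))
```

Comparing the validation score across candidate graphs is one simple way a wrapper could rank structures before committing to one.
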
References [1]: Intelligent Filtering, IR, and KDD
Intelligent Filtering
  –Taxonomy of Filtering Approaches: Rocha (2001)
  –Microsoft Research: Cadez et al. (1999), Heckerman and Meek (2002), Kadie (2002)
  –Technical report: survey, Hsu (2002)
  –NCSA Automated Learning Group
Machine Learning, Data Mining, and Knowledge Discovery
  –K-State KDD Lab: literature survey and resource catalog (2002)
  –Bayesian Network tools in Java (BNJ): Hsu, Guo, Joehanes, Perry, Thornton (2002)
  –Machine Learning in Java (MLJ): Hsu, Louis, Plummer (2002)

References [2]: The Grid and Bioinformatics
The Grid
  –United Kingdom eScience Initiative: Taylor et al. (2002)
  –Access Grid: Foster and Kesselman (1999), Foster (2002)
  –NSF NPACI lecture: Reed (10 Apr 2002)
Bioinformatics
  –European Bioinformatics Institute Tutorial: Brazma et al. (2001)
  –Hebrew University: Friedman, Pe’er, et al. (1999, 2000, 2002)
  –K-State BMI Group: literature survey and resource catalog (2002)
Kohavi (1998): “Crossing the Chasm”