Presentation on theme: "Page 1 Bayesian Automation Tool Presentation for MIT Bioinformatics & Metabolic Engineering Laboratory 16 September 2004 Copyright Poulin Holdings, LLC."— Presentation transcript:
Page 1 Bayesian Automation Tool Presentation for MIT Bioinformatics & Metabolic Engineering Laboratory 16 September 2004 Copyright Poulin Holdings, LLC and Hugin Expert A/S
Page 2 Overview I. Brief Project Overview/ Members/ History II. Hugin Discussion III. Poulin Discussion IV. Parsing Data V. Conclusion and Demos
Page 3 Project Overview Pattern and Prediction focused tool Bayesian non-linear approach Limited prior knowledge, let patterns emerge from data Allows observation and discovery Simple tool to reduce complexity (ease of use, automation, integration) Tool for all data types (text, images, sound…)
Page 4 Bayes Theorem Rev. Thomas Bayes (1702-1761), an 18th century priest from England The theorem, as generalized by Laplace, is the basic starting point for inference problems using probability theory as logic – assigns degree of belief to propositions
Page 5 Bayesian Technology Probablistic graphical models Model based approach to decision support Compact and intuitive graphical representation Sound & coherent handling of uncertainty Reasoning and decision making under uncertainty Bayesian networks and influence diagrams
Page 6 Bayesian Applications Medicine – forensic identification, diagnosis of muscle and nerve diseases, antibiotic treatment, diabetes advisory system, triage (AskRed.com) Software – software debugging, printer troubleshooting, safety and risk evaluation of complex systems, help facilities in Microsoft Office products Information Processing – information filtering, display of information for time-critical decisions, fault analysis in aircraft control Industry – diagnosis and repair of on-board unmanned underwater vehicles, prediction of parts demand, control of centrifugal pumps, process control in wastewater purification. Economy – prediction of default, credit application evaluation, portfolio risk and return analysis Military – NATO Airborne Early Warning & Control Program, situation assessment Agriculture – blood typing and parentage verification of cattle, replacement of milk cattle, mildew management in winter wheat
Page 8 Chris Poulin Project Lead Specific research interest in context analysis, epistemological issues in informatics. More information at http://www.poulinholdings.com
Page 9 Hugin Expert A/S? The market leader for more than a decade Highly skilled researchers and developers Has strategic cooperation internationally Has a clear strategy for maintaining its leadership as tool and technology provider Part of the worlds largest Bayesian research groups Has experience from numerous, large-scale, international R&D projects
Page 10 Client List USA Hewlett-Packard Intel Corporation Dynasty DrRedDuke Xerox Lockheed Martin NASA/Johnson Space Center Boeing Computer Service USDA Forest Service Information Extraction & Transport Inc. Pacific Sierra Research Price Waterhouse Swiss Bank Corporation Bellcore ISX Corporation Lam Research Corporation Orincon Corporation Integrate IT Charles River Analytics Northrop Grumman CHI Systems Inc Voyan Technology Los Alamos National Laboratory Rockwell Science Center Citibank Perkin Elmer Corporation Inscom Honeywell Software Initiative Aragon Consulting Group Raytheon Systems Company Kana Communications Sandia National Laboratories GE Global Research Westhollow Technology Center Great Britain Rolls-Royce Aerospace Group Philips Research Laboratories USB AG Motorola Defence Research Agency Nuclear Electric Plc Marconi Simulation Lucas Engineering & Systems ltd Lloyd´s Register BT Laboratories Brown & Root Limited Silsoe Research Institute Aon Risk Consultants Railtrack Shell Global Solutions Germany Siemens AG Volkswagen AG DaimlerChrysler AG GSF Medis Reutlingen Kinderklinik France PGCC Technologie Protectic Objectif Technologies Usinor Canada Decision Support Technologies Italy ENEA CRE Casassia C.S.E.L.T. Israel IBM Haifa Research Laboratory Australia Department of Defence, DSTO National Australian Bank Netherlands Shell International E&P Japan Sumitomo Metal Industries Dentsu Inc. Scandinavia Defence Research Agency Danish Defence Research Establishm. Aalborg Portland Danish Agricultural Advisory Center COWI FLS Automation Judex Datasystemer AON Denmark ABB Nykredit Swedpower South Africa CSIR
Page 12 Hugin Software? Product maturity and optimisation produce the worlds fastest Bayesian inference engine State-of-the-art capabilities based on internal and external research and development Practical experience and theoretical excellence combined form the basis of further product refinement High-performance and mission critical systems in numerous areas are constructed using Hugin software
Page 13 Implementation Cause and effect relations represented in an acyclic, directed graph Strengths of relations are encoded using probabilities Compute probabilities of events given observations on other events Fusion of data and domain knowledge Analyse results using techniques like conflict & sensitivity analysis
Page 15 Bayesian Expert Systems Induce structure of the graphical representation Fusion of data & expert knowledge Estimate parameters Fusion of data & expert knowledge Generative distribution
Page 16 Technology Summary A compact and intuitive graphical representation of causal relations Coherent and mathematically sound handling of uncertainty and decisions Construction and adaptation of Bayesian networks based on data sets Efficient solution of queries against the Bayesian network Analysis tools such as Data conflict, Explanation, Sensitivity, Value of information analysis
Page 17 What Does This Do For You? Reasoning and decision making under uncertainty supporting Diagnosis Prediction Process analysis and supervision Filterting & classification Control Troubleshooting Predictive maintenance …
Page 18 General purpose decision support Hugin Explorer Hugin graphical user interface Hugin Developer Hugin graphical user interface Hugin decision engine APIs (C, C++, Java) and ActiveX server Troubleshooting Hugin Advisor A suite of tools for troubleshooting Data mining Hugin Clementine Link Hugin Products
Page 20 Vision To create an application that would provide automation for The Hugin Decision Engine. Focus on main Bayesian Inference capabilites Build automation capabile command line tool Build data parser for formating of structured/unstructured data Divide problem space and build meta-database Integrate with Hugin GUI for human based knowledge discovery
Page 21 Review of Bayesian Networks A Bayesian network consists of: A set of nodes and a set of directed edges between nodes The nodes together with the directed edges form a directed acyclic graph (DAG) Each node has a finite set of states Attached to each node X with parents there is a conditional probability table A knowledge representation for reasoning under uncertainty
Page 22 Methodology The Naive Bayes Model Structure, variables and states, Discretization using Principle of Maximum Entropy Parameter estimation using EM The Tree Augmented Naive Bayes Model Interdependence relations between information variables based on mutual information (extra step compared to NBM) Model update by adding new nodes as in NBM Value of information (variables and cases) Evidence sensitivity analysis (what-if)
Page 23 Functionality Model construction - build Naive Bayes Model orTree Augmented NBM Model update - Add additional information variables Inference - compute probability of target given evidence What-if analysis - robustness of probabilities Value of Information analysis Which case is most informative Which observation is most informative Command-line interface AnalysisModelParser
Page 24 Features A custom application template. A prototype for construction of a Naive Bayes Model Updating a Naive Bayes Model Construction of a Tree Augmented Naive Bayes Model Inference base on a case What-if sensitivity analysis (a single piece of evidence) Value-of-information analysis (cases and observations) Error handling and tracing have been kept at a minimum. Implemented in C++ using Hugin Decision Engine 6.3 Runs on Windows 2k, Linux Redhat, Sun Solaris. Program documentation in HTML.
Page 25 Data Sample Source data are from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). Used average daily temperature (of 24 hourly temperature readings) in 145 US cities measured from January 1, 1995 to December 29, 2003. Data of 3,255 cases split into subsets for learning, updating, cases, and case files. learning: 2000 cases update: 1000 cases cases: 10 cases case files: 245 cases 2,698 missing values out of 471,975 entries: 0.006% missing values.
Page 26 Discretization Measures on average daily temperatures are continuous by nature. Continuous variables can be represented as discrete variables through discretization. Determining intervals: how many, width, equally sized,... ? We discretize using the principle of maximum entropy, but can easily make equidistant ( uniform ) discretization.
Page 27 Discretization Cont. Entropy can be considered as a measure of information. Obtain uninformative distribution under current information. Principle of Maximum Entropy By choosing to use the distribution with the maximum entropy allowed by our information, the argument goes, we are choosing the most uninformative distribution possible. To choose a distribution with lower entropy would be to assume information we do not possess; to choose one with a higher entropy would violate the constraints of the information we do possess. Thus the maximum entropy distribution is the only reasonable distribution. Discretize variables to have uniform distribution based on data.
Page 28 Model Specification A Bayesian network consists of a qualitative part, the graph structure (DAG G). quantitative part, the the conditional probability distributions (P). Model specification consists of two parts. A Bayesian network N is minimal if and only if, for every node X and for every parent Y, X is not independent of Y given the other parents of X
Page 29 Naive Bayes Model A well-suited model for classification tasks and tasks of the following type An exhaustive set of mutex hypotheses h1; : : : ; hn are of interest Measures on indicators I1; : : : ; In to predict hi The Naive Bayes Model h1; : : : ; hn are represented as states of a hypothesis variable H Information variables I1; : : : ; In are children of H The fundamental assumption is that I1; : : : ; In are pairwise independent when H is known. Computationally and representationally a very efficient model that provides good results in many cases.
Page 30 Naive Bayes Model Cont. The Naive Bayes Model in more detail Let the possible hypotheses be collected into one hypothesis variable H with prior P(H). For each information variable I, acquire P(I | H) = L(H | I). For any set of observations calculate: The posterior is where The conclusion may be misleading as the assumption may not hold
Page 31 email@example.com Monday, May 11 th, 2004 NBM Model Construction application -nbm [-verbose] This command builds a NBM model from the data contained in with as the hypothesis variable. All variable will have a maximum of states. As many as iterations of the EM algorithm will be performed. The model constructed is saved in file "nmb.net", which can be loaded into Hugin Graphical User Interface for inspection. Example: application -nmb model.dat MDWASHDC 2 1
Page 32 Tree Augmented Naive Bayes Model Let M be a Naive Bayes Model with hypothesis H and information variables I = fI1; : : : ; Ing We can use I(Ii; Ij j H) to measure the conditional dependency between two information variables Ii; Ij conditional on H. After computing I(Ii; Ij j H) for all Ii; Ij, we use Kruskals algorithm to find a maximum weight spanning tree T on I: The edges of T are directed such that no variable has more than two parents (H and one other I). Complexity of inference becomes polynomial in the number of information variables.
Page 33 TAN Model Construction application -tan [-verbose] This command builds a Tree-Augmented Naive Bayes model (TAN) from the data contained in with as the hypothesis variable. All variable will have a maximum of states. As many as iterations of the EM algorithm will be performed. The model constructed is saved in file "tan.net", which can be loaded into Hugin Graphical User Interface for inspection. Example: application -tan model.dat MDWASHDC 2 1
Page 34 Model Updates application -update [-verbose] This command updates a model with data contained in. is the hypothesis variable of the model stored in. Cities in the data not represented in the original model will be added to the model as children of the hypothesis variable (no structure between information variables is added). The data file should contain measures on all cities (old and new). All new variables will have a maximum of states. As many as iterations of the EM algorithm will be performed. The updated model is saved in file "update.net", which can be loaded into Hugin Graphical User Interface for inspection. Example: application -update update.dat model.net MDWASHDC 2 1
Page 35 Parameter Estimation Parameter learning is identification of the CPTs of the Bayesian network. theoretical considerations, database of cases, subjective estimates. The CPTs are constructed based on a database of cases D = fc1; : There may be missing values in some of the cases indicated by N/A. The CPTs are learned by maximum likelihood estimation: where n(Y = y) is the (expected) number of cases for which Y = y.
Page 37 Parameter Estimation Prior (domain expert) knowledge can be exploited. Experience is the number of times pa(Xi) = j has been observed. Experience count is positive number j > 0. Also used to turn on/off learning. Prior knowledge is used both to speed up and guide learning in search of global optimum Expected counts used when values are missing. Including parameters not appearing in the data. The EM algorithm is an iterative procedure using the current estimate of the parameters as the true values. In the first run the initial content is used.
Page 38 Inference application -inference [-verbose] This command performs inference in, which has as hypothesis variable. The posterior distribution in is displayed for the case stored in the file Example: application -inference nbm.net MDWASHDC case.hcs
Page 39 VOI in Bayesian Networks How do we perform value of information analysis without specifying utilities? The reason for acquiring more information is to decrease the uncertainty about the hypothesis. The entropy is a measure of how much probability mass is scattered around on the states (the degree of chaos). Thus, where Entropy is a measure of randomness. The more random a variable is, the higher entropy its distribution will have.
Page 40 Value of Information If the entropy is to be used as a value function, then We want to minimize the entropy
Page 41 What is the expected most informative observation ? A measure of the reduction of the entropy of T given X. The conditional entropy is Let T be the target, now select X with maximum information gain Variables Value of Information
Page 42 Variables Value of Information Assume we are interested in B, i.e. B is target: We are interested in observing variable Y with most information on B: We select to observe and compute: Thus,
Page 43 Variables VOI Command Line application -voivariables [-verbose] This command performs a value-of-information analysis on each non- observed city given the observations in relative to. That is, for each unobserved variable in, a measure of how well the variable predicts is displayed. Example: application -voivariables nbm.net MDWASHDC case2.hcs
Page 44 Cases Value of Information Assume T is the target of interest and assume we have a database of cases D = fc1; : : : The uncertainty in T can be measured as H(T): A high value of H(T) indicates high uncertainty A low value of H(T) indicates low uncertainty Entropy for the binary case E(H) We compute H(T j c) for all cases c. The case c producing the lowest value of H(T j c) is considered most informative wrt. T.
Page 45 Cases VOI Command Line application -voicases [-verbose] This command performs a value-of-information analysis on each case in relative to. That is, for each case a measure of how well the case predicts is displayed. Example: application -voicases tan.net MDWASHDC cases.dat
Page 46 Evidence Sensitivity Analysis Let = f1; : : : ; ng be a set of observations and assume a single hypothesis h is of interest. What-if the observation i had not been made, but instead ? Involves computing P(h j [ f0ig n fig) and comparing results. This kind of analysis will help you determine, if a subset of evidence acts for or against a hypothesis.
Page 47 What-If Analysis What happens to the temperature inWashington, DC if the temperature in Austin, TX changes? Assume evidence = f1; : : : ; ng and let i be the measured temperature in Austin, TX We compute P(T = t j n fig [ f0ig) for all Myopic what-if analysis: change finding on one information marginal and monitor the change in probability of T
Page 48 What-If Analysis: Cases application -whatif This command performs a what-if analysis on each instantiated variable in the case file relative to. That is, the posterior distribution of each hypothesis (each state of the target variable) is displayed for each possible value of the observed variable. Example: application -whatif model.net MDWASHDC case.hcs
Page 49 What-If Analysis: Variables application -whatif This command performs a what-if analysis on relative to. That is, the posterior distribution of each hypothesis (each state of the target variable) is displayed for each possible value of the indicated variable. Example: application -whatif model.net MDWASHDC case.hcs TXAUSTIN
Page 50 Help application –help This command displays as simple help.
Page 52 Parser: Formatting the Data Hugin Data format is CSV format. Specific file extension is.hcs Structured Parsing: Weather Structured Parsing: Generic HTML Unstructured Parsing: HTML/Text
Page 53 Structured Parsing: Weather parse -weather Extracts the temperature data from the specified input files (any number of input files can be specified). The collected temperature data is saved to a file named. Data is now suitable for use as input to the Hugin learning algorithms.
Page 54 Structured Parsing: Generic HTML parse -struct-html [ ] parse -struct-url [ ] Commands extract a table from the specified HTML source. The first command uses a local file as the source, while the second command retrieves the source given by a URL. If the HTML contains more than one table, the first table that does not contain another table is output. If no table is found, nothing is output. If the table is sufficiently regular,'' the created file can be used as input to the Hugin learning algorithms. If the argument is given, the table is saved to a file with the given name. Otherwise, the table is written to standard output.
Page 55 Unstructured Parsing: HTML/Text parse -unstruct-html html-file> [ ] parse -unstruct-url [ ] Commands extracts text from a file, strips all HTML tags. Outputs variables to be used for non-interval Boolean classification. The first command uses a local file as the source, while the second command retrieves the source given by a URL. If the argument is given, the table is saved to a file with the given name. Otherwise, the table is written to standard output.
Page 56 Ongoing Development OCT.1 Release, will include: ODBC Parsing - 1.0 Release Boolean Classification Parsing - 1.0 Release Versions 1.x: XML Parsing Higher level commands (parse + model construction) Hierarchical Modeling (latent variable definition) Industry Specific Data Parsers Incidence valuation (Context Engine Example) Value of Models: "Digital Epistemology"
Page 57 Conclusion: Applications areas Genomics / Protemics: Cellular differentiation Gene suppression and activation modeling Time sequence expression Other complex models of correlation (metabolic pathway modeling) Medical: Pathogen correlations Search Engines: Context awareness Email: SPAM classification API Pricing Optimization: Public/Private Markets Event Modeling: Disaster/Earthquake Prediction Physics: Image and sound analysis