Pavel B. Klimov, Barry M. OConnor
University of Michigan, Museum of Zoology, 1109 Geddes Ave., Ann Arbor, MI

The next generation of identification tools: interactive programs incorporating multivariate models

Context: The vast majority of interactive identification programs use a sequential approach to assign an unknown specimen to a known group. This algorithm works when the distinguishing characters do not have overlapping values. If the boundaries between taxa overlap, simultaneous (=probabilistic, matching) methods of identification are more likely to lead to the correct assignment, but these methods usually require time-consuming measurements or experiments. We discuss how the sequential approach can be enhanced by incorporating multivariate statistics.

1. INTRODUCTION

Computer-assisted interactive identification allows quick assignment of an unknown specimen to a known taxon with minimal costs in obtaining data and learning about the unknown. The number of characters used in the identification is substantially reduced compared to traditional taxonomic keys. For example, any of 128 taxa can be identified using only eight binary characters, or even fewer numeric or multistate characters.

There are two major approaches to identification: sequential (=elimination, diagnostic) and simultaneous (=probabilistic, matching). In the sequential approach, only one character is used at each step of identification until the unknown specimen is assigned to a particular group. In the simultaneous approach, some or all characters are entered at once, and the probability of group membership of the unknown specimen is calculated.

The advantage of the sequential algorithm, particularly its multi-entry variant (=freedom to choose any character), is obvious when a taxon set is large and the taxa have distinct boundaries. At each step, taxa matching the unknown are retained, and diagnostic characters for this subset are ordered according to their separating power. This algorithm has been implemented in a variety of interactive identification programs, such as DELTA and Lucid, that are widely used at present. In contrast, simultaneous methods usually require data obtained by time-consuming measurements or experiments and are less flexible in terms of the freedom of choosing characters, but they are more likely to lead to the correct assignment when the boundaries between some or all taxa overlap. When a data set is large and contains taxa that cannot be completely separated using qualitative or uni- or bivariate characters, a combination of both methods of identification is required, with each approach handling the appropriate data.

2. MULTIVARIATE MODELS

Multivariate statistics summarizes variation in many variables across many specimens in the form of a concise model that contains essential and comprehensive information about the groups and that has predictive power. We consider two multivariate techniques commonly used to analyze intergroup differences: canonical variates analysis (CVA) and binomial logistic regression (LR). Both analyses handle metric and non-metric independent variables.

A canonical variates function is a latent variable created as a linear combination of independent variables,

    CV = b1*x1 + b2*x2 + ... + bn*xn + c    (1)

where the b's are coefficients, the x's are independent variables, and c is a constant. If there are g groups, g-1 canonical variates are calculated.
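As an illustration (not part of the original poster), the following minimal Python sketch fits a canonical-variates-style model and scores an unknown specimen. The taxa, measurements, and values are hypothetical; scikit-learn's LinearDiscriminantAnalysis is used here as one readily available implementation of a linear discriminant of the form in eq. (1).

```python
# Sketch: canonical variates / linear discriminant scoring of an unknown specimen.
# All data below are invented for illustration only.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Reference specimens: rows are specimens, columns are measurements x1..x3.
X = np.array([
    [210.0, 35.0, 12.0],   # taxon A specimens
    [205.0, 33.0, 11.5],
    [198.0, 34.0, 12.3],
    [245.0, 41.0, 15.0],   # taxon B specimens
    [250.0, 43.0, 14.2],
    [238.0, 40.0, 14.8],
])
y = np.array(["taxon_A", "taxon_A", "taxon_A",
              "taxon_B", "taxon_B", "taxon_B"])

# Linear discriminant, i.e. a function of the form CV = b1*x1 + ... + bn*xn + c (eq. 1).
cva = LinearDiscriminantAnalysis()
cva.fit(X, y)

unknown = np.array([[242.0, 42.0, 14.5]])   # measurements of the unknown specimen
cv_score = cva.transform(unknown)           # position on the canonical variate
posterior = cva.predict_proba(unknown)      # posterior probability of each group

print("CV score:", cv_score.ravel())
print("posteriors:", dict(zip(cva.classes_, posterior.ravel())))
print("assigned to:", cva.predict(unknown)[0])
```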
For assignment purposes, the estimated posterior probability of group membership is calculated; alternatively, when multivariate normality of the independent variables is assumed, the value of CV can be used equivalently.

Logistic regression models can be expressed as the following equation,

    P(0) = exp(b1*x1 + b2*x2 + ... + bn*xn + c) / (1 + exp(b1*x1 + b2*x2 + ... + bn*xn + c))    (2)

where P(0) is the probability that the unknown specimen belongs to taxon 0; the other notation is the same as for CVA above. If P(0) exceeds 0.5, the unknown is assigned to taxon 0, otherwise to taxon 1. A great advantage of LR over CVA is that it is a direct estimator of posterior probabilities: it calculates the class posterior probabilities without ever estimating the classes' individual density functions, which would require additional data (group means, prior probabilities, and the within-groups mean square).

3. INCORPORATING THE MODELS IN THE SEQUENTIAL ALGORITHM

Both (1) and (2) can be used in any sequential identification program as a single character, "Model classifies the unknown specimen to", with the character states "group 1, group 2, ..., group n". The user, however, should simply be asked to enter measurements or observations x1, x2, ..., xn; the Bayesian posterior probabilities of membership in each group are then calculated, and the greatest of these probabilities is used to classify the specimen. A sketch of such a "model" character is given after the Results section below.

Implementation of this new data type will require some adjustment to the internal logic of an identification program. In the general case, some characters in the identification matrix can separate a subset of taxa without recourse to multivariate models. These characters, whether binary, multistate, or variable, should be given more weight than the complex character generated by a multivariate model. The latter should also be coded only for the subset of taxa included in the model, and this character should be coded as "missing" for the other taxa. Because a multivariate model may contain characters that are used elsewhere in the identification matrix, these matching characters should be cross-referenced.

RESULTS

The optimal way to identify specimens when a data matrix contains both discrete and overlapping groups is to combine the sequential and probabilistic strategies, applying each to the appropriate data. Canonical variates and logistic regression models can be used within the sequential approach to calculate posterior probabilities and to classify the unknown specimen.

Research supported by NSF DEB (PEET) and the USDA (CSREES # ).
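The short Python sketch below (an illustration added to this transcript, not part of the original poster) shows how the logistic regression of eq. (2) could be evaluated as the single "model" character described in Section 3. The coefficients, constant, and measurements are hypothetical; in practice they would come from a fitted model stored alongside the identification matrix.

```python
# Sketch: evaluating the character "Model classifies the unknown specimen to"
# for a pair of taxa that simpler characters cannot separate.
# Coefficients and measurements below are invented for illustration.
import math

def lr_posterior(x, b, c):
    """P(taxon 0 | x) from eq. (2): exp(b.x + c) / (1 + exp(b.x + c))."""
    z = sum(bi * xi for bi, xi in zip(b, x)) + c
    return 1.0 / (1.0 + math.exp(-z))

def model_character(x, b, c, taxa=("taxon 0", "taxon 1")):
    """Return the character state and the posterior probability supporting it."""
    p0 = lr_posterior(x, b, c)
    return (taxa[0], p0) if p0 > 0.5 else (taxa[1], 1.0 - p0)

# Hypothetical fitted model and user-entered measurements x1..x3.
b = [0.042, -1.3, 0.75]           # coefficients b1..b3
c = -4.8                          # constant
x_unknown = [240.0, 2.1, 5.6]

state, prob = model_character(x_unknown, b, c)
print(f"Model classifies the unknown specimen to: {state} (P = {prob:.2f})")
```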