
PROBABILISTIC ASSESSMENT OF THE QSAR APPLICATION DOMAIN

Nina Jeliazkova (1), Joanna Jaworska (2)
(1) IPP, Bulgarian Academy of Sciences, Sofia, Bulgaria
(2) Central Product Safety, Procter & Gamble, Brussels, Belgium

Abstract

A key element of prediction quality is determining whether a QSAR model is suitable for predicting the queried chemical, and so preventing its potential [misuse]. The training data set from which a QSAR model is derived provides a basis for estimating its application domain. We demonstrate that the classic approaches, based on interpolation regions of the training data set, may not be sufficient. We propose a refined concept of interpolation regions based on a probability density distribution approach. This approach is more robust because it can identify the actual distribution of the data. However, the presence of a point in an empty or a dense region can serve only as a warning about model applicability, not as a final decision on the correctness of the predicted values. Uncertainty estimation of the model prediction is necessary to complete the process of rejecting or accepting the model result.

Results and conclusions

What is an application domain?

The application domain is the projection of the training set in the model's parameter space, NOT the space where the prediction uncertainty is known.

1D: parameter intervals determine the interpolation region.
2D, 3D, nD: are parameter intervals sufficient?

If an experimental design was used during model development, the parameter space is covered in a homogeneous way. BUT this is rare in QSAR model development, and empty space within the parameter range is possible. We need to refine the approach so that it can identify the empty regions within the interpolated space.

Classic approaches to estimate interpolation regions

Descriptor ranges
Distances
Geometric
Probabilistic for standard distributions (mostly 1D tests)

How to assess it?
We need to find the interpolation regions in the model's parameter space. It is interpolation because, in general, QSAR models are statistical models.

Generic probabilistic approach in a multivariate space

The algorithm we use to estimate the probability density of an n-dimensional data set is:

1. Standardize the training data set (scale and center).
2. Extract the principal components of the data set.
3. Perform a skewness correction transformation along each principal component.
4. Estimate the one-dimensional density on each transformed principal component.
5. Estimate the n-dimensional density p(x).
6. Estimate the smallest regions containing a specified percent of the data points (e.g. 99%, 90%, 75%, 50%, 25%, 10% of the data; see the colored regions in the figure at the right).
7. For a query point (e.g. a new compound), transform the point according to transformations 1), 2) and 3), obtain the one-dimensional density value on each transformed principal component, then multiply over the principal components to obtain the n-dimensional density.
8. Determine whether the query point lies inside or outside the region of the parameter space that contains the specified percent of the training set data points.

Figure panels:
(a) a product of 1D densities in the original descriptor space
(b) a joint 2D kernel density estimation in the descriptor space
(c) a product of 1D densities in the principal components space
(d) a joint 2D kernel density estimation in the principal components space
(e) same as (c) with skewness correction
(f) same as (d) with skewness correction
(g) Weighted Euclidean distance
(e) Mahalanobis distance

Data transformation is necessary to obtain the true shape.

Figure panel: (a) Density estimated without PCA and skewness correction

We refined the concept of interpolation by using a data distribution approach. The method is unique in its ability to detect empty regions within the interpolated space. On average, predictions within the interpolation space are more accurate than predictions outside it.
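The eight-step density procedure listed above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the poster does not name the skewness correction or the density estimator, so the sketch assumes a Yeo-Johnson power transform (exponent picked from a small grid) and a Gaussian kernel density estimate with a normal-reference bandwidth. The names `DensityDomain`, `coverage`, `yeo_johnson` and `kde_1d` are hypothetical.

```python
import numpy as np

def skewness(v):
    """Sample skewness (third standardized moment)."""
    c = v - v.mean()
    s = c.std()
    return 0.0 if s == 0 else float(np.mean(c ** 3) / s ** 3)

def yeo_johnson(v, lam):
    """Yeo-Johnson power transform (assumed skewness correction)."""
    v = np.asarray(v, dtype=float)
    out = np.empty_like(v)
    pos = v >= 0
    if abs(lam) > 1e-12:
        out[pos] = ((v[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(v[pos])
    if abs(lam - 2.0) > 1e-12:
        out[~pos] = -((1.0 - v[~pos]) ** (2.0 - lam) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-v[~pos])
    return out

def kde_1d(train, query):
    """Gaussian KDE with a normal-reference bandwidth (assumed estimator)."""
    n = train.size
    h = 1.06 * train.std(ddof=1) * n ** (-0.2)
    z = (query[:, None] - train[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

class DensityDomain:
    """Assumes no constant descriptor and no zero-variance component."""

    def fit(self, X, coverage=0.95):
        X = np.asarray(X, dtype=float)
        # Step 1: center and scale.
        self.mean, self.std = X.mean(axis=0), X.std(axis=0, ddof=1)
        Z = (X - self.mean) / self.std
        # Step 2: principal axes from the SVD of the standardized data.
        _, _, vt = np.linalg.svd(Z, full_matrices=False)
        self.components = vt.T
        # Step 3: per-component exponent that minimizes |skewness|.
        grid = np.linspace(0.0, 2.0, 21)
        self.lams = [min(grid, key=lambda l, c=c: abs(skewness(yeo_johnson(c, l))))
                     for c in (Z @ self.components).T]
        self.train = self._transform(X)
        # Steps 4-6: density of the training points, then the density level
        # that bounds the smallest region holding `coverage` of them.
        self.threshold = np.quantile(self._density(self.train), 1.0 - coverage)
        return self

    def _transform(self, X):
        scores = ((np.asarray(X, dtype=float) - self.mean) / self.std) @ self.components
        return np.column_stack(
            [yeo_johnson(c, l) for c, l in zip(scores.T, self.lams)])

    def _density(self, T):
        # Step 5: product of the 1D densities over the principal components.
        dens = np.ones(T.shape[0])
        for j in range(T.shape[1]):
            dens *= kde_1d(self.train[:, j], T[:, j])
        return dens

    def density(self, X):
        # Step 7: same transformations for query points, then the product.
        return self._density(self._transform(np.atleast_2d(X)))

    def inside(self, X):
        # Step 8: inside the chosen region if the density reaches its level.
        return self.density(X) >= self.threshold
```

With synthetic training data made of two well-separated clusters, a query point midway between them lies inside every descriptor range yet falls below the density threshold, which is the empty-region case the poster describes. The product of 1D densities treats the transformed components as independent; a joint kernel density estimate, as in panels (b), (d) and (f), would be the alternative.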
Example

Root mean square error of the test set inside and outside of the application domain: SRC Kowwin (see poster), Phenols.

The application domain filter alone is not sufficient for rejecting or accepting the results of the modeling. One needs to assess the uncertainty of the prediction and ultimately make the decision based on both criteria.

Figure panels:
(a) Density estimated without PCA and skewness correction
(b) Density estimated with PCA and skewness correction
(d) Density approximated with weighted Euclidean distance
(d) Density approximated with Euclidean distance
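The inside/outside error comparison and the two-criteria decision described in this example can be sketched as follows. This is an illustrative sketch only: the function names, the `max_uncertainty` cut-off and the form of the uncertainty estimate are assumptions, since the poster does not specify how the two criteria are combined, and no values from the poster's data sets are reproduced here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error; NaN when the group is empty."""
    if not y_true:
        return float("nan")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_by_domain(y_true, y_pred, inside):
    """Split test compounds by the domain flag and report the RMSE of each group."""
    groups = {"inside": ([], []), "outside": ([], [])}
    for t, p, flag in zip(y_true, y_pred, inside):
        obs, pred = groups["inside" if flag else "outside"]
        obs.append(t)
        pred.append(p)
    return {name: rmse(obs, pred) for name, (obs, pred) in groups.items()}

def accept_prediction(inside, pred_uncertainty, max_uncertainty):
    """Hypothetical rule: accept only if BOTH criteria pass."""
    return bool(inside) and pred_uncertainty <= max_uncertainty
```

Calling `rmse_by_domain(observed, predicted, flags)` with the inside/outside flags of a test set gives the two error figures side by side, while `accept_prediction` shows one simple way to require both the domain check and an uncertainty bound before trusting a prediction.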