“PREDICTIVE MODELING” CoSBBI, July 11 2013 Jennifer Hu.

REVIEW: CONDITIONAL PROBABILITY
The conditional probability of an event A, given the knowledge that an event B has already occurred, is denoted P(A|B). If P(B) > 0, then P(A|B) = P(A ∩ B) / P(B). If events A and B are independent, so that event B has no effect on the probability of event A, the conditional probability of A given B is simply the probability of A, that is, P(A|B) = P(A). Draw a Venn diagram to convince yourself if this is not clear.

Example Questions
1. Two fair dice are thrown. Given that the first shows 3, what is the probability that the total exceeds 6?
2. A family has two children.
   a) What is the probability that both are boys, given that at least one is a boy?
   b) What is the probability that both are boys, given that the older one is a boy? (For student)
3. A machine produces parts that are either good (80%), slightly defective (10%), or obviously defective (10%). Produced parts are passed through an automatic inspection machine, which detects any part that is obviously defective and discards it. What is the probability that a part is good, given that it passed the machine? (For student)
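Questions 1 and 2a can be checked by brute-force counting. The sketch below is only an illustration: the helper name `conditional_probability` is made up for this example, and question 2a assumes that boys and girls are equally likely and that the two births are independent.

```python
from fractions import Fraction
from itertools import product


def conditional_probability(outcomes, event, given):
    """P(event | given) over equally likely outcomes, computed by counting."""
    conditioned = [o for o in outcomes if given(o)]
    favourable = [o for o in conditioned if event(o)]
    return Fraction(len(favourable), len(conditioned))


# Question 1: two fair dice; condition on the first die showing 3.
dice = list(product(range(1, 7), repeat=2))
q1 = conditional_probability(
    dice,
    event=lambda o: o[0] + o[1] > 6,
    given=lambda o: o[0] == 3,
)

# Question 2a: children listed as (older, younger); condition on "at least one boy".
children = list(product("BG", repeat=2))
q2a = conditional_probability(
    children,
    event=lambda o: o == ("B", "B"),
    given=lambda o: "B" in o,
)

print(q1, q2a)  # 1/2 1/3
```

Changing the `given` function to condition on the older child is enough to explore question 2b.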

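A little bit more on conditional probabilities:
Lemma (law of total probability): For any events A, B such that 0 < P(B) < 1, P(A) = P(A|B) P(B) + P(A|B^c) P(B^c), where B^c is the complement of B. More generally, if B_1, B_2, …, B_n is a partition of the sample space S such that P(B_i) > 0 for every i, then P(A) = P(A|B_1) P(B_1) + P(A|B_2) P(B_2) + … + P(A|B_n) P(B_n).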

Illustrative question: The Superhero Elixir
Only 2 pharmaceutical companies manufacture the Superhero Elixir ©. 20% of the elixir samples from company I and 5% from company II are defective and will turn you into a slobbering monster upon consumption. Company I produces twice as much elixir as company II each week.
a) Your friend presents you with a vial of Superhero Elixir ©, randomly chosen from 1 week’s production. You immediately take it. What is the probability that you become a superhero?
b) Unfortunately, you draw the short end of the stick and turn into a slobbering monster instead. Is company I to blame? Compute the probability that your elixir was produced by company I.
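One way to work through the elixir question is to write the law of total probability and the definition of conditional probability as exact arithmetic. This is a sketch, not the only route to the answer; the variable names are chosen for this example, and "twice as much" is read as company I supplying 2/3 of the weekly production.

```python
from fractions import Fraction

# Share of one week's production coming from each company (I makes twice as much as II).
p_company = {"I": Fraction(2, 3), "II": Fraction(1, 3)}
# Probability that a vial is defective, given the company that made it.
p_defective_given = {"I": Fraction(1, 5), "II": Fraction(1, 20)}

# Part a: law of total probability over the partition {company I, company II}.
p_defective = sum(p_company[c] * p_defective_given[c] for c in p_company)
p_superhero = 1 - p_defective

# Part b: P(company I | defective) = P(defective | I) P(I) / P(defective).
p_company_I_given_defective = (
    p_defective_given["I"] * p_company["I"] / p_defective
)

print(p_superhero, p_company_I_given_defective)  # 17/20 8/9
```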

BAYES’ THEOREM
Simple form: P(A|B) = P(B|A) P(A) / P(B), provided P(B) > 0. You should all be able to prove this in one line. (Hint: Recall the definition of conditional probability.)
Now let’s talk a bit about diagnostic tests.
Q1: What is sensitivity?
Q2: What is specificity?
Q3: What is positive predictive value?
Q4: What is negative predictive value?

Definitions
Sensitivity: true positive rate (e.g. the percentage of sick people who are correctly identified as having the condition).
Specificity: true negative rate (e.g. the percentage of healthy people who are correctly identified as not having the condition).
Positive predictive value: given that you test positive, the probability that you actually have the condition.
Negative predictive value: given that you test negative, the probability that you actually do not have the condition.

Example Questions
1. The prevalence of streptococcal pharyngeal infection in a small village with 500 people is 10%. A new test with sensitivity 90% and specificity 95% has been developed.
   a) What is the positive predictive value (PPV)?
   b) What is the negative predictive value (NPV)?
   c) There is another village with prevalence 20%. What are the PPV and NPV in this case? (For student)
2. Suppose that a drug test is 99% sensitive and 99% specific. 1% of people use the drug. You test positive. What is the probability that you use the drug? (For student)
Remember the question you were asked on the pre-test? “The zombie apocalypse has come upon us in the form of a virus…” You should be able to answer it now.
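Questions 1a and 1b follow directly from Bayes' theorem once prevalence, sensitivity, and specificity are written as probabilities. The function below is a minimal sketch; its name `predictive_values` is invented for this example, and it works on proportions, so the village size of 500 is not needed.

```python
from fractions import Fraction


def predictive_values(prevalence, sensitivity, specificity):
    """Return (PPV, NPV) from prevalence, sensitivity, and specificity."""
    true_pos = sensitivity * prevalence               # sick and test positive
    false_neg = (1 - sensitivity) * prevalence        # sick but test negative
    true_neg = specificity * (1 - prevalence)         # healthy and test negative
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but test positive
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Question 1a and 1b: prevalence 10%, sensitivity 90%, specificity 95%.
ppv, npv = predictive_values(Fraction(1, 10), Fraction(9, 10), Fraction(19, 20))
print(ppv, npv)  # 2/3 171/173
```

The same function can be called with a different prevalence to see how strongly PPV depends on how common the condition is.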

BAYESIAN INFERENCE
Treat unknown quantities as random variables, so that probabilities can be assigned to them. Use Bayes’ theorem to systematically update prior knowledge in the presence of observed data. Let’s now work out the example in chapter 2 of your reading. Assume the organism has 20,000 genes. The gold standard is as follows (“positive example” = associated with disease, “negative example” = not associated with disease): Under condition A, gene i performs above the median. Given this observation, what is the probability that gene i is associated with the disease (is a “positive example”)?

But we also have datasets derived from conditions B and C! What do we do with those? Assume that experimental results from different datasets are independent. Then we can use the probability we just derived as the prior probability and perform the same calculation for gene i in experimental condition B, and then repeat the calculation for condition C using the posterior from condition B as the prior.
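The chaining of posteriors described above can be sketched in a few lines. All numbers here are hypothetical placeholders, not the gold-standard counts from the reading: `prior` stands for the fraction of genes that are positive examples, and each pair in `likelihoods` stands for P(observation | positive) and P(observation | negative) under conditions A, B, and C.

```python
def bayes_update(prior, p_obs_given_pos, p_obs_given_neg):
    """Posterior P(positive | observation) from a prior and two likelihoods."""
    numerator = p_obs_given_pos * prior
    evidence = numerator + p_obs_given_neg * (1 - prior)
    return numerator / evidence


# Hypothetical values for illustration only.
prior = 0.01
likelihoods = [(0.8, 0.5), (0.7, 0.5), (0.6, 0.5)]  # conditions A, B, C

# Sequential updating: the posterior from one condition is the prior for the next.
posterior = prior
for p_pos, p_neg in likelihoods:
    posterior = bayes_update(posterior, p_pos, p_neg)

# Equivalent single update using the product of likelihoods, which is valid
# only under the stated assumption that the datasets are independent.
joint_pos, joint_neg = prior, 1 - prior
for p_pos, p_neg in likelihoods:
    joint_pos *= p_pos
    joint_neg *= p_neg
batch_posterior = joint_pos / (joint_pos + joint_neg)

print(posterior, batch_posterior)
```

The two printed values agree up to floating-point rounding, which is why the order in which the conditions are processed does not matter under independence.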

CORRELATION, PEARSON’S CORRELATION COEFFICIENT, AND FISHER’S Z TRANSFORM
When we say that two genes are correlated, we mean that they vary together. But how do we quantify the degree of correlation? Pearson’s r measures the extent to which two random variables are linearly related. A value of 1 indicates a perfect positive correlation (that is, as one variable increases, the other increases proportionally in linear fashion). A value of -1 indicates a perfect negative correlation.

We don't usually know rho, the population correlation, so we use the statistic r to estimate rho and to carry out tests of hypotheses. The most common test is whether rho = 0, that is, whether the observed correlation r is significantly different from zero. A sampling distribution is what you get if you take repeated samples from a population and compute a statistic each time you take a sample. When rho is not 0, the sampling distribution of r is skewed (Oh no! This makes hypothesis testing difficult). Enter Fisher’s z transform: z = (1/2) ln[(1 + r) / (1 - r)] is approximately normal with mean (1/2) ln[(1 + rho) / (1 - rho)] and variance 1/(n-3), where n is the sample size.
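As a rough illustration of how the transform is used, the sketch below builds an approximate confidence interval for rho and a test statistic for a hypothesized value of rho. The function names are invented for this example, the sample values r = 0.6 and n = 30 are hypothetical, and the normal approximation assumes the data are roughly bivariate normal.

```python
import math


def fisher_z(r):
    """Fisher's z transform, (1/2) ln[(1 + r) / (1 - r)], which equals atanh(r)."""
    return math.atanh(r)


def correlation_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for rho (z_crit=1.96 gives about 95%)."""
    z = fisher_z(r)
    std_err = 1.0 / math.sqrt(n - 3)
    # Build the interval on the z scale, then map back to the r scale with tanh.
    return math.tanh(z - z_crit * std_err), math.tanh(z + z_crit * std_err)


def z_statistic(r, n, rho0=0.0):
    """Standard-normal test statistic for the null hypothesis rho = rho0."""
    return (fisher_z(r) - fisher_z(rho0)) * math.sqrt(n - 3)


# Hypothetical sample: r = 0.6 measured on n = 30 paired observations.
low, high = correlation_confidence_interval(0.6, 30)
print(low, high, z_statistic(0.6, 30))
```

Because the interval is built on the z scale and mapped back, it is not symmetric around r, which reflects the skew of the sampling distribution.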

ASSESSING QUALITY OF THE PREDICTIVE MODEL
ROC-AUC: The area under the ROC curve (AUC) is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. Q: Why is the blue curve worthless? BUT this approach to evaluating the quality of the model is problematic, because it only tests the ability of the model to match the gold standard, and not its ability to make new predictions.
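The ranking interpretation of the AUC can be computed directly by comparing every positive instance with every negative one. This is a small sketch for intuition rather than an efficient implementation; the function name and the toy scores are made up, and counting ties as half a win is a common convention that is assumed here.

```python
def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half (assumed convention)
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = positive example, 0 = negative example.
scores = [0.9, 0.8, 0.7, 0.3]
labels = [1, 0, 1, 0]
print(auc_by_ranking(scores, labels))  # 0.75
```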

Method 1 (hold-out validation): Divide the gold standard into a “training set” and a “validation set”. Problem: when gold standards are small, too few known relationships remain for network assessment.
Method 2 (k-fold cross-validation): The original sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
Method 3 (blind literature evaluation): Use existing literature. Select genes that are predicted with high probability for follow-up. Combine them with randomly selected genes to create a gene list for evaluation. Assess the literature evidence for the genes on the list.
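The partitioning step of Method 2 can be sketched with the standard library alone. The generator name `k_fold_indices` and the `seed` parameter are assumptions made for this example; when the sample size is not divisible by k, the folds differ in size by at most one.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition of the sample
    folds = [indices[i::k] for i in range(k)]     # k nearly equal-size subsamples
    for i in range(k):
        validation = folds[i]                     # fold i is held out once
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# Example: 10 gold-standard items split into 5 folds.
splits = list(k_fold_indices(10, 5))
for training, validation in splits:
    print(sorted(validation), len(training))
```

A model would be fitted on each `training` list and scored on the matching `validation` list, and the k scores would then be averaged.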