Statistical Methods in Computer Science Data 3: Correlations and Dependencies Ido Dagan.



Statistical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan

Slide 2: Connecting Variables
- So far, we have talked about the data reflected by a single variable
- A common scientific goal is to relate variables to each other:
  - Find out whether a relation exists between the values of the variables
  - Find out the strength of this relation
  - Find out the nature of this relation
- Our focus here: the relation between two variables
  - e.g., the relation between input size and run-time
  - e.g., the relation between time spent coordinating and productivity
  - e.g., the relation between shoe size and reading skills

Slide 3: Paired Samples
- The starting point for our discussion: bivariate data
- Paired samples: for each X, give its corresponding Y
- These paired samples come from the experiment
  - The experiment should record the data in a way that allows the desired pairing
  - Pairing can be implicit, through fields/variables
  - e.g., a test at the beginning of the year and a test at the end of the year: pair by student
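The "pair by student" idea can be sketched in a few lines of Python. This is only an illustration: the student ids, the scores, and the helper name `pair_by_key` are made up here and do not come from the slides. Students missing from either test are dropped, because they cannot form a pair.

```python
# Hypothetical records: student id -> test score (all values are made up).
start_of_year = {"s01": 62, "s02": 75, "s03": 58, "s04": 90}
end_of_year = {"s02": 81, "s01": 70, "s04": 88, "s05": 77}


def pair_by_key(first, second):
    """Return (key, x, y) triples for the keys present in both mappings."""
    shared_keys = sorted(first.keys() & second.keys())
    return [(key, first[key], second[key]) for key in shared_keys]


pairs = pair_by_key(start_of_year, end_of_year)
for student, x, y in pairs:
    print(student, x, y)
```

Here s03 and s05 are left out, since each took only one of the two tests.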

Slide 4: Tools for identifying bivariate relations
- Visualize: scatter diagram (scatter plot)
- Ordered variables:
  - Pearson's correlation coefficient, r_XY (interval and ratio data)
  - Spearman's rank-correlation coefficient, rho (ρ) (ordinal data)
- Categorical variables:
  - Dependency tests (Chi-square, covered in recitation)

Slide 5: Visualization: the X-Y Scatter Plot
- One variable is declared X, the other Y
- Axes of equal length (this makes the relation easier to see)
- Plot the values of X and Y together: for each X, plot the matching Y (or Ys)

Slide 6: Is there a relation?
- We see that, in general, there is some relation here:
  - Lower X => lower Y
  - Higher X => higher Y
- But how can we recognize this systematically?
- Figure from "Statistical Reasoning", Minium, King, and Bear, 1993


Slide 8: Reminder: Variance
- Sum of squares: shorthand for the sum of squared deviations from the mean, $SS = \sum_i (X_i - \bar{X})^2$
- Normalizing the sum of squares for the size of the sample gives the variance of the sample
- The distribution/population variance is denoted by $\sigma^2$ and is defined relative to μ
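A small sketch of the two steps, sum of squares and then normalization. The transcript lost the slide's formula, so the divisor is an assumption: the default below divides by n, and `ddof=1` switches to the n - 1 convention. The scores are made up for illustration.

```python
import statistics


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def variance(values, ddof=0):
    """Sum of squares normalized by (n - ddof).

    ddof=0 divides by n; ddof=1 divides by n - 1.
    Which one the slides use is not recoverable from the transcript.
    """
    return sum_of_squares(values) / (len(values) - ddof)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up scores with mean 5
print(sum_of_squares(scores))      # 32.0
print(variance(scores))            # 4.0
print(variance(scores, ddof=1))    # 32 / 7
```

The two conventions correspond to `statistics.pvariance` and `statistics.variance` in the Python standard library.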

Slide 9: Covariance
- Positive correlation: lower X => lower Y
- Negative correlation: lower X => higher Y
- How do we transform this into a measure?
- Intuition: multiply the paired deviations from the means, and sum the results
  - positive × positive = positive; negative × negative = positive; ...
  - big × small = small; big × big = big
- The sign of the covariance is determined by the accumulated products from points in the 1st and 3rd quadrants (relative to the means) versus those in the 2nd and 4th
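The intuition above can be sketched as follows. The formula itself is missing from the transcript, so this assumes the common definition that averages the products of paired deviations over n; the data are made up.

```python
def covariance(xs, ys):
    """Mean product of paired deviations from the means (divides by n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


xs = [1, 2, 3, 4, 5]
ys_rising = [1, 3, 2, 5, 4]    # tends to grow with x
ys_falling = [4, 5, 2, 3, 1]   # tends to shrink as x grows

cov_rising = covariance(xs, ys_rising)     # positive
cov_falling = covariance(xs, ys_falling)   # negative
# Rescaling x rescales the covariance, although the relation is unchanged.
cov_rescaled = covariance([100 * x for x in xs], ys_rising)
print(cov_rising, cov_falling, cov_rescaled)
```

The last value illustrates why the raw size of a covariance is hard to interpret: it depends on the units of X and Y.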

Slide 10: From Covariance to Correlation
- A big positive Cov(X,Y) means that X and Y grow together
- A big negative Cov(X,Y) means that Y shrinks as X grows
- Problem: how big is big? This depends on the values of X and Y
  - For instance: a large x (100000) multiplied by a small y ( ), compared with a case where both x and y are among the largest values
- Solution: Pearson's correlation coefficient r_XY (or simply r)
  - 1.0: perfect positive correlation
  - -1.0: perfect negative correlation
  - 0: no linear correlation

Slide 11: Reminder: z Scores
- Key idea: express all values in units of standard deviation
- This allows comparison of values from different distributions
  - But only if the shapes of the distributions are similar
- Example usage: sequence mining
  - We find the most frequent sequences of any length k
  - What are the most frequent sequences in the entire DB?
  - This is difficult to answer: there are more short sequences than long ones
  - This can be solved by transforming the frequency counts into their z scores
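A minimal sketch of the sequence-mining example, assuming z scores are computed separately within each sequence length and that the standard deviation divides by n. The frequency counts are invented for illustration.

```python
import math


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std_dev for v in values]


# Hypothetical frequency counts, grouped by sequence length.
counts_length_2 = [10, 20, 30]   # short sequences occur often
counts_length_5 = [1, 2, 3]      # long sequences occur rarely

z_short = z_scores(counts_length_2)
z_long = z_scores(counts_length_5)
print(z_short)
print(z_long)
```

The raw counts 30 and 3 are not comparable, but both are the same number of standard deviations above the mean for their own length, so they rank equally once converted.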

Slide 12: Formulas for r
- z-score based formula
- Deviation-score based formula (equivalent)
- S_k denotes the standard deviation of variable k
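The formulas on this slide did not survive in the transcript. The sketch below uses the standard textbook forms, which may differ in notation from the slide: r = sum(z_X * z_Y) / n, and r = sum((X - mean_X) * (Y - mean_Y)) / (n * S_X * S_Y), where S_k divides by n. With an n - 1 convention for S_k, n would be replaced by n - 1 in both; r itself comes out the same. Function names and data are made up.

```python
import math


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Standard deviation S_k, dividing by n (an assumption, see above)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def r_from_z_scores(xs, ys):
    """z-score form: average product of paired z scores."""
    mx, my, sx, sy = mean(xs), mean(ys), std_dev(xs), std_dev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / len(xs)


def r_from_deviations(xs, ys):
    """Deviation-score form: summed deviation products over n * S_x * S_y."""
    mx, my = mean(xs), mean(ys)
    total = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return total / (len(xs) * std_dev(xs) * std_dev(ys))


xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]
print(r_from_z_scores(xs, ys), r_from_deviations(xs, ys))
```

Both forms give the same value for this data, which is the equivalence the slide states.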

Slides 13-14: Warning about misleading curves
- Using r is no substitute for visualization. Always visualize!
- r is good for linear relationships
- r = +0.82 (figures from Anscombe, 1973)
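To make the warning concrete, the sketch below computes r for the four data sets published in Anscombe (1973). Whether the slides plot all four sets is not recoverable from the transcript; the values are the widely reproduced published ones, and `pearson_r` is a helper written for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from summed deviation products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I-III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson_r(xs, ys) for name, (xs, ys) in quartet.items()}
for name, r in r_values.items():
    print(name, round(r, 2))
```

All four sets round to the same r, even though one is roughly linear, one is a smooth curve, and two are dominated by a single outlying point. Only a plot reveals the difference.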

Slide 15: Correlation and Transformations
- The mean changes with additions; the standard deviation does not
  - Raise all scores by 10 ==> the mean rises by 10, the standard deviation is unchanged
- The mean changes with multiplications; the standard deviation does too
  - Multiply all scores by 10 ==> the mean and the standard deviation are both multiplied by 10
- Pearson's r is not affected by a linear transformation of X and/or Y with a positive multiplier (a negative multiplier on one variable only flips the sign of r)
  - Adding = translating the points
  - Multiplying = scaling
  - Neither affects the relation between the variables
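A quick numerical check of these claims, using made-up scores. The helper `pearson_r` is written for this example, and the standard deviation used is the divide-by-n version from the standard library (`statistics.pstdev`).

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson's r from summed deviation products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


scores = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [s + 10 for s in scores]   # mean moves by 10, spread unchanged
scaled = [s * 10 for s in scores]    # mean and spread both multiplied by 10
print(statistics.mean(shifted), statistics.pstdev(shifted))
print(statistics.mean(scaled), statistics.pstdev(scaled))

xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 5, 4]
r_original = pearson_r(xs, ys)
r_transformed = pearson_r([x + 10 for x in xs], [10 * y for y in ys])
r_flipped = pearson_r(xs, [-2 * y + 3 for y in ys])   # negative multiplier
print(r_original, r_transformed, r_flipped)
```

Translation and positive scaling leave r unchanged; multiplying one variable by a negative number keeps the magnitude and flips the sign.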

Slide 16: Interpreting Correlation
- Always visualize!
- Pearson's coefficient is only appropriate for linear relationships
  - r measures how closely the points "hug" a straight line
  - Other measures exist for non-linear relations (Spearman's, eta)
- r is sensitive to the range of values within the target population
  - Smaller range => smaller r: differences in values are less meaningful
  - e.g., the correlation between age and math skills for a small age range
- A large absolute r is not necessarily indicative of significance
  - r is subject to sampling variation: it may change from sample to sample, and significance depends on the sample size
  - We will address the significance test of r later
- r is affected by the way a phenomenon is measured (e.g., grades on different types of scales, such as letter grades A, B, ...)
- => Report the specific conditions of the correlation measurements, and test again under different conditions to see whether the variables are still correlated
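The range effect can be illustrated with a deterministic toy data set. Everything here is invented: y follows x plus a fixed alternating disturbance of +10 and -10, and r is computed once over the full range of x and once over a narrow slice of it.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from summed deviation products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Toy data: y tracks x, with an alternating +10 / -10 disturbance.
xs = list(range(100))
ys = [x + (10 if x % 2 == 0 else -10) for x in xs]

r_full = pearson_r(xs, ys)                   # x spans 0..99
r_narrow = pearson_r(xs[40:60], ys[40:60])   # x spans only 40..59
print(round(r_full, 2), round(r_narrow, 2))
```

The underlying relation is identical in both cases, but over the narrow range the disturbance is large relative to the spread of x, so r drops sharply.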

Slide 17: Correlation and Causation
- IMPORTANT: correlation is not causation!
- Examples of positive correlations:
  - Grip strength and mathematical skills
  - Shoe size and reading level
  - ...
- But shoe size does not cause reading level! The results are for children aged 6-13.

Slide 18: Possible Explanations
- Two correlated variables may be:
  - Causally related (one causes the other)
  - Affected by the same third variable (which causes both; a control variable)
- Two uncorrelated variables (according to r) may be:
  - Correlated in a highly non-linear fashion (always visualize!)
  - e.g., a circle around 0 (balanced in all quadrants)
- There are specific ways to address these cases:
  - Partial correlation: the correlation of a and b, given c
  - Manipulation controls (experiment design): e.g., measure grip strength vs. math skill separately in different age groups
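The circle example can be checked numerically. In this sketch the points are placed at twelve equally spaced angles on a unit circle around 0 (the number of points is an arbitrary choice), so x and y are tightly related by x² + y² = 1, yet r comes out as zero.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from summed deviation products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


n_points = 12
angles = [2 * math.pi * k / n_points for k in range(n_points)]
xs = [math.cos(a) for a in angles]
ys = [math.sin(a) for a in angles]

r_circle = pearson_r(xs, ys)
print(abs(r_circle) < 1e-9)   # True: the quadrant products cancel out
```

The positive products from the 1st and 3rd quadrants are exactly offset by the negative products from the 2nd and 4th, which is what "balanced in all quadrants" means.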

Slide 19: Partial Correlation
- A test for correlation between a and b, given c
- Intuitively: the correlation between a and b that remains after neutralizing their correlation with c
- For instance ("Empirical Methods in AI", Cohen 1995)
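The slide's worked example is not in the transcript. As a stand-in, the sketch below uses the standard first-order partial correlation formula, r_ab.c = (r_ab - r_ac * r_bc) / sqrt((1 - r_ac²)(1 - r_bc²)), which is assumed rather than taken from the slides. The data are synthetic and deliberately constructed so that age drives both other variables.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from summed deviation products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation of a and b after removing what each shares with c."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


# Synthetic data: both measures equal age plus an unrelated +1 / -1 offset.
age = [6, 6, 6, 6, 12, 12, 12, 12]
shoe_size = [7, 7, 5, 5, 13, 13, 11, 11]
reading_level = [7, 5, 7, 5, 13, 11, 13, 11]

r_ab = pearson_r(shoe_size, reading_level)
r_ac = pearson_r(shoe_size, age)
r_bc = pearson_r(reading_level, age)
r_ab_given_c = partial_correlation(r_ab, r_ac, r_bc)
print(round(r_ab, 2), round(r_ab_given_c, 2))
```

Shoe size and reading level correlate strongly on their own, but once age is accounted for nothing remains, which matches the "third variable" explanation.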

Slide 20: Visualize as well
- Figure from "Empirical Methods in AI", Cohen 1995

Slide 21: Correlation for ordinal variables
- Pearson's coefficient is intended for ratio and interval data
- Ordinal data cannot be used as is
  - Here, the size of the difference between successive values is meaningless
  - Only direction matters (above or below)
- Examples:
  - Correlation between the military rank of career soldiers and the time they have been in the army
  - Correlation between user and system ranking of search results
- Spearman's rank correlation (rho, ρ) addresses this

Slide 22: Spearman's rho: Step 1
- First step: transform all scores to ranks
  - First = 1, second = 2, ...
  - Ties: replace with the average of the intended ranks
- For instance, for ordinal data:
  - X = Private, Sgt., Sgt., Lt., Capt., Capt., Capt., Maj., Col., Col., General
  - X rank = 1, 2.5, 2.5, 4, 6, 6, 6, 8, 9.5, 9.5, 11, where 2.5 = (2+3)/2, 6 = (5+6+7)/3, and 9.5 = (9+10)/2
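The ranking step, including averaged ties, can be sketched as below. The function name `average_ranks` and the numeric encoding of the military titles through `RANK_ORDER` are choices made for this example; only the list of titles comes from the slide.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all positions holding the same value.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start..end correspond to 1-based ranks start+1..end+1.
        shared_rank = (start + 1 + end + 1) / 2
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


# Ordinal scale encoded by position (lowest first); the encoding is hypothetical.
RANK_ORDER = ["Private", "Sgt.", "Lt.", "Capt.", "Maj.", "Col.", "General"]
titles = ["Private", "Sgt.", "Sgt.", "Lt.", "Capt.", "Capt.", "Capt.",
          "Maj.", "Col.", "Col.", "General"]

x_ranks = average_ranks([RANK_ORDER.index(t) for t in titles])
print(x_ranks)
```

Only the order of the encoded levels matters here; any increasing encoding of the titles would produce the same ranks.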

Slide 23: Calculating rho: Step 2
- In general, rho ranges in [-1, 1]
- With no ties, we can simply use Pearson's r on the ranks, with identical results
- rho may also be useful (in addition to r) for data of numerical scores, when we do not trust the scale properties of the scores and the rank is what really matters
  - e.g., the correlation between user and system relevance scores for the ranked pages in search results
- "Debugging" note: the sum of all ranks (for X and for Y) should be n(n+1)/2; this is maintained for averaged ties
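The rho formula on this slide is missing from the transcript. The sketch below assumes the standard shortcut, rho = 1 - 6 * sum(D²) / (n(n² - 1)), where D is the difference between paired ranks, and compares it with Pearson's r computed on the same ranks. The two rankings are made up and contain no ties; with ties, the shortcut is only approximate and Pearson's r on the averaged ranks is the safer route.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from summed deviation products."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho_no_ties(rank_x, rank_y):
    """Shortcut formula on rank differences; exact only when there are no ties."""
    n = len(rank_x)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Hypothetical rankings of five search results by a user and by a system.
user_ranks = [1, 2, 3, 4, 5]
system_ranks = [2, 1, 4, 3, 5]

# "Debugging" check: each set of ranks must sum to n(n+1)/2.
n = len(user_ranks)
expected_sum = n * (n + 1) / 2

rho_shortcut = spearman_rho_no_ties(user_ranks, system_ranks)
rho_pearson = pearson_r(user_ranks, system_ranks)
print(rho_shortcut, rho_pearson)
```

Both routes agree for this tie-free example, as the slide states.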