Statistical Methods in Computer Science Data 3: Correlations and Dependencies Ido Dagan.


1 Statistical Methods in Computer Science
Data 3: Correlations and Dependencies
Ido Dagan

2 Connecting Variables
Statistical Methods in Computer Science © 2006-now Gal Kaminka / Ido Dagan
- So far: we talked about the data reflected by a single variable
- Common scientific goal: relate variables to each other
  - Find out whether a relation exists between the values of variables
  - Find out the strength of this relation
  - Find out the nature of this relation
- Our focus here: the relation between two variables
  - e.g., the relation between input size and run-time
  - e.g., the relation between time spent coordinating and productivity
  - e.g., the relation between shoe size and reading skills

3 Paired Samples
- The starting point for our discussion: bi-variate data
- Paired samples: for each X, give its corresponding Y
- These paired samples come from the experiment
  - The experiment should record the data in a way that allows the desired pairing
  - Pairing can be implicit, through fields/variables
  - E.g., a test at the beginning of the year and a test at the end of the year: pair by student

4 Tools for identifying bi-variate relations
- Visualize: Scatter Diagram (Scatter Plot)
- Numeric and ordinal variables:
  - Pearson's correlation coefficient, r_XY (interval and ratio data)
  - Spearman's rank-correlation coefficient, rho (ρ) (ordinal data)
- Categorical variables:
  - Dependency tests (Chi-Square, in recitation)

5 Visualization: the X-Y Scatter Plot
- One variable is declared X, the other Y
- Axes of equal length (makes the relation easier to see)
- Plot the values of X and Y together
  - For each X, plot the matching Y (or Ys)

6 Is there a relation?
- We see that, in general, there is some relation here:
  - Lower X => lower Y
  - Higher X => higher Y
- But how can we recognize this systematically?
Figure from “Statistical Reasoning”, Minium, King, and Bear, 1993

8 Reminder: Variance
- Sum of squares: shorthand for the sum of squared deviations from the mean, $SS = \sum_i (X_i - \bar{X})^2$
- Normalizing the sum of squares for the size of the sample gives the variance of the sample
- The distribution/population variance is denoted by $\sigma^2$, defined relative to μ
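
As a rough illustration of the reminder above, the following Python sketch computes the sum of squares and then normalizes it by the sample size. The slide's own normalization formula did not survive in the transcript, so both common divisors (n and n - 1) are shown as an assumption; the function names and the sample scores are invented for this example.

```python
def sum_of_squares(xs):
    """Sum of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def variance(xs, ddof=0):
    """Sum of squares normalized by sample size.

    ddof=0 divides by n; ddof=1 divides by n - 1.
    Which divisor the original slide used is an assumption here.
    """
    return sum_of_squares(xs) / (len(xs) - ddof)


# Hypothetical scores, chosen so that the mean is exactly 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print("SS          =", sum_of_squares(scores))
print("SS / n      =", variance(scores))
print("SS / (n-1)  =", variance(scores, ddof=1))
```

The two divisors give close results for large samples and differ noticeably only for small ones.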

9 Covariance
- Positive correlation: lower X => lower Y
- Negative correlation: lower X => higher Y
- How do we transform this into a measure?
- Intuition: multiply the paired deviations from the means, and sum the results
  - positive × positive = positive; negative × negative = positive; ...
  - The sign of the covariance is determined by the accumulated values from points in the 1st and 3rd quadrants vs. the 2nd and 4th
  - big × small = small; big × big = big
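
The intuition above can be sketched in a few lines of Python: subtract each variable's mean, multiply the paired deviations, and sum. This is a minimal sketch, assuming the sum is normalized by n (the slide's exact divisor is not recoverable); the data values and variable names are hypothetical.

```python
def covariance(xs, ys):
    """Mean of the products of paired deviations from the means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Same-sign deviations (1st/3rd quadrant around the means) add to the sum;
    # opposite-sign deviations (2nd/4th quadrant) subtract from it.
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


# Hypothetical paired samples.
input_size = [1, 2, 3, 4, 5]
run_time = [2, 4, 5, 4, 5]      # tends to grow with input_size
idle_time = [5, 4, 5, 4, 2]     # tends to shrink as input_size grows

print("cov(size, run_time)  =", covariance(input_size, run_time))
print("cov(size, idle_time) =", covariance(input_size, idle_time))
```

The first covariance comes out positive and the second negative, matching the quadrant argument.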

10 From Covariance to Correlation
- Big positive Cov(X,Y) means that X and Y grow together
- Big negative Cov(X,Y) means that as one grows, the other shrinks
- Problem: how big is big? This depends on the values of X and Y
  - For instance: what if a large x (100000) is multiplied by a small y (0.00001), where both x and y are the largest values of their variables?
- Solution: Pearson's correlation coefficient r_XY (or simply, r):
  - 1.0: perfect positive correlation
  - -1.0: perfect negative correlation
  - 0: no (linear) correlation

11 Reminder: z Scores
- Key idea: express all values in units of standard deviation
- This allows comparison of values from different distributions
  - But only if the shapes of the distributions are similar
- Example usage: sequence mining
  - We find the most frequent sequences of any length k
  - What are the most frequent sequences in the entire DB?
  - This is difficult to answer: there are more short sequences than long ones
  - This can be solved by transforming the frequency counts into their z scores
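
To make the sequence-mining example concrete, here is a small Python sketch that converts frequency counts into z scores separately for each sequence length. The counts are invented, and dividing by n inside the standard deviation is an assumption of this sketch rather than something stated on the slide.

```python
from math import sqrt


def z_scores(xs):
    """Express each value as its distance from the mean in std-dev units."""
    n = len(xs)
    m = sum(xs) / n
    sd = sqrt(sum((x - m) ** 2 for x in xs) / n)
    return [(x - m) / sd for x in xs]


# Hypothetical frequency counts: short sequences occur far more often.
counts_len2 = [900, 500, 400, 200]
counts_len5 = [30, 12, 10, 8]

z_len2 = z_scores(counts_len2)
z_len5 = z_scores(counts_len5)

# Raw counts say the top length-2 sequence is "more frequent" (900 vs. 30),
# but relative to its own length group the top length-5 sequence stands out more.
print("top length-2 z score:", round(z_len2[0], 3))
print("top length-5 z score:", round(z_len5[0], 3))
```

Comparing z scores rather than raw counts removes the built-in advantage of short sequences, subject to the slide's caveat that the distributions should have similar shapes.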

12 Formulas for r
- z-score based formula: $r = \frac{1}{n} \sum_i z_{X_i} z_{Y_i}$
- Deviation-score based formula (equivalent): $r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n S_X S_Y}$
  where $n$ is the number of pairs and $S_k$ denotes the standard deviation of variable $k$ (computed with divisor $n$).
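
The equivalence of the two formulas can be checked numerically. The sketch below assumes the standard textbook forms (mean product of z scores, and sum of deviation products divided by n times both standard deviations, with the standard deviation computed with divisor n); the function names and data are hypothetical.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Standard deviation with divisor n (an assumption of this sketch)."""
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r_z(xs, ys):
    """z-score based form: average product of paired z scores."""
    mx, my, sx, sy = mean(xs), mean(ys), std(xs), std(ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


def pearson_r_dev(xs, ys):
    """Deviation-score based form."""
    mx, my = mean(xs), mean(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sp / (len(xs) * std(xs) * std(ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print("z-score form:  ", pearson_r_z(xs, ys))
print("deviation form:", pearson_r_dev(xs, ys))
```

Both forms agree up to floating-point rounding, and a perfectly linear data set yields 1.0 or -1.0 depending on the slope.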

13 Warning about misleading curves
- Using r is no substitute for visualization. Always visualize!
- r is good for linear relationships
- r = +0.82
Figure from Anscombe, 1973
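
The warning can be reproduced numerically. The Python sketch below computes r for the four data sets commonly known as Anscombe's quartet; the values are typed in from the widely reproduced 1973 tables and are assumed to be the data behind the slide's figures, which the transcript does not include. All four sets give roughly the same r even though their scatter plots look very different.

```python
from math import sqrt


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ssx * ssy)


# Anscombe's quartet (sets I-III share the same x values).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

r_values = {name: pearson_r(x, y) for name, (x, y) in quartet.items()}
for name, r in r_values.items():
    print(f"set {name}: r = {r:+.2f}")
```

Set II is a smooth curve, set III is a line with one outlier, and set IV has a single point that determines the whole correlation, none of which is visible from r alone.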

15 Correlation and Transformations
- The mean changes with additions, the std dev does not
  - Raise all scores by 10 ==> mean rises by 10, no change to std dev
- The mean changes with multiplications, and the std dev does too
  - Multiply all scores by 10 ==> multiply mean and std dev by 10
- Pearson's r is not affected by a linear transformation of X and/or Y (a negative scale factor applied to one variable only flips the sign of r)
  - Adding = translating the points
  - Multiplying = scaling
  - Neither affects the relation between the variables
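
A quick numerical check of these claims, as a sketch with made-up data: adding a constant leaves the standard deviation alone, multiplying scales it, and r survives both. The last lines also show the one caveat, namely that a negative multiplier on one variable flips the sign of r.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    m = mean(xs)
    return sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sp / (len(xs) * std(xs) * std(ys))


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

shifted = [x + 10 for x in xs]       # translation
scaled = [x * 10 for x in xs]        # scaling
linear = [x * 10 + 3 for x in xs]    # both at once
negated = [x * -2 for x in xs]       # negative scale factor

print("mean:", mean(xs), "->", mean(shifted), "(shift),", mean(scaled), "(scale)")
print("std: ", std(xs), "->", std(shifted), "(shift),", std(scaled), "(scale)")
print("r original:        ", pearson_r(xs, ys))
print("r after 10x + 3:   ", pearson_r(linear, ys))
print("r after -2x:       ", pearson_r(negated, ys))
```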

16 Interpreting Correlation
- Always visualize!
- Pearson's coefficient is only appropriate for linear relationships
  - r measures how closely the points “hug” a straight line
  - Other measures exist for non-linear relations (Spearman's, eta)
- r is sensitive to value ranges within the target population
  - Smaller range => smaller r: differences in values are less meaningful
  - E.g., the correlation between age and math skills for a small age range
- A large absolute r is not necessarily indicative of significance
  - r is subject to sampling variation: it may change from sample to sample, and significance depends on sample size
  - We will address the significance test of r later
- r is affected by the way a phenomenon is measured (e.g., grades on different types of scales: A, B, ... vs. 1-100)
=> Need to report the specific conditions of correlation measurements, and test again under different conditions to see whether the variables are still correlated
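
The range effect is easy to see on synthetic data. In the sketch below, every age from 6 to 13 has two hypothetical children whose math skill is the age plus or minus 2; the numbers are invented purely to illustrate the point and do not come from the slides. Restricting the sample to ages 9-10 keeps the same underlying relation but lowers r sharply.

```python
from math import sqrt


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ssx * ssy)


# Hypothetical (age, skill) pairs: two children per age, skill = age +/- 2.
pairs = [(age, age + offset) for age in range(6, 14) for offset in (2, -2)]

# Same population, but sampled only from a narrow age range.
narrow = [(age, skill) for age, skill in pairs if 9 <= age <= 10]

r_full = pearson_r([a for a, _ in pairs], [s for _, s in pairs])
r_narrow = pearson_r([a for a, _ in narrow], [s for _, s in narrow])

print("ages 6-13: r =", round(r_full, 3))
print("ages 9-10: r =", round(r_narrow, 3))
```

In the narrow sample the spread in age is small compared with the plus-or-minus 2 scatter in skill, so the points no longer hug a line.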

17 Correlation and Causation
- IMPORTANT: Correlation is not causation!
- Examples of positive correlations:
  - Grip strength and mathematical skills
  - Shoe size and reading level
  - ...
- But shoe size does not cause reading level! The results are from kids aged 6-13, where age affects both.

18 Possible Explanations
- Two correlated variables may be:
  - Causally related (one causes the other)
  - Affected by the same third variable (a control variable that causes both)
- Two uncorrelated variables (according to r) may be:
  - Correlated in a highly non-linear fashion (always visualize!)
  - E.g., a circle around 0 (balanced in all quadrants)
- There are specific ways to address these cases
  - Example: partial correlation
    - Correlation of a and b, given c
  - Example: manipulation controls (experiment design)
    - E.g., measure grip strength vs. math skill separately in different age groups
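
The circle example can be verified directly. This sketch places a dozen points evenly on a unit circle around the origin (the number of points is an arbitrary choice) and computes r: the variables are perfectly related through x² + y² = 1, yet r is zero up to rounding error because the products of deviations cancel across the four quadrants.

```python
from math import cos, sin, pi, sqrt


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ssx * ssy)


n_points = 12  # arbitrary; any evenly spaced set of 3 or more points works
angles = [2 * pi * k / n_points for k in range(n_points)]
xs = [cos(a) for a in angles]
ys = [sin(a) for a in angles]

r_circle = pearson_r(xs, ys)
print("r on the circle:", r_circle)  # zero up to floating-point rounding
```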

19 Partial Correlation
- A test for correlation between a and b, given c
  - Intuitively: the correlation between a and b that remains after neutralizing their correlation with c
- For instance (“Empirical Methods in AI”, Cohen 1995)
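
The worked example from Cohen did not survive in the transcript, so the sketch below uses the standard first-order partial correlation formula, r_ab.c = (r_ab - r_ac r_bc) / sqrt((1 - r_ac²)(1 - r_bc²)), with invented correlation values for the shoe size, reading level, and age example. Both the function name and the numbers are assumptions made for illustration.

```python
from math import sqrt


def partial_correlation(r_ab, r_ac, r_bc):
    """Correlation between a and b after removing what each shares with c."""
    denom = sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))
    if denom == 0:
        raise ValueError("a or b is perfectly correlated with c")
    return (r_ab - r_ac * r_bc) / denom


# Hypothetical pairwise correlations (not taken from any real study).
r_shoe_reading = 0.80
r_shoe_age = 0.90
r_reading_age = 0.85

r_partial = partial_correlation(r_shoe_reading, r_shoe_age, r_reading_age)
print("shoe size vs. reading level:           ", r_shoe_reading)
print("shoe size vs. reading level, given age:", round(r_partial, 3))
```

With these numbers most of the apparent relation disappears once age is accounted for, which is the pattern the third-variable explanation predicts.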

20 Visualize as well
Figure from “Empirical Methods in AI”, Cohen 1995

21 Correlation for ordinal variables
- Pearson's coefficient is intended for ratio and interval data
- Ordinal data cannot be used as is
  - Here, the difference between consecutive values is meaningless
  - Only direction matters (above or below)
- Examples:
  - Correlation between the military rank of career soldiers and the time they have been in the army
  - Correlation between user and system rankings of search results
- Spearman's rank-correlation (rho, ρ) addresses this

22 Spearman's rho: Step 1
- First step: transform all scores to ranks
  - First = 1, Second = 2, ...
  - Ties: replace with the average of the intended ranks
- For instance, for ordinal data:
  X        Private  Sgt.  Sgt.  Lt.  Capt.  Capt.  Capt.  Maj.  Col.  Col.  General
  X rank   1        2.5   2.5   4    6      6      6      8     9.5   9.5   11
  Tied ranks: Sgt. = (2+3)/2, Capt. = (5+6+7)/3, Col. = (9+10)/2
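
The ranking step can be sketched in Python as follows. The function name, the numeric encoding of the military ranks, and the `RANK_ORDER` list are all assumptions introduced for this example; only the input sequence and the expected ranks come from the slide.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = ((i + 1) + (j + 1)) / 2  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


# Hypothetical ordinal encoding: a lower index means a lower military rank.
RANK_ORDER = ["Private", "Sgt.", "Lt.", "Capt.", "Maj.", "Col.", "General"]

soldiers = ["Private", "Sgt.", "Sgt.", "Lt.", "Capt.", "Capt.", "Capt.",
            "Maj.", "Col.", "Col.", "General"]
levels = [RANK_ORDER.index(s) for s in soldiers]

x_rank = average_ranks(levels)
print(x_rank)
# [1.0, 2.5, 2.5, 4.0, 6.0, 6.0, 6.0, 8.0, 9.5, 9.5, 11.0]
```

Averaging keeps the total of the ranks the same as it would be without ties, which is useful as a sanity check.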

23 Calculating rho: Step 2
- Generally: $\rho = 1 - \frac{6 \sum_i D_i^2}{n(n^2 - 1)}$, where $D_i$ is the difference between the two ranks of pair $i$
- Ranges in [-1,1]
- With no ties, can simply use Pearson's r on the ranks, with identical results
- May be useful (in addition to r) also for data of numerical scores, when we don't trust the scale properties of the scores and rank really matters
  - E.g., the correlation between user and system relevance scores for the ranked pages in search results
- “Debugging” note:
  - The rank sum is maintained for averaged ties: the sum of all ranks (for X and for Y) = n(n+1)/2
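
The formula on this slide is missing from the transcript; the sketch below assumes the usual rank-difference form, rho = 1 - 6 ΣD² / (n(n² - 1)), where D is the difference between the two ranks of a pair. It compares that form with Pearson's r computed on the same ranks, using a small tie-free pair of rankings invented for the example, and runs the rank-sum sanity check.

```python
from math import sqrt


def spearman_rho_from_ranks(rank_x, rank_y):
    """Rank-difference form; exact only when there are no ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return sp / sqrt(ssx * ssy)


# Hypothetical rankings of five search results (no ties).
user_rank = [1, 2, 3, 4, 5]
system_rank = [2, 1, 4, 3, 5]

rho = spearman_rho_from_ranks(user_rank, system_rank)
r_on_ranks = pearson_r(user_rank, system_rank)

print("rho (rank-difference form):", rho)
print("Pearson's r on the ranks:  ", r_on_ranks)

# "Debugging" check: each list of ranks must sum to n(n+1)/2.
n = len(user_rank)
print("rank sums ok:", sum(user_rank) == sum(system_rank) == n * (n + 1) / 2)
```

When ties are present, computing Pearson's r on the averaged ranks is the safer route, since the rank-difference shortcut is only exact without ties.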

