CS433: Modeling and Simulation Dr. Anis Koubâa Al-Imam Mohammad bin Saud University 15 October 2010 Lecture 05: Statistical Analysis Tools.

1 CS433: Modeling and Simulation Dr. Anis Koubâa Al-Imam Mohammad bin Saud University 15 October 2010 Lecture 05: Statistical Analysis Tools

2 Textbook Reading
• Section 7.5: Goodness-of-Fit Tests for Distributions, page 134
• 7.5.1 Chi-Square Test, page 134
• 7.5.2 Kolmogorov-Smirnov (K-S) Test, page 137
• Section 3.6: Correlation, page 32

3 Goals of Today
• Know how to compare two distributions
• Know how to evaluate the relationship between two random variables

4 Outline
• Comparing Distributions: Tests for Goodness-of-Fit
  • Chi-Square Test (for discrete models: PMF)
  • Kolmogorov-Smirnov Test (for continuous models: CDF)
• Evaluating the Relationship
  • Linear Regression
  • Correlation

5 Goodness-of-Fit
• Statistical tests make it possible to compare two distributions; such a comparison is known as a goodness-of-fit test.
• The goodness-of-fit of a statistical model describes how well it fits a set of observations.
• Measures of goodness-of-fit typically summarize the discrepancy between the observed values and the values expected under the model in question.
• In Arabic: جودة المطابقة ("quality of fit").

6 Pearson's χ² Tests: Chi-Square Tests for Discrete Models
Pearson's chi-square test compares the probability mass functions of two distributions. If the difference value (error) is greater than the critical value, the two distributions are said to be different, that is, the first distribution does not fit the second well. If the difference is smaller than the critical value, the first distribution fits the second well.

7 (Pearson's) Chi-Square Test
• Pearson's chi-square is used to assess two types of comparison:
  • Tests of goodness of fit: establish whether or not an observed frequency distribution differs from a theoretical distribution.
  • Tests of independence: assess whether paired observations on two variables are independent of each other. For example, whether people from different regions differ in the frequency with which they report that they support a political candidate.
• If the chi-square probability (p-value) is less than or equal to 0.05, we reject the null hypothesis and conclude that
  • the observed distribution differs from the theoretical one (goodness-of-fit), or
  • the row variable is related to the column variable (test of independence).
• If the probability is greater than 0.05, we do not reject the hypothesis that the distributions are equal, or that the row variable is unrelated (that is, only randomly related) to the column variable.

8 Chi-Square Distribution http://en.wikipedia.org/wiki/Chi-square_distribution

9 Chi-Square Distribution http://en.wikipedia.org/wiki/Chi-square_distribution

10 (Pearson's) Chi-Square Test
• The chi-square test can, in general, be used to check whether an empirical distribution follows a specific theoretical distribution.
• Chi-square is calculated by finding the difference between each observed (O) and theoretical or expected (E) frequency for each possible outcome, squaring it, dividing by the theoretical frequency, and summing the results.
• For n possible outcomes, the chi-square statistic is defined as:
  $\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$
  • O_i = the observed frequency for outcome i;
  • E_i = the expected (theoretical) frequency for outcome i;
  • n = the number of possible outcomes of each event.
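The definition above translates almost directly into code. The following is a minimal sketch, not part of the original lecture; the function name `chi_square_statistic` and the die-roll counts are hypothetical and chosen only for illustration.

```python
def chi_square_statistic(observed, expected):
    """Return the sum over all outcomes of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die; a fair die expects 10 rolls per face.
observed = [8, 12, 9, 11, 10, 10]
expected = [10.0] * 6

statistic = chi_square_statistic(observed, expected)
print(f"chi-square statistic = {statistic:.3f}")
```

The statistic on its own says nothing about fit; it has to be compared with a critical value that depends on the degrees of freedom.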

11 (Pearson's) Chi-Square Test
A chi-square probability (p-value) of 0.05 or less is the usual criterion for rejecting the hypothesis that the empirical and theoretical distributions agree.

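12 Chi-Square Test: General Algorithm
• We say that the observed (empirical) distribution fits the expected (theoretical) distribution well if:
  $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} < \chi^2_{1-\alpha,\,k-1-c}$
  where the right-hand side is the (1 – α) quantile of the chi-square distribution with (k – 1 – c) degrees of freedom.
• (k – 1 – c) is the number of degrees of freedom, where
  • k is the number of possible outcomes (denoted n in the definition of the statistic) and
  • c is the number of estimated parameters.
• 1 – α is the confidence level (typically, α = 0.05).
http://en.wikipedia.org/wiki/Inverse-chi-square_distribution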
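The algorithm can be sketched as follows for α = 0.05. This is an illustrative sketch under stated assumptions: the critical values are the standard tabulated upper 5% points of the chi-square distribution for 1 to 10 degrees of freedom, the function name `chi_square_fit` is hypothetical, and the digit counts are invented for the example (they are not the data behind the lecture's own example).

```python
# Standard tabulated upper 5% critical values of the chi-square
# distribution, keyed by degrees of freedom (1 to 10 only).
CHI2_CRITICAL_05 = {
    1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
    6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307,
}


def chi_square_fit(observed, expected, estimated_params=0):
    """Return (statistic, critical value, fits) for a test at alpha = 0.05."""
    k = len(observed)                      # number of possible outcomes
    dof = k - 1 - estimated_params         # degrees of freedom: k - 1 - c
    if dof not in CHI2_CRITICAL_05:
        raise ValueError(f"no tabulated critical value for {dof} degrees of freedom")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    critical = CHI2_CRITICAL_05[dof]
    return statistic, critical, statistic < critical


# Hypothetical counts of the digits 0..9 among 100 generated digits,
# tested against a uniform distribution (10 expected per digit, c = 0).
observed = [9, 11, 10, 8, 12, 10, 9, 11, 10, 10]
expected = [10.0] * 10

statistic, critical, fits = chi_square_fit(observed, expected)
print(f"chi2 = {statistic:.2f}, critical = {critical}, fits = {fits}")
```

With no estimated parameters and ten outcomes there are 9 degrees of freedom, so the statistic is compared with 16.919.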

13 Chi-Square Test: Example
Uniform distribution on [0..9]. Result: PASS

14 (KS-Test) Kolmogorov–Smirnov Test for Continuous Models
• In statistics, the Kolmogorov–Smirnov test (K–S test) quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the expected distribution, or between the empirical distribution functions of two samples.
• It is designed for continuous models; it can also be applied to discrete models, but the standard critical values are then conservative.
• Basic idea: compute the maximum distance between two cumulative distribution functions and compare it to a critical value.
  • If the maximum distance is smaller than the critical value, the first distribution fits the second distribution.
  • If the maximum distance is greater than the critical value, the first distribution does not fit the second distribution.

15 Kolmogorov–Smirnov Test
• In statistics, the Kolmogorov–Smirnov test is used to determine
  • whether two underlying one-dimensional probability distributions differ, or
  • whether an underlying probability distribution differs from a hypothesized distribution,
  in either case based on finite samples.
• The Kolmogorov–Smirnov test statistic measures the largest vertical distance between an empirical CDF calculated from a data set and a theoretical CDF.
• The one-sample K-S test compares the empirical distribution function with a cumulative distribution function.
• The main applications are testing goodness-of-fit with the normal and uniform distributions.

16 Kolmogorov–Smirnov Statistic
• Let X_1, X_2, …, X_n be iid random variables in ℝ with CDF F(x).
• The empirical distribution function F_n(x) based on the sample X_1, X_2, …, X_n is a step function defined by:
  $F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i \le x\}$
• The Kolmogorov–Smirnov test statistic for a given function F(x) is:
  $D_n = \sup_x |F_n(x) - F(x)|$
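A small sketch of the empirical distribution function and the K-S statistic, assuming a continuous hypothesized CDF. The function names and the three-point sample are hypothetical. Because the empirical CDF is a step function, the supremum is reached at a sample point, either just before or just after the jump, so only those positions need to be checked.

```python
def empirical_cdf(sample, x):
    """F_n(x): fraction of sample values less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)


def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F(x)| for a continuous hypothesized CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # F_n jumps from i/n to (i + 1)/n at x, so check both sides of the jump.
        distance = max(distance, (i + 1) / n - f, f - i / n)
    return distance


def uniform_cdf(x):
    """CDF of the Uniform(0, 1) distribution."""
    return min(max(x, 0.0), 1.0)


# Hypothetical sample, compared against Uniform(0, 1).
sample = [0.1, 0.4, 0.7]
d_n = ks_statistic(sample, uniform_cdf)
print(f"F_n(0.5) = {empirical_cdf(sample, 0.5):.3f}, D_n = {d_n:.3f}")
```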

17 Kolmogorov–Smirnov Statistic
• The Kolmogorov–Smirnov test statistic for a given function F(x) is:
  $D_n = \sup_x |F_n(x) - F(x)|$
Facts
• By the Glivenko–Cantelli theorem, if the sample comes from the distribution F(x), then D_n converges to 0 almost surely as n grows.
• In other words, if X_1, X_2, …, X_n really come from the distribution with CDF F(x), the distance D_n should be small.
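The convergence stated by the Glivenko–Cantelli theorem can be observed in a quick simulation. This sketch draws samples that really do come from Uniform(0, 1) and measures D_n against the Uniform(0, 1) CDF; the seed, the sample sizes, and the function name are arbitrary choices made for illustration.

```python
import random


def ks_distance_to_uniform(sample):
    """D_n between the sample's empirical CDF and the Uniform(0, 1) CDF F(x) = x."""
    ordered = sorted(sample)
    n = len(ordered)
    # Check the gap on both sides of each jump of the empirical CDF.
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(ordered))


rng = random.Random(433)  # fixed, arbitrary seed so that runs are repeatable

d_small = ks_distance_to_uniform([rng.random() for _ in range(100)])
d_large = ks_distance_to_uniform([rng.random() for _ in range(10_000)])

print(f"n = 100:    D_n = {d_small:.4f}")
print(f"n = 10000:  D_n = {d_large:.4f}")
```

The distance for the larger sample should come out markedly smaller, in line with D_n shrinking roughly like 1/√n.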

18 D max Example

19 Example: Grade Distribution ?
• We would like to know the distribution of the students' grades.
• First, determine the empirical distribution.
• Second, compare it to the Normal and Poisson distributions.
• Data sample: 50 grades in a course, from which the empirical distribution is computed
  • Mean = 63
  • Standard deviation = 15

20 Example: Grade Distribution ?

21 D_max (Poisson) = 0.153; D_max (Normal) = 0.119

22 Kolmogorov–Smirnov Acceptance Criteria
• Rejection criterion: we consider the two distributions to be different if the empirical CDF is too far from the theoretical CDF of the proposed distribution.
• This means: we reject if D_n is too large.
• But the question is: what does "large" mean? For which values of D_n should we accept the distribution?

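23 Kolmogorov–Smirnov Test: Critical Value
• In the 1930s, Kolmogorov and Smirnov showed that
  $\lim_{n \to \infty} P(\sqrt{n}\, D_n \le t) = H(t) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2}$
• So, for large sample sizes, you could assume
  $P(\sqrt{n}\, D_n \le t) \approx H(t)$
• α-level test: find the value of t such that H(t) = 1 – α.
• So, the test is accepted if $\sqrt{n}\, D_n \le t$, that is, if $D_n \le t / \sqrt{n}$.
http://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test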
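To make the α-level recipe concrete, the sketch below evaluates the limiting Kolmogorov distribution H(t) = 1 − 2 Σ (−1)^(i−1) exp(−2 i² t²) by truncating the series, then solves H(t) = 1 − α by bisection. The function names, the truncation at 100 terms, and the bisection bounds are assumptions made for illustration.

```python
import math


def kolmogorov_cdf(t, terms=100):
    """Limiting CDF of sqrt(n) * D_n, with the infinite series truncated."""
    if t <= 0:
        return 0.0
    total = sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
                for i in range(1, terms + 1))
    return 1.0 - 2.0 * total


def critical_t(alpha, lo=0.3, hi=3.0, tol=1e-9):
    """Find t with H(t) = 1 - alpha by bisection (H is increasing in t)."""
    target = 1.0 - alpha
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kolmogorov_cdf(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


t_05 = critical_t(0.05)
print(f"t for alpha = 0.05: {t_05:.3f}")
print(f"critical D_n for n = 50: {t_05 / math.sqrt(50):.3f}")
```

For α = 0.05 the solution is close to 1.36, which is where the commonly quoted large-sample critical value 1.36/√n comes from.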

24 Kolmogorov–Smirnov Test
• For small samples, critical values have been worked out and tabulated, but there is no nice closed-form solution: J. Pomeranz (1973), J. Durbin (1968).
• For large samples, good approximations for n > 40: the critical value is about 1.22/√n for α = 0.10, 1.36/√n for α = 0.05, and 1.63/√n for α = 0.01.

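25 Example: Grade Distribution ?
• For our example, we have n = 50.
• The critical value for α = 0.05 is 1.36/√50 ≈ 0.192.
• Both D_max values (Poisson: 0.153, Normal: 0.119) are below the critical value: ACCEPT.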

26 Example: Grade Distribution ?
• Suppose we get the same distribution for n = 100.
• The critical value for α = 0.05 is 1.36/√100 = 0.136.
• Normal (D_max = 0.119 < 0.136): ACCEPT. Poisson (D_max = 0.153 > 0.136): REJECT.
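The two decisions in the grade example can be reproduced with the large-sample approximation 1.36/√n for α = 0.05, using the D_max values reported on slide 21. The function names below are hypothetical.

```python
import math

# Maximum distances reported for the grade example (slide 21).
D_MAX = {"Normal": 0.119, "Poisson": 0.153}


def ks_critical_value(n, coefficient=1.36):
    """Large-sample (n > 40) critical value; 1.36 corresponds to alpha = 0.05."""
    return coefficient / math.sqrt(n)


def ks_decisions(n):
    """Map each candidate distribution to True (accept) or False (reject)."""
    critical = ks_critical_value(n)
    return {name: d_max < critical for name, d_max in D_MAX.items()}


for n in (50, 100):
    print(f"n = {n}: critical value = {ks_critical_value(n):.3f}, "
          f"accepted = {ks_decisions(n)}")
```

The same distances lead to different conclusions because the critical value shrinks as the sample grows: a gap of 0.153 is plausible with 50 grades but not with 100.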

27 Linear Regression: Least Square Method In statistics, linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called dependent variable, is modeled by a least squares function, called linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line. The results are subject to statistical analysis. http://en.wikipedia.org/wiki/Linear_regression

28 The Method of Least Squares
• The equation of the best-fitting line, ŷ = a + bx, is calculated using a set of n pairs (x_i, y_i).
• We choose our estimates a and b of α and β so that the sum of the squared vertical distances of the points from the line is minimized:
  $SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2$
SSE: Sum of Squares of Errors

29 Least Squares Estimators
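The formulas shown on this slide are not preserved in the transcript. As a sketch, the standard least squares estimators for a line ŷ = a + bx follow from setting the partial derivatives of SSE to zero; the notation S_xx and S_xy is the usual textbook shorthand and is assumed here rather than taken from the slide.

```latex
\begin{align*}
\frac{\partial\,\mathrm{SSE}}{\partial a} &= -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0, \\
\frac{\partial\,\mathrm{SSE}}{\partial b} &= -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0, \\
b &= \frac{S_{xy}}{S_{xx}}, \qquad a = \bar{y} - b\,\bar{x}, \\
S_{xx} &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}.
\end{align*}
```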

30 Example
The table shows the math achievement test scores for a random sample of n = 10 college freshmen, along with their final calculus grades.
Student             1   2   3   4   5   6   7   8   9  10
Math test, x       39  43  21  64  57  47  28  75  34  52
Calculus grade, y  65  78  52  82  92  89  73  98  56  75
Use your calculator to find the sums and sums of squares.

31 Example
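The sums and sums of squares requested on slide 30 can also be computed in a few lines. This sketch fits the least squares line to the math test and calculus grade data using the standard estimators b = S_xy / S_xx and a = ȳ − b·x̄; the function name `least_squares_line` is hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y-hat = a + b * x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx                  # slope
    a = sum_y / n - b * sum_x / n    # intercept: y-bar minus b times x-bar
    return a, b


# Data from slide 30: math test scores (x) and calculus grades (y).
math_scores = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
calculus_grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

a, b = least_squares_line(math_scores, calculus_grades)
print(f"y-hat = {a:.3f} + {b:.4f} x")
```

For these data S_xx = 2474 and S_xy = 1894, which gives a slope of about 0.766 and an intercept of about 40.78.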

32 Correlation Analysis In probability theory and statistics, correlation (often measured as a correlation coefficient) indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation or co-relation refers to the departure of two variables from independence. In this broad sense there are several coefficients, measuring the degree of correlation, adapted to the nature of the data.

33 Correlation Analysis
The strength of the relationship between x and y is measured using the coefficient of correlation:
  $r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}$
where S_xy = Σ(x_i − x̄)(y_i − ȳ), S_xx = Σ(x_i − x̄)², and S_yy = Σ(y_i − ȳ)².
• The sign of r indicates the direction of the relationship.
• r near 0 indicates no linear relationship.
• r near 1 or −1 indicates a strong linear relationship.
• A test of the significance of the correlation coefficient is identical to the test of the slope β.

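34 Example
The table shows the heights and weights of n = 10 randomly selected college football players. (One height value, apparently for player 5, is missing from the transcript and is marked "?".)
Player       1    2    3    4    5    6    7    8    9   10
Height, x   73   71   75   72    ?   75   67   69   71   69
Weight, y  185  175  200  210  190  195  150  170  180  175
Use your calculator to find the sums and sums of squares.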

35 Football Players
r = 0.8261: strong positive correlation. As the player's height increases, so does his weight.
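The coefficient of correlation can be computed directly from the sums of squares. Because the football table in this transcript is missing one height value, the sketch below uses the complete math test and calculus grade data from slide 30 instead; the function name `correlation` is hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson coefficient of correlation r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Data from slide 30: math test scores (x) and calculus grades (y).
math_scores = [39, 43, 21, 64, 57, 47, 28, 75, 34, 52]
calculus_grades = [65, 78, 52, 82, 92, 89, 73, 98, 56, 75]

r = correlation(math_scores, calculus_grades)
print(f"r = {r:.4f}")
```

The result is close to 0.84, a strong positive linear relationship, and its sign matches the sign of the least squares slope for the same data.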

36 Some Correlation Patterns
• r = 0: no correlation
• r = 0.931: strong positive correlation
• r = 1: linear relationship
• r = −0.67: weaker negative correlation

