Data Analytics (BE-2015 Pattern) Unit II Basic Data Analytic Methods

1 Data Analytics (BE-2015 Pattern) Unit II Basic Data Analytic Methods
By Prof. B.A.Khivsara Note: The material used to prepare this presentation has been taken from the internet and is intended only for students' reference, not for commercial use.

2 Syllabus Statistical Methods for Evaluation- Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA. Advanced Analytical Theory and Methods: Clustering- Overview, K-means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

4 Statistical Methods for Evaluation-
Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

5 What is Hypothesis? A hypothesis is an educated guess about something in the world around you. It should be testable, either by experiment or observation. For example: A new medicine you think might work. A way of teaching you think might be better.

6 What is a Hypothesis Statement?
Hypothesis statement will look like this: “If I…(do this to an independent variable)….then (this will happen to the dependent variable).” For example: If I (decrease the amount of water given to herbs) then (the herbs will increase in size). If I (give patients counseling in addition to medication) then (their overall depression scale will decrease).

7 What is Hypothesis Testing?
Hypothesis testing refers to:
1. Making an assumption, called a hypothesis, about a population parameter.
2. Collecting sample data.
3. Calculating a sample statistic.
4. Using the sample statistic to evaluate the hypothesis.

8 Hypothesis Testing :Population & sample

9 Hypothesis Testing
Null hypothesis, H0: State the hypothesized value of the parameter before sampling. This is the assumption we wish to test (or are trying to reject). E.g. µ = 20; there is no difference between Coke and Diet Coke.
Alternative hypothesis, HA: All possible alternatives other than the null hypothesis. E.g. µ ≠ 20, µ > 20, µ < 20; there is a difference between Coke and Diet Coke.

10 Hypothesis Testing Basic concept is to form an assertion and test it with data Common assumption is that there is no difference between samples (default assumption) Statisticians refer to this as the null hypothesis (H0) The alternative hypothesis (HA) is that there is a difference between samples

11 What is the Null & alternate Hypothesis?
The null hypothesis is always the accepted fact. Examples of statements accepted as being true are: DNA is shaped like a double helix. There are 8 planets in the solar system (excluding Pluto). Given a population, the initial (assumed) hypothesis to be tested, H0, is called the null hypothesis. When the null hypothesis is rejected, another hypothesis, H1, called the alternative hypothesis, is adopted in its place.

12 Statistical Methods for Evaluation-
Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

13 mean, variance , standard deviation
Mean (or average) is denoted by: μ if working with a population, X̄ if working with samples.
Variance is denoted by: σ² (for population), s² (for sample).
Standard deviation is denoted by: σX or σ (for population), sX or s (for sample).

14 Mean – is a simple average of given data values
Example: 4, 5, 9, 2, 14, 6. Mean X̄ = (4 + 5 + 9 + 2 + 14 + 6) / 6 = 40/6 ≈ 6.67

15 Variance: a measure of how data points differ from the mean
Data Set 1: 3, 5, 7, 10, 10 Data Set 2: 7, 7, 7, 7, 7 What is the mean of the above data set? Data Set 1: mean = 7 Data Set 2: mean = 7 But we know that the two data sets are not identical! The variance shows how they are different. We want to find a way to represent these two data set numerically.

16 How to Calculate variance?
If we conceptualize the spread of a distribution as the extent to which the values in the distribution differ from the mean and from each other, then a reasonable measure of spread might be the average deviation, or difference, of the values from the mean.

17 How to Calculate variance?
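The average of the squared deviations about the mean is called the variance.
For population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
For sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$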
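The two definitions can be checked on the data sets from the earlier slide. This is a minimal sketch; the function names are illustrative and not part of the slides. It assumes the population version divides by N and the sample version divides by n − 1.

```python
def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean divided by n - 1."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / (len(values) - 1)


data_set_1 = [3, 5, 7, 10, 10]
data_set_2 = [7, 7, 7, 7, 7]

# Both sets have mean 7, but only the first one has any spread.
print(population_variance(data_set_1))  # 38 / 5
print(population_variance(data_set_2))  # no deviation from the mean
print(sample_variance(data_set_1))      # 38 / 4
```

Both data sets share the same mean, so the variance is what separates them numerically.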

18-21 Example 1- Variance
The mean is 35/5 = 7.
Score   X    X − X̄        (X − X̄)²
1       3    3 − 7 = −4    16
2       5    5 − 7 = −2    4
3       7    7 − 7 = 0     0
4       10   10 − 7 = 3    9
5       10   10 − 7 = 3    9
Totals  35                 38

22 Example 1- Variance
Score   X    X − X̄       (X − X̄)²
1       7    7 − 7 = 0    0
2       7    7 − 7 = 0    0
3       7    7 − 7 = 0    0
4       7    7 − 7 = 0    0
5       7    7 − 7 = 0    0
Totals  35                0
Variance = 0/5 = 0

23 Example 2- Variance Dive scores for Mark and Myrna. Which diver was more consistent?

24 Example 2- Variance
Dive    Mark's Score X   X − X̄   (X − X̄)²
1       28               5        25
2       22               −1       1
3       21               −2       4
4       26               3        9
5       18               −5       25
Totals  115                       64
Mark’s Variance = 64 / 5 = 12.8
Myrna’s Variance = 362 / 5 = 72.4
Conclusion: Mark has a lower variance, therefore he is more consistent.

25 standard deviation - a measure of variation of scores about the mean
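The standard deviation can be thought of as the average distance to the mean. A higher standard deviation indicates higher spread, less consistency, and less clustering.
Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$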

26 Example – Standard Deviation
Dive    Mark's Score X   X − X̄   (X − X̄)²
1       28               5        25
2       22               −1       1
3       21               −2       4
4       26               3        9
5       18               −5       25
Totals  115                       64
Mark’s Variance = 64 / 5 = 12.8
Mark’s Standard Deviation for population = √(64/5) = √12.8 ≈ 3.58
Mark’s Standard Deviation for sample = √(64/4) = √16 = 4
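As a cross-check of this example, the sketch below uses Python's standard `statistics` module. `pvariance` and `pstdev` divide by N, while `stdev` divides by n − 1. The variable names are illustrative.

```python
import statistics

mark_scores = [28, 22, 21, 26, 18]

pop_variance = statistics.pvariance(mark_scores)  # sum of squares 64 divided by N = 5
pop_sd = statistics.pstdev(mark_scores)           # square root of the population variance
sample_sd = statistics.stdev(mark_scores)         # square root of 64 / (n - 1)

print(pop_variance, round(pop_sd, 2), sample_sd)
```

The standard deviation is the square root of the variance itself, so the population value is √12.8 rather than √(12.8/5).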

27 Example- Variance & Standard Deviation
You have just measured the heights of your dogs (in mm) The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm. Find out the Mean, the Variance, and the Standard Deviation.

28 Example- Variance & Standard Deviation
Your first step is to find the Mean: Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970/5 = 394

29 Example- Variance & Standard Deviation
Now we calculate each dog's difference from the Mean

30 Example- Variance & Standard Deviation
To calculate the Variance, take each difference, square it, and then average the result:
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
σ² = (42436 + 5776 + 50176 + 1296 + 8836) / 5
σ² = 108520 / 5 = 21704
So the Variance σ² is 21,704

31 Example- Variance & Standard Deviation
And the Standard Deviation is just the square root of the Variance, so: Standard Deviation σ = √21704 ≈ 147 (to the nearest mm)

32 Example- Variance & Standard Deviation
And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean: So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
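A short sketch of this check, assuming the population standard deviation is the one intended (the example divides by 5). The variable names are illustrative.

```python
import statistics

heights_mm = [600, 470, 170, 430, 300]

mean = statistics.mean(heights_mm)   # 1970 / 5
sd = statistics.pstdev(heights_mm)   # population standard deviation, about 147 mm

# A height is "within one standard deviation" if it lies in [mean - sd, mean + sd].
within_one_sd = [h for h in heights_mm if abs(h - mean) <= sd]
outside_one_sd = [h for h in heights_mm if abs(h - mean) > sd]

print(within_one_sd)   # heights treated as normal
print(outside_one_sd)  # extra large or extra small
```

With these five heights, the 600 mm and 170 mm dogs fall outside the one-standard-deviation band around 394 mm.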

33 Difference of means
1. State the hypotheses
2. Formulate an analysis plan
3. Analyze sample data using a hypothesis test
4. Interpret results

34 Hypothesis Testing Procedures
Many more tests exist! (EPI 809 / Spring 2008)

35 Parametric Test Procedures
1. Involve population parameters (mean)
2. Have stringent (strict) assumptions (normality)
3. Examples: Z test, t test, χ² test, F test

36 Nonparametric Test Procedures
1. Do not involve population parameters. Example: probability distributions, independence
2. Data measured on any scale (ratio or interval, ordinal or nominal)
3. Example: Wilcoxon rank sum test

37 Parametric Test Procedures

38 A t test allows us to compare the means of two groups
The calculations for a t test require three pieces of information: the difference between the means (mean difference), the standard deviation for each group, and the number of subjects (samples) in each group.
[Figure: spelling test scores for two groups, on a scale of 1 to 10]

39 The size of the standard deviation also influences the outcome of a t test.
Given the same difference in means, groups with smaller standard deviations are more likely to report a significant difference than groups with larger standard deviations.
[Figure: two panels of spelling test scores, on a scale of 1 to 10]

40 Smaller standard deviations produce less overlap between the groups
From a practical standpoint, we can see that smaller standard deviations produce less overlap between the groups than larger standard deviations. Less overlap would indicate that the groups are more different from each other.
[Figure: two panels of spelling test scores, on a scale of 1 to 10]

41 Difference of Means Two populations – same or different?

42 How do we determine which t test to use…
1. Are the scores for the two means from the same subject (or related subjects)?
Yes: Paired t test (Dependent t test; Correlated t test). No: go to question 2.
2. Are there the same number of people in the two groups?
Yes: Equal Variance Independent t test (Pooled Variance Independent t test). No: go to question 3.
3. Are the variances of the two groups the same?
Yes (significance level for Levene (or F-Max) is p >= .05): Equal Variance Independent t test (Pooled Variance Independent t test).
No (significance level for Levene (or F-Max) is p < .05): Unequal Variance Independent t test (Separate Variance Independent t test).

43 Difference of Means Two Parametric Methods
Student’s t-test: assumes two normally distributed populations, and that they have equal variance.
Welch’s t-test: assumes two normally distributed populations, and they don’t necessarily have equal variance.

44 Student’s t-test Student’s t-test assumes that the distributions of the two populations have equal but unknown variances. Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2, respectively. If each population is normally distributed with the same mean (µ1 = µ2) and with the same variance, then T (the t-statistic) follows a t-distribution with n1 + n2 − 2 degrees of freedom (df).

45 Student’s t-test
$T = \dfrac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
where $S_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
Significance level α = 0.05
Degrees of freedom df = n1 + n2 − 2
T* is the critical value found using df (from table)

46 Student’s t-test
$T = \dfrac{\bar{x}_1 - \bar{x}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$
where $S_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
$S_p^2$ is the pooled variance
Significance level α = 0.05
Degrees of freedom df = n1 + n2 − 2
T* is the critical value found using df (from table)
If |T| >= T*, the null hypothesis is rejected
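The pooled statistic can be sketched in a few lines of Python. The function name `pooled_t` and the two small samples are hypothetical, chosen only so the arithmetic is easy to follow. The critical value T* would still come from a t table.

```python
import math


def pooled_t(sample1, sample2):
    """Return (T, df) for Student's t-test with a pooled variance."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sample variances use the n - 1 divisor.
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    df = n1 + n2 - 2
    pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    s_p = math.sqrt(pooled_variance)
    t = (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return t, df


# Hypothetical data: both samples have sample variance 2.5, means 7 and 3.
t_stat, df = pooled_t([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])
print(round(t_stat, 3), df)
```

For these samples the pooled variance is 2.5 and the standard error is about 1, so T is about 4 with df = 8.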

47 Welch’s t-test When the equal population variance assumption of Student’s t-test is not justified, Welch’s t-test can be used for the difference of means instead. It is also known as the unequal variances t-test.

48 Welch’s t-test
$T_{welch} = \dfrac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
where $\bar{x}_i$, $s_i^2$ and $n_i$ correspond to the i-th sample mean, sample variance, and sample size. Notice that Welch’s t-test uses the sample variance ($s_i^2$) for each population instead of the pooled sample variance.
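A minimal sketch of the Welch statistic follows. The function name and the two samples are hypothetical. The degrees of freedom for Welch's test are not given on the slide, so the sketch computes only the statistic.

```python
import math


def welch_t(sample1, sample2):
    """Welch's t statistic: each sample keeps its own variance (no pooling)."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


# Hypothetical samples with clearly different spreads and sizes.
wide = [10, 12, 14, 16, 18]   # mean 14, sample variance 10
narrow = [1, 2, 3]            # mean 2, sample variance 1
t_welch = welch_t(wide, narrow)
print(round(t_welch, 3))
```

When the two samples have the same size, this standard error equals the pooled one, so the Student and Welch statistics coincide.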

49 Example

50 t-test independent samples
Example: Some brown hairs were found on the clothing of a victim at a crime scene. The widths of five of the hairs were measured: 46, 57, 54, 51, 38 μm. A suspect is the owner of a shop with similar brown hairs. A sample of the hairs has been taken and their widths measured: 31, 35, 50, 35, 36 μm. Is it possible that the hairs found on the victim were left by the suspect? Test at the 5% level. [From D. Lucy, Introduction to Statistics for Forensic Scientists, Chichester: Wiley, 2005, p. 44.]

51-53 t-test independent samples
1. Calculate the mean and standard deviation for the data sets
                     Dog A    Dog B
                     46       31
                     57       35
                     54       50
                     51       35
                     38       36
Total                246      187
Mean                 49.2     37.4
Standard deviation   7.463    7.301

54-69 t-test independent samples
1. Calculate the mean and standard deviation for the data sets.
2. Calculate the magnitude of the difference between the two means: 49.2 − 37.4 = 11.8
3. Calculate the standard error in the difference: $SE = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}} = \sqrt{\dfrac{7.463^2}{5} + \dfrac{7.301^2}{5}} \approx 4.67$ (3 sf)
4. Calculate the value of T: T = difference between the means ÷ standard error in the difference = 11.8 ÷ 4.669 = 2.527 ≈ 2.53 (3 sig fig)
5. Calculate the degrees of freedom: df = n1 + n2 − 2 = 5 + 5 − 2 = 8
6. Find the critical value T* for the particular significance level you are working to from the table. At the 0.05 level, tcrit = 2.306
If T < T* (critical value), then there is no significant difference between the two sets of data, i.e. the null hypothesis is not rejected.
If T >= T* (critical value), then there is a significant difference between the two sets of data, i.e. the null hypothesis is rejected.
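The steps for the hair-width example can be reproduced with the standard library. This is a sketch: the critical value 2.306 is copied from the slide (df = 8, 0.05 level) rather than computed, and the variable names are illustrative.

```python
import math
import statistics

sample_a = [46, 57, 54, 51, 38]  # hairs found on the victim (μm)
sample_b = [31, 35, 50, 35, 36]  # hairs from the suspect's sample (μm)

# Steps 1 and 2: means and their difference.
difference = statistics.mean(sample_a) - statistics.mean(sample_b)

# Step 3: standard error of the difference, from the sample variances.
standard_error = math.sqrt(
    statistics.variance(sample_a) / len(sample_a)
    + statistics.variance(sample_b) / len(sample_b)
)

# Steps 4 and 5: T value and degrees of freedom.
t_value = difference / standard_error
df = len(sample_a) + len(sample_b) - 2

# Step 6: compare with the tabulated critical value quoted on the slide.
t_critical = 2.306
reject_null = abs(t_value) >= t_critical

print(round(difference, 1), round(standard_error, 2), round(t_value, 2), df, reject_null)
```

Since T ≈ 2.53 exceeds 2.306, the decision rule above rejects the null hypothesis at the 5% level. The two sets of hair widths differ significantly.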

70 Statistical Methods for Evaluation-
Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

71 Advantages of Nonparametric Tests
1. Used with all scales
2. Easier to compute
3. Make fewer assumptions
4. Need not involve population parameters
5. Results may be as exact as parametric procedures

72 Disadvantages of Nonparametric Tests
1. May waste information: the parametric model is more efficient if the data permit
2. Difficult to compute by hand for large samples
3. Tables not widely available

73 Popular Nonparametric Tests
1. Sign test
2. Wilcoxon rank sum test
3. Wilcoxon signed rank test

74 Wilcoxon Rank Sum Test

75 Wilcoxon Rank-Sum Test A Nonparametric Method
Makes no assumptions about the underlying probability distributions

76 Wilcoxon Rank Sum Test
1. Tests two independent population probability distributions
2. Corresponds to the t-test for 2 independent means
3. Assumptions: independent, random samples; populations are continuous
4. Can use normal approximation if ni ≥ 10

77 Wilcoxon Rank Sum Test Procedure
1. Assign ranks, Ri, to the n1 + n2 sample observations. If the sample sizes are unequal, let n1 refer to the smaller-sized sample. Smallest value = rank 1.
2. Sum the ranks, Ti, for each sample. The test statistic is TA (smallest sample).
Null hypothesis: both samples come from the same underlying distribution.
The distribution of T is not quite as simple as the binomial, but it can be computed.

78 Wilcoxon Rank Sum Test Example
You’re a production planner. You want to see if the operating rates for 2 factories are the same. For factory 1, the rates are 71, 82, 77, 92, 88. For factory 2, the rates are 85, 82, 94 and 97. Do the factory rates have the same probability distributions at the .05 level?

79-81 Wilcoxon Rank Sum Test Solution
H0: Identical distributions
Ha: Shifted left or right
α = .05
n1 = 4, n2 = 5

82 Wilcoxon Rank Sum Table 12 (Rosner) (Portion)
α = .05 two-tailed

83 Wilcoxon Rank Sum Test Solution
H0: Identical distributions
Ha: Shifted left or right
α = .10
n1 = 4, n2 = 5
Critical values: 12 and 28. Reject H0 if the rank sum (Σ Ranks) is at or below 12 or at or above 28; do not reject if it falls in between.

84-95 Wilcoxon Rank Sum Test Computation Table
Factory 1          Factory 2
Rate   Rank        Rate   Rank
71     1           85     5
82     3.5         82     3.5
77     2           94     8
92     7           97     9
88     6
Rank Sum 19.5      Rank Sum 25.5
The two tied rates of 82 occupy ranks 3 and 4, so each is assigned the average rank 3.5.

96-98 Wilcoxon Rank Sum Test Solution
H0: Identical distributions
Ha: Shifted left or right
α = .05
n1 = 4, n2 = 5
Critical values: 12 and 28 (reject H0 if the rank sum is at or below 12 or at or above 28)
Test statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (smallest sample)
Decision: Do not reject at α = .05
Conclusion: There is no evidence for unequal distributions
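The ranking and rank sums for the factory example can be sketched as follows. The helper name `average_ranks` is illustrative. The critical values 12 and 28 are copied from the slides rather than derived.

```python
def average_ranks(values):
    """Map each distinct value to its rank; tied values share the average rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return {value: sum(p) / len(p) for value, p in positions.items()}


factory1 = [71, 82, 77, 92, 88]
factory2 = [85, 82, 94, 97]

# Rank all n1 + n2 observations together, smallest value = rank 1.
ranks = average_ranks(factory1 + factory2)

rank_sum_1 = sum(ranks[rate] for rate in factory1)
rank_sum_2 = sum(ranks[rate] for rate in factory2)  # test statistic: smaller sample

lower, upper = 12, 28  # critical values quoted on the slides
reject_null = rank_sum_2 <= lower or rank_sum_2 >= upper

print(rank_sum_1, rank_sum_2, reject_null)
```

The two rates of 82 share rank 3.5, which gives rank sums of 19.5 and 25.5. The test statistic 25.5 lies between the critical values, so the null hypothesis is not rejected.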

99 Statistical Methods for Evaluation-
Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

100 Type I and Type II errors
A Type I error occurs when we reject the null hypothesis although it is true (H0 is wrongly rejected). Its probability is denoted by α.
A Type II error occurs when we fail to reject (accept) the null hypothesis although it is false (H0 is wrongly accepted). Its probability is denoted by β.

101 Type I and Type II errors
Which one is more dangerous Type I or Type II error ? Justify your answer.

102 Statistical Methods for Evaluation-
Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

103 Power and Sample Size
The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted by 1 − β, where β is the probability of a Type II error. The power of a test improves as the sample size increases, so power is used to determine the necessary sample size. The power of a hypothesis test depends on the true difference of the population means: a larger sample size is required to detect a smaller difference in the means. In general, effect size d = difference between the means. It is important to consider an appropriate effect size for the problem at hand.
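The relationship between sample size, effect size and power can be illustrated with a simplified calculation. This sketch assumes a two-sided, two-sample z-test with a known common standard deviation and equal group sizes, which is a simplification of the t-tests above. The function name `approx_power` and the numbers passed to it are hypothetical.

```python
import math
from statistics import NormalDist


def approx_power(effect, sigma, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample z-test (known sigma, equal group sizes)."""
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Standardized distance between the true difference and zero.
    shift = effect / (sigma * math.sqrt(2 / n_per_group))
    return standard_normal.cdf(shift - z_critical) + standard_normal.cdf(-shift - z_critical)


# Fixed effect size d = 0.5 (in units where sigma = 1), growing sample size.
for n in (10, 30, 100):
    print(n, round(approx_power(0.5, 1.0, n), 3))
```

For a fixed effect size the printed power rises with n, and for a fixed n it falls as the effect size shrinks. This is why smaller differences need larger samples.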

104 Power and Sample Size
A larger sample size better identifies a fixed effect size

105 Statistical Methods for Evaluation-
Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA

106 ANOVA (Analysis of Variance)
A generalization of the hypothesis testing of the difference of two population means Good for analyzing more than two populations ANOVA tests if any of the population means differ from the other population means

107 ANOVA (Analysis of Variance)
1. Find the mean for each of the groups.
2. Find the overall mean (the mean of the groups combined).
3. Find the Within Group Variation: the total deviation of each member’s score from the Group Mean.
4. Find the Between Group Variation: the deviation of each Group Mean from the Overall Mean.
5. Find the F statistic, the ratio of Between Group Variation to Within Group Variation (each divided by its degrees of freedom), and look up the F critical value.
6. If F statistic < F critical, do not reject H0; otherwise reject H0 and accept Ha.
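These steps can be sketched for a one-way ANOVA. The function name `one_way_f` and the three small groups are hypothetical. The sketch assumes the usual mean-square form, in which between-group variation is divided by k − 1 and within-group variation by N − k. The F critical value would still come from a table.

```python
def one_way_f(groups):
    """F statistic for a one-way ANOVA over a list of groups of numbers."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    group_means = [sum(g) / len(g) for g in groups]
    overall_mean = sum(sum(g) for g in groups) / n_total

    # Between-group variation: group means around the overall mean.
    ss_between = sum(
        len(g) * (m - overall_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group variation: each score around its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


# Hypothetical groups with means 2, 3 and 6.
f_statistic = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_statistic, 2))
```

For these groups the between-group sum of squares is 26 and the within-group sum of squares is 6, so F = (26/2) / (6/6) = 13.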

108 Syllabus Statistical Methods for Evaluation- Hypothesis testing, difference of means, Wilcoxon rank-sum test, Type I and Type II errors, power and sample size, ANOVA. Advanced Analytical Theory and Methods: Clustering- Overview, K-means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

109 Advanced Analytical Theory and Methods
Clustering- Overview, K means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

110 Overview of Clustering
Clustering is the use of unsupervised techniques for grouping similar objects Supervised methods use labeled objects Unsupervised methods use unlabeled objects Clustering looks for hidden structure in the data, similarities based on attributes Often used for exploratory analysis No predictions are made

111 General Applications of Clustering
Pattern recognition
Spatial data analysis: create thematic maps in GIS by clustering feature spaces; detect spatial clusters and explain them in spatial data mining
Image processing
Economic science (especially market research)
WWW: document classification; cluster Weblog data to discover groups of similar access patterns
July 2, 2019 Data Mining: Concepts and Techniques

112 Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults

113 CLUSTERING Cluster: a collection of data objects similar to one another within the same cluster Dissimilar to the objects in other clusters The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it Data can be clustered on different attributes Clustering differs from classification Unsupervised learning No predefined classes (no a priori knowledge)

114 Cluster analysis: Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

115 Clustering Methods
Centroid: Given a cluster Km of N points {tm1, tm2, …, tmN}, the centroid or middle of the cluster, computed as $C_m = \dfrac{\sum_{i=1}^{N} t_{mi}}{N}$, is considered the representative of the cluster (there may not be any corresponding object).
Medoid: Some algorithms use as representative a centrally located object, called the medoid.

116 Advanced Analytical Theory and Methods
Clustering- Overview, K means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

117 K-means Algorithm Given a collection of objects, each with n measurable attributes, and a chosen value k that is the number of clusters, the algorithm identifies the k clusters of objects based on the objects’ proximity to the centers of the k groups. The algorithm is iterative, with the centers adjusted to the mean of each cluster’s n-dimensional vector of attributes.

118 Use Cases Clustering is often used as a lead-in to classification, where labels are applied to the identified clusters Some applications Image processing With security images, successive frames are examined for change Medical Patients can be grouped to identify naturally occurring clusters Customer segmentation Marketing and sales groups identify customers having similar behaviors and spending patterns

119 Advanced Analytical Theory and Methods
Clustering- Overview, K means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

120 K-Means Example Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1 = 3, m2 = 4
K1 = {2,3}, K2 = {4,10,12,20,30,11,25}, m1 = 2.5, m2 = 16
K1 = {2,3,4}, K2 = {10,12,20,30,11,25}, m1 = 3, m2 = 18
K1 = {2,3,4,10}, K2 = {12,20,30,11,25}, m1 = 4.75, m2 = 19.6
K1 = {2,3,4,10,11,12}, K2 = {20,30,25}, m1 = 7, m2 = 25
Stop, as the next assignment with these means gives the same clusters.
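The iterations above can be reproduced with a short one-dimensional sketch. The function name `kmeans_1d` is illustrative. It assumes absolute difference as the distance and stops when the means no longer change.

```python
def kmeans_1d(points, means, max_iterations=100):
    """One-dimensional k-means: returns the final clusters and their means."""
    clusters = [[] for _ in means]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (keep the old mean if a cluster is empty).
        new_means = [
            sum(c) / len(c) if c else m for c, m in zip(clusters, means)
        ]
        if new_means == means:
            break
        means = new_means
    return clusters, means


points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(points, [3, 4])
print(sorted(clusters[0]), sorted(clusters[1]), means)
```

Starting from m1 = 3 and m2 = 4, the means pass through (2.5, 16), (3, 18), (4.75, 19.6) and settle at (7, 25), matching the slide.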

121 K-means Method Four Steps
1. Choose the value of k and the initial guesses for the centroids.
2. Compute the distance from each data point to each centroid, and assign each point to the closest centroid.
3. Compute the centroid of each newly defined cluster from step 2.
4. Repeat steps 2 and 3 until the algorithm converges (no changes occur).

122 K-means Method- for two dimension Example – Step 1
Choose the value of k and the k initial guesses for the centroids. In this example, k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue

123 K-means Method- for two dimension Example – Step 2
Points are assigned to the closest centroid. In two dimensions, the distance, d, between any two points, (x1, y1) and (x2, y2), is expressed by the Euclidean distance measure: $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

124 K-means Method- for two dimension Example – Step 3
Compute centroids of the new clusters. In two dimensions, the centroid (Xc, Yc) of the m points is calculated as follows: $(X_c, Y_c) = \left( \dfrac{\sum_{i=1}^{m} X_i}{m}, \dfrac{\sum_{i=1}^{m} Y_i}{m} \right)$

125 K-means Method- for two dimension Example – Step 4
Repeat steps 2 and 3 until convergence. Convergence occurs when the centroids do not change or when the centroids oscillate back and forth. Oscillation can occur when one or more points have equal distances from the centroid centers.

126 K-means - for n dimension
To generalize the prior algorithm to n dimensions, suppose there are M objects, where each object is described by n attributes or property values (P1, P2, …, Pn). Then object i is described by (Pi1, Pi2, …, Pin) for i = 1, 2, ..., M.
For a given point, Pi, at (Pi1, Pi2, …, Pin) and a centroid, q, located at (q1, q2, …, qn), the distance, d, between Pi and q is expressed as: $d(P_i, q) = \sqrt{\sum_{j=1}^{n} (P_{ij} - q_j)^2}$
The centroid q of a cluster of m points, (Pi1, Pi2, …, Pin), is calculated as: $(q_1, q_2, \ldots, q_n) = \left( \dfrac{\sum_{i=1}^{m} P_{i1}}{m}, \dfrac{\sum_{i=1}^{m} P_{i2}}{m}, \ldots, \dfrac{\sum_{i=1}^{m} P_{in}}{m} \right)$
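The n-dimensional distance and centroid formulas translate directly into code. This is a minimal sketch with points represented as tuples. The function names and the sample points are illustrative.

```python
import math


def distance(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))


def centroid(points):
    """Attribute-wise mean of a list of points."""
    m = len(points)
    return tuple(sum(attribute) / m for attribute in zip(*points))


# Hypothetical three-attribute point and two-attribute cluster.
print(distance((0, 0, 0), (1, 2, 2)))        # sqrt(1 + 4 + 4)
print(centroid([(1, 2), (3, 4), (5, 6)]))    # mean of each attribute
```

With two attributes, these reduce to the two-dimensional distance and centroid used in steps 2 and 3.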

127 Advanced Analytical Theory and Methods
Clustering- Overview, K means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

128 Determining Number of Clusters
k clusters can be identified in a given dataset, but what value of k should be selected? The value of k can be chosen based on a reasonable guess or some predefined requirement. How to know better or worse having k clusters versus k – 1 or k + 1 clusters Solution: Use heuristic – e.g., Within Sum of Squares (WSS) WSS metric is the sum of the squares of the distances between each data point and the closest centroid The process of identifying the appropriate value of k is referred to as finding the “elbow” of the WSS curve

129 Determining Number of Clusters (WSS Method)
Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters. For each k, calculate the total within-cluster sum of squares (WSS). The within-cluster sum of squares of a single cluster Ck is $W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2$, where xi is a data point belonging to the cluster Ck and μk is the mean value of the points assigned to the cluster Ck; the total WSS is the sum of W(Ck) over all clusters. Plot the curve of WSS against the number of clusters k. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters.
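A rough sketch of the WSS procedure follows. The tiny `kmeans` helper, the random seeding from data points, the choice of five seeds per k, and the sample data are all assumptions made for illustration; a real analysis would normally rely on a library implementation and plot the resulting values.

```python
import math
import random

def kmeans(points, k, seed, max_iter=100):
    """Plain k-means seeded with k distinct data points; returns final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        updated = [tuple(sum(v) / len(cl) for v in zip(*cl)) if cl else centroids[i]
                   for i, cl in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids

def wss(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

# Hypothetical data: three compact groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10),
          (20, 1), (20, 2), (21, 1)]

# For each k, keep the lowest WSS found over a few random starts.
wss_by_k = {k: min(wss(points, kmeans(points, k, seed)) for seed in range(5))
            for k in range(1, 6)}
for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

Plotting `wss_by_k` against k and looking for the bend is the manual step that this sketch leaves to the reader.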

130 Determining Number of Clusters Example of WSS vs #Clusters curve
The elbow of the curve appears to occur at k = 3.

132 Diagnostics
When the number of clusters is small, plotting the data helps refine the choice of k. The following questions should be considered: Are the clusters well separated from each other? Do any of the clusters have only a few points? Do any of the centroids appear to be too close to each other?
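Two of these questions can also be checked numerically alongside the plot. The sketch below assumes the clustering result is available as a list of cluster labels plus a list of centroids; the `diagnostics` helper and the sample values are hypothetical, and what counts as "few points" or "too close" remains a judgment call.

```python
import math
from collections import Counter
from itertools import combinations

def diagnostics(labels, centroids):
    """Return the cluster sizes and the distance between every pair of centroids."""
    sizes = Counter(labels)
    gaps = {(i, j): math.dist(centroids[i], centroids[j])
            for i, j in combinations(range(len(centroids)), 2)}
    return sizes, gaps

# Hypothetical result: cluster 2 holds a single point and sits next to cluster 1.
labels = [0, 0, 0, 0, 1, 1, 1, 2]
centroids = [(1.0, 1.0), (8.0, 8.0), (8.0, 9.0)]
sizes, gaps = diagnostics(labels, centroids)
print(sizes)
print(gaps)
```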

133 Diagnostics Example of distinct clusters

134 Diagnostics Example of less obvious clusters

135 Diagnostics Six clusters from points of previous figure

137 Reasons to Choose and Cautions
Decisions the practitioner must make: What object attributes should be included in the analysis? What unit of measure should be used for each attribute? Do the attributes need to be rescaled? What other considerations might apply?

138 Reasons to Choose and Cautions Object Attributes
It is important to understand what attributes will be known at the time a new object is assigned to a cluster. E.g., information on existing customers' satisfaction or purchase frequency may be available, but such information may not be available for potential customers. Similarly, information such as the age and income of existing customers may be on record but unavailable for new customers. It is best to reduce the number of attributes when possible, because too many attributes dilute the impact of key variables. Identify highly correlated attributes for reduction, or combine several attributes into one, e.g., a debt/asset ratio.
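One way to look for highly correlated attributes is to compute the Pearson correlation for every pair of attributes and flag the pairs above a cutoff. The attribute names, the values, and the 0.9 threshold below are assumptions chosen purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical customer attributes; "spending" is exactly half of "income".
data = {
    "income": [30, 45, 60, 80, 100],
    "spending": [15, 22.5, 30, 40, 50],
    "age": [25, 52, 31, 47, 38],
}

names = list(data)
threshold = 0.9  # illustrative cutoff, not a standard value
flagged = [(a, b)
           for i, a in enumerate(names)
           for b in names[i + 1:]
           if abs(pearson(data[a], data[b])) > threshold]
print(flagged)
```

A flagged pair is a candidate for dropping one attribute or merging the two, in the spirit of the debt/asset ratio example.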

139 Reasons to Choose and Cautions Object attributes: scatterplot matrix for seven attributes

140 Reasons to Choose and Cautions Units of Measure
The k-means algorithm will identify different clusters depending on the units of measure. (Figure: clusters for k = 2.)

141 Reasons to Choose and Cautions Units of Measure
Age dominates. (Figure: clusters for k = 2.)

142 Reasons to Choose and Cautions Rescaling
Rescaling can reduce the domination effect, e.g., divide each variable by the appropriate standard deviation. (Figure: rescaled attributes, clusters for k = 2.)
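A small sketch of this rescaling, assuming the data is a list of records with one numeric value per attribute. The sample standard deviation is used here as one reading of "appropriate", the age and height values are invented, and an attribute with zero spread would need separate handling.

```python
import statistics

def rescale(rows):
    """Divide every attribute (column) by its sample standard deviation."""
    columns = list(zip(*rows))
    sds = [statistics.stdev(col) for col in columns]
    return [tuple(value / sd for value, sd in zip(row, sds)) for row in rows]

# Hypothetical records: (age in years, height in metres).
people = [(20, 1.60), (30, 1.70), (40, 1.80), (50, 1.90)]
scaled = rescale(people)
for row in scaled:
    print(row)
```

After rescaling, each attribute has a standard deviation of 1, so neither age nor height dominates the distance purely because of its unit.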

143 Reasons to Choose and Cautions Additional Considerations
K-means is sensitive to the starting seeds, so it is important to rerun it with several seeds; R's kmeans() has the nstart option for this. Distance metrics other than Euclidean could be explored, e.g., Manhattan, Mahalanobis, etc. K-means is easily applied to numeric data and does not work well with nominal attributes, e.g., color.
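To see why the choice of distance metric matters, the sketch below compares Euclidean and Manhattan distance for a single point and two centroids. The coordinates are contrived so that the two metrics disagree about which centroid is closest; the function names are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute attribute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def closest(point, centroids, metric):
    """Index of the centroid nearest to point under the given metric."""
    return min(range(len(centroids)), key=lambda c: metric(point, centroids[c]))

# Hypothetical point and centroids.
point = (0, 0)
centroids = [(3, 3), (0, 5)]
by_euclidean = closest(point, centroids, euclidean)  # sqrt(18) vs 5
by_manhattan = closest(point, centroids, manhattan)  # 6 vs 5
print(by_euclidean, by_manhattan)
```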

144 Additional Algorithms
K-modes clustering: kmod(). Partitioning around Medoids (PAM): pam(). Hierarchical agglomerative clustering: hclust().

145 Summary
Clustering analysis groups similar objects based on the objects' attributes. To use k-means properly, it is important to: properly scale the attribute values to avoid domination; assure that the concept of distance between the assigned values of an attribute is meaningful; and carefully choose the number of clusters, k. Once the clusters are identified, it is often useful to label them in a descriptive way.
