Data Analytics (BE-2015 Pattern) Unit II Basic Data Analytic Methods


Data Analytics (BE-2015 Pattern) Unit II Basic Data Analytic Methods By Prof. B.A.Khivsara Note: The material used to prepare this presentation has been taken from the internet and is intended only for students' reference, not for commercial use.

Syllabus Statistical Methods for Evaluation- Hypothesis testing, difference of means, Wilcoxon rank-sum test, type I and type II errors, power and sample size, ANOVA Advanced Analytical Theory and Methods: Clustering- Overview, K-means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

Statistical Methods for Evaluation- Hypothesis testing, difference of means, Wilcoxon rank-sum test, type I and type II errors, power and sample size, ANOVA

What is a Hypothesis? A hypothesis is an educated guess about something in the world around you. It should be testable, either by experiment or observation. For example: A new medicine you think might work. A way of teaching you think might be better.

What is a Hypothesis Statement? Hypothesis statement will look like this: “If I…(do this to an independent variable)….then (this will happen to the dependent variable).” For example: If I (decrease the amount of water given to herbs) then (the herbs will increase in size). If I (give patients counseling in addition to medication) then (their overall depression scale will decrease).

What is Hypothesis Testing? Hypothesis testing refers to:
1. Making an assumption, called a hypothesis, about a population parameter.
2. Collecting sample data.
3. Calculating a sample statistic.
4. Using the sample statistic to evaluate the hypothesis.

Hypothesis Testing :Population & sample

Hypothesis Testing
Null hypothesis, H0: states the hypothesized value of the parameter before sampling. It is the assumption we wish to test (or are trying to reject). E.g. µ = 20. There is no difference between Coke and Diet Coke.
Alternative hypothesis, HA: all possible alternatives other than the null hypothesis. E.g. µ ≠ 20, µ > 20, µ < 20. There is a difference between Coke and Diet Coke.

Hypothesis Testing Basic concept is to form an assertion and test it with data Common assumption is that there is no difference between samples (default assumption) Statisticians refer to this as the null hypothesis (H0) The alternative hypothesis (HA) is that there is a difference between samples

What are the Null & Alternative Hypotheses? The null hypothesis is always the accepted fact. Simple examples of null hypotheses that are generally accepted as being true are: DNA is shaped like a double helix. There are 8 planets in the solar system (excluding Pluto). Given a population, the initial (assumed) hypothesis to be tested, H0, is called the null hypothesis. Rejecting the null hypothesis leads to another hypothesis, H1, called the alternative hypothesis, being adopted.

Mean, variance, standard deviation: notation
Mean (or average): denoted by μ if working with a population, X̄ if working with samples.
Variance: denoted by σ² (for a population), s² (for a sample).
Standard deviation: denoted by σX or σ (for a population), sX or s (for a sample).

Mean: the simple average of the given data values. Example: 4, 5, 9, 3, 15, 6. Mean X̄ = (4 + 5 + 9 + 3 + 15 + 6) / 6 = 42 / 6 = 7

Variance: a measure of how data points differ from the mean. Data Set 1: 3, 5, 7, 10, 10. Data Set 2: 7, 7, 7, 7, 7. What is the mean of each data set? Data Set 1: mean = 7. Data Set 2: mean = 7. But we know that the two data sets are not identical! The variance shows how they are different: it gives us a way to represent the spread of each data set numerically.

How to Calculate variance? If we conceptualize the spread of a distribution as the extent to which the values in the distribution differ from the mean and from each other, then a reasonable measure of spread might be the average deviation, or difference, of the values from the mean.

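How to Calculate variance? The average of the squared deviations about the mean is called the variance. For population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. For sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$.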

Example 1- Variance (Data Set 1: 3, 5, 7, 10, 10). The mean is 35/5 = 7.

Score    X     X − mean      (X − mean)²
1        3     3−7 = −4      16
2        5     5−7 = −2      4
3        7     7−7 = 0       0
4        10    10−7 = 3      9
5        10    10−7 = 3      9
Totals   35                  38

Variance = 38/5 = 7.6

Example 1- Variance (Data Set 2: 7, 7, 7, 7, 7). The mean is 35/5 = 7. Every score has deviation 7−7 = 0, so the squared deviations total 0 and the variance is 0/5 = 0.

Example 2- Variance

Dive    Mark    Myrna
1       28      27
2       22      27
3       21      28
4       26      6
5       18      27

Which diver was more consistent?

Example 2- Variance. Mark's mean score is 115/5 = 23.

Dive     Mark's Score X    X − mean    (X − mean)²
1        28                5           25
2        22                −1          1
3        21                −2          4
4        26                3           9
5        18                −5          25
Totals   115                           64

Mark's Variance = 64 / 5 = 12.8. Myrna's Variance = 362 / 5 = 72.4. Conclusion: Mark has a lower variance, therefore he is more consistent.

Standard deviation: a measure of variation of scores about the mean. It can be thought of as the typical distance of a score from the mean. A higher standard deviation indicates higher spread, less consistency, and less clustering. Sample standard deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$. Population standard deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$.

Example – Standard Deviation. Mark's five scores (28, 22, 21, 26, 18) have mean 23 and squared deviations totalling 64, so Mark's Variance = 64 / 5 = 12.8. Mark's Standard Deviation for a population = $\sqrt{64/5} = \sqrt{12.8} \approx 3.58$. Mark's Standard Deviation for a sample = $\sqrt{64/4} = \sqrt{16} = 4$.
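The diver example can be checked with Python's standard `statistics` module. This is a minimal sketch: the variable names are illustrative, `pvariance` and `pstdev` divide by N (population), and `stdev` divides by n − 1 (sample).

```python
from statistics import mean, pstdev, pvariance, stdev

# Scores from the two-diver example.
mark = [28, 22, 21, 26, 18]
myrna = [27, 27, 28, 6, 27]

# Both divers have the same mean, so the mean alone cannot separate them.
mark_mean, myrna_mean = mean(mark), mean(myrna)

# Population variance: average squared deviation from the mean (divide by N).
mark_var, myrna_var = pvariance(mark), pvariance(myrna)

# Standard deviation is the square root of the variance, not variance / n.
mark_sd_population = pstdev(mark)  # sqrt(64 / 5)
mark_sd_sample = stdev(mark)       # sqrt(64 / 4)

print(mark_mean, myrna_mean)
print(mark_var, myrna_var)
print(round(mark_sd_population, 2), mark_sd_sample)
```

The lower variance for Mark is what supports the conclusion that he is the more consistent diver.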

Example- Variance & Standard Deviation You have just measured the heights of your dogs (in mm) The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm. Find out the Mean, the Variance, and the Standard Deviation.

Example- Variance & Standard Deviation. Your first step is to find the Mean: Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394

Example- Variance & Standard Deviation Now we calculate each dog's difference from the Mean

Example- Variance & Standard Deviation. To calculate the Variance, take each difference, square it, and then average the result: $\sigma^2 = \frac{206^2 + 76^2 + (-224)^2 + 36^2 + (-94)^2}{5} = \frac{42436 + 5776 + 50176 + 1296 + 8836}{5} = \frac{108520}{5} = 21704$. So the Variance σ² is 21,704.

Example- Variance & Standard Deviation. And the Standard Deviation is just the square root of the Variance, so: $\sigma = \sqrt{21704} = 147.32\ldots \approx 147$ (to the nearest mm)

Example- Variance & Standard Deviation And the good thing about the Standard Deviation is that it is useful. Now we can show which heights are within one Standard Deviation (147mm) of the Mean: So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is extra large or extra small.
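As a rough check of the dog-height example, the sketch below recomputes the mean, population variance and standard deviation, then lists which heights fall within one standard deviation of the mean. The variable names are illustrative only.

```python
from statistics import mean, pstdev, pvariance

# Shoulder heights of the five dogs, in mm.
heights_mm = [600, 470, 170, 430, 300]

mu = mean(heights_mm)                # 1970 / 5
variance = pvariance(heights_mm)     # average squared difference from the mean
sigma = pstdev(heights_mm)           # square root of the variance

# Heights inside the band [mu - sigma, mu + sigma] count as "normal" here.
low, high = mu - sigma, mu + sigma
within_one_sd = [h for h in heights_mm if low <= h <= high]
outside_one_sd = [h for h in heights_mm if not low <= h <= high]

print(mu, variance, round(sigma, 2))
print("within one SD:", within_one_sd)
print("outside one SD:", outside_one_sd)
```

With these numbers the band runs from about 247 mm to about 541 mm, so the tallest and shortest dogs fall outside it.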

Difference of means: procedure
1. State the hypotheses.
2. Formulate an analysis plan.
3. Analyze sample data using a hypothesis test.
4. Interpret results.

Hypothesis Testing Procedures EPI 809 / Spring 2008 Many More Tests Exist!

Parametric Test Procedures 1. Involve Population Parameters (e.g. the Mean) 2. Have Stringent (strict) Assumptions (e.g. Normality) 3. Examples: Z Test, t Test, χ² Test, F Test

Nonparametric Test Procedures 1. Do Not Involve Population Parameters. Examples: Probability Distributions, Independence 2. Data Measured on Any Scale (Ratio or Interval, Ordinal or Nominal) 3. Example: Wilcoxon Rank Sum Test

Parametric Test Procedures

A t test allows us to compare the means of two groups. The calculations for a t test require three pieces of information: the difference between the means (mean difference), the standard deviation for each group, and the number of subjects (samples) in each group. [Figure: Spelling Test Scores for the two groups]

The size of the standard deviation also influences the outcome of a t test. Given the same difference in means, groups with smaller standard deviations are more likely to report a significant difference than groups with larger standard deviations. [Figure: Spelling Test Scores, two panels]

From a practical standpoint, we can see that smaller standard deviations produce less overlap between the groups than larger standard deviations. Less overlap would indicate that the groups are more different from each other. [Figure: Spelling Test Scores, two panels]

Difference of Means Two populations – same or different?

How do we determine which t test to use…
1. Are the scores for the two means from the same subject (or related subjects)? Yes: Paired t test (Dependent t-test; Correlated t-test). No: go to question 2.
2. Are there the same number of people in the two groups? Yes: Equal Variance Independent t test (Pooled Variance Independent t-test). No: go to question 3.
3. Are the variances of the two groups the same? Yes (Significance Level for Levene (or F-Max) is p >= .05): Equal Variance Independent t test (Pooled Variance Independent t test). No (Significance Level for Levene (or F-Max) is p < .05): Unequal Variance Independent t-test (Separate Variance Independent t test).
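The decision chart can be read as a small function. The sketch below is only one way to encode it: the function name, argument names and returned labels are hypothetical, and it assumes the usual reading of Levene's test, where a p-value below the significance level means the variances differ.

```python
def choose_t_test(paired: bool, equal_group_sizes: bool,
                  levene_p: float, alpha: float = 0.05) -> str:
    """Return the t test suggested by the three questions in the chart."""
    # Question 1: same (or related) subjects measured twice?
    if paired:
        return "paired t test"
    # Question 2: equal group sizes lead straight to the pooled-variance test.
    if equal_group_sizes:
        return "equal variance (pooled) independent t test"
    # Question 3: a non-significant Levene result means the variances
    # can be treated as equal.
    if levene_p >= alpha:
        return "equal variance (pooled) independent t test"
    return "unequal variance (separate variance) independent t test"


print(choose_t_test(paired=False, equal_group_sizes=False, levene_p=0.01))
```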

Difference of Means Two Parametric Methods Student’s t-test Assumes two normally distributed populations, and that they have equal variance Welch’s t-test Assumes two normally distributed populations, and they don’t necessarily have equal variance

Student's t-test. Student's t-test assumes that the distributions of the two populations have equal but unknown variances. Suppose n1 and n2 samples are randomly and independently selected from two populations, pop1 and pop2, respectively. If each population is normally distributed with the same mean (µ1 = µ2) and with the same variance, then T (the t-statistic) follows a t-distribution with n1 + n2 − 2 degrees of freedom (df).

Student's t-test: $T = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $S_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ is the pooled variance. Significance level α = 0.05. Degrees of freedom df = n1 + n2 − 2. T* is the critical value found using df (from the table). If |T| >= T*, the null hypothesis is rejected.

Welch's t-test. When the equal population variance assumption is not justified in performing Student's t-test for the difference of means, Welch's t-test can be used. It is also known as the unequal variances t-test.

Welch's t-test: $T_{welch} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$, where $\bar{x}_i$, $s_i^2$ and $n_i$ correspond to the i-th sample mean, sample variance, and sample size. Notice that Welch's t-test uses the sample variance ($s_i^2$) for each population instead of the pooled sample variance.
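To make the difference between the two statistics concrete, here is a small sketch that computes both from raw samples using only the standard library. The function names and the sample values are made up for illustration; `statistics.variance` is the sample variance (divides by n − 1).

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Student's T: uses the pooled variance of the two samples."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


def welch_t(a, b):
    """Welch's T: uses each sample's own variance, no pooling."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical samples with different sizes and spreads.
group_a = [10, 12, 14, 16]
group_b = [9, 10, 11, 12, 13, 30]
print(student_t(group_a, group_b), welch_t(group_a, group_b))
```

When the two samples have the same size the two statistics coincide; they differ only when sizes differ and the sample variances are unequal.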

Example

t-test independent samples Example. Some brown dog hairs were found on the clothing of a victim at a crime scene. The widths of five of the hairs were measured: 46, 57, 54, 51, 38 μm. A suspect is the owner of a dog with similar brown hairs. A sample of the dog's hairs has been taken and their widths measured: 31, 35, 50, 35, 36 μm. Is it possible that the hairs found on the victim were left by the suspect's dog? Test at the 5% level. [From D. Lucy Introduction to Statistics for Forensic Scientists Chichester: Wiley, 2005 p. 44.]

t-test independent samples: worked solution

1. Calculate the mean and standard deviation for the data sets.

                      Dog A     Dog B
                      46        31
                      57        35
                      54        50
                      51        35
                      38        36
Total                 246       187
Mean                  49.2      37.4
Standard deviation    7.463     7.301

2. Calculate the magnitude of the difference between the two means: 49.2 − 37.4 = 11.8
3. Calculate the standard error in the difference: $\sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}} = \sqrt{\frac{7.463^2}{5} + \frac{7.301^2}{5}} = 4.669 \approx 4.67$ (3 sf)
4. Calculate the value of T: T = difference between the means ÷ standard error in the difference = 11.8 / 4.669 = 2.527 ≈ 2.53 (3 sig fig)
5. Calculate the degrees of freedom = n1 + n2 − 2 = 5 + 5 − 2 = 8
6. Find the critical value T* for the particular significance level you are working to from the table. At the 0.05 level tcrit = 2.306.
7. Compare T with T*. If T < T* (critical value) then there is no significant difference between the two sets of data, i.e. the null hypothesis is not rejected. If T >= T* (critical value) then there is a significant difference between the two sets of data, i.e. the null hypothesis is rejected. Here T = 2.53 > 2.306, so the null hypothesis is rejected at the 5% level: the two sets of hair widths differ significantly.
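The steps of the hair-width example can be reproduced in a few lines. This is a sketch using only the standard library: the variable names are illustrative, and the critical value 2.306 is simply copied from the slide rather than looked up in code.

```python
from math import sqrt
from statistics import mean, stdev

# Hair widths in micrometres; labels follow the table (Dog A, Dog B).
dog_a = [46, 57, 54, 51, 38]
dog_b = [31, 35, 50, 35, 36]

# Steps 1-2: means, sample standard deviations, difference of means.
diff = mean(dog_a) - mean(dog_b)

# Step 3: standard error of the difference.
se = sqrt(stdev(dog_a) ** 2 / len(dog_a) + stdev(dog_b) ** 2 / len(dog_b))

# Steps 4-5: test statistic and degrees of freedom.
t_stat = diff / se
df = len(dog_a) + len(dog_b) - 2

# Steps 6-7: compare with the tabulated critical value for df = 8, alpha = 0.05.
t_crit = 2.306
reject_h0 = abs(t_stat) >= t_crit

print(round(diff, 1), round(se, 3), round(t_stat, 3), df, reject_h0)
```

Because both samples have five observations, the pooled (Student) and unpooled (Welch) standard errors are identical here.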

Advantages of Nonparametric Tests 1. Used With All Scales 2. Easier to Compute 3. Make Fewer Assumptions 4. Need Not Involve Population Parameters 5. Results May Be as Exact as Parametric Procedures

Disadvantages of Nonparametric Tests 1. May Waste Information: a Parametric Model Is More Efficient if the Data Permit 2. Difficult to Compute by Hand for Large Samples 3. Tables Not Widely Available

Popular Nonparametric Tests 1. Sign Test 2. Wilcoxon Rank Sum Test 3. Wilcoxon Signed Rank Test

Wilcoxon Rank Sum Test

Wilcoxon Rank-Sum Test A Nonparametric Method Makes no assumptions about the underlying probability distributions

Wilcoxon Rank Sum Test 1. Tests Two Independent Population Probability Distributions 2. Corresponds to t-Test for 2 Independent Means 3. Assumptions: Independent, Random Samples; Populations Are Continuous 4. Can Use Normal Approximation If ni ≥ 10

Wilcoxon Rank Sum Test Procedure
1. Assign Ranks, Ri, to the n1 + n2 Sample Observations. If Unequal Sample Sizes, Let n1 Refer to the Smaller-Sized Sample. The Smallest Value Gets Rank 1.
2. Sum the Ranks, Ti, for Each Sample. The Test Statistic Is the Rank Sum of the Smallest Sample.
Null hypothesis: both samples come from the same underlying distribution. The distribution of T is not quite as simple as the binomial, but it can be computed.

Wilcoxon Rank Sum Test Example. You're a production planner. You want to see if the operating rates for 2 factories are the same. For factory 1, the rates are 71, 82, 77, 92, 88. For factory 2, the rates are 85, 82, 94 & 97. Do the factory rates have the same probability distributions at the .05 level?

Wilcoxon Rank Sum Test Solution (setup). H0: Identical Distrib. Ha: Shifted Left or Right. α = .05, n1 = 4, n2 = 5. Critical Value(s), from Wilcoxon Rank Sum Table 12 (Rosner) (Portion), α = .05 two-tailed: 12 and 28. On the Σ Ranks axis the Reject regions lie beyond 12 and beyond 28, with Do Not Reject between them.

Wilcoxon Rank Sum Test Computation Table

Factory 1              Factory 2
Rate      Rank         Rate      Rank
71        1            85        5
82        3.5          82        3.5
77        2            94        8
92        7            97        9
88        6
Rank Sum  19.5         Rank Sum  25.5

The two tied rates of 82 would occupy ranks 3 and 4, so each receives the average rank 3.5.

Wilcoxon Rank Sum Test Solution. H0: Identical Distrib. Ha: Shifted Left or Right. α = .05, n1 = 4, n2 = 5. Critical Value(s): 12 and 28. Test Statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (Smallest Sample). Decision: Do Not Reject at α = .05, since 25.5 lies between 12 and 28. Conclusion: There is no evidence for unequal distributions.
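The ranking and rank sums for the factory example can be sketched as follows. The helper `average_ranks` is a hypothetical name written for this illustration; the critical values 12 and 28 are taken from the slide, and treating the boundaries themselves as part of the reject region is an assumption.

```python
def average_ranks(values):
    """Map each distinct value to its rank, averaging ranks over ties."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        # Sorted positions i..j-1 hold ranks i+1..j; their average is (i+1+j)/2.
        ranks[ordered[i]] = (i + 1 + j) / 2
        i = j
    return ranks


factory1 = [71, 82, 77, 92, 88]
factory2 = [85, 82, 94, 97]   # smaller sample, so its rank sum is the statistic

ranks = average_ranks(factory1 + factory2)
t1 = sum(ranks[x] for x in factory1)
t2 = sum(ranks[x] for x in factory2)

lower, upper = 12, 28          # critical values from the slide
reject_h0 = t2 <= lower or t2 >= upper   # boundary handling is an assumption

print(t1, t2, reject_h0)
```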

Type I and Type II errors. A Type I error occurs when we reject the null hypothesis although it is true (H0 is wrongly rejected). Its probability is denoted by α. A Type II error occurs when we fail to reject the null hypothesis although it is false (H0 is wrongly accepted). Its probability is denoted by β.
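One way to see what α means is to simulate many experiments in which the null hypothesis really is true and count how often it gets rejected. The sketch below is an illustration under stated assumptions, not part of the original slides: both groups are drawn from the same normal population with known σ = 1, a two-sided z-test is used instead of a t-test to stay within the standard library, and the function name and defaults are hypothetical.

```python
import random
from statistics import NormalDist


def type_i_error_rate(trials=4000, n=20, alpha=0.05, seed=0):
    """Fraction of true-null experiments in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    standard_error = (2 / n) ** 0.5                # sigma = 1 in both groups
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]    # same population as a
        z = (sum(a) / n - sum(b) / n) / standard_error
        if abs(z) >= z_crit:
            rejections += 1                        # a Type I error
    return rejections / trials


rate = type_i_error_rate()
print(rate)
```

The simulated rate should land near the chosen α, which is the sense in which α is the probability of a Type I error.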

Type I and Type II errors Which one is more dangerous Type I or Type II error ? Justify your answer.

Power and Sample Size. The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted by 1 − β, where β is the probability of a type II error. The power of a test improves as the sample size increases, so power is used to determine the necessary sample size. The power of a hypothesis test depends on the true difference of the population means: a larger sample size is required to detect a smaller difference in the means. In general, effect size d = difference between the means. It is important to consider an appropriate effect size for the problem at hand.

Power and Sample Size. A larger sample size better identifies a fixed effect size.
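The relationship between sample size and power can be illustrated with a normal approximation. This sketch assumes a two-sided two-sample z-test with a known common standard deviation, which is a simplification not stated in the slides; `approx_power` and its parameters are hypothetical names.

```python
from statistics import NormalDist


def approx_power(effect, sigma, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) for detecting a mean difference `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How many standard errors the true difference is away from zero.
    shift = effect / (sigma * (2 / n_per_group) ** 0.5)
    # Probability that the statistic lands in either rejection tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Fixed effect size d = 0.5 (with sigma = 1), growing sample size.
powers = [approx_power(0.5, 1.0, n) for n in (10, 20, 50, 100)]
print([round(p, 3) for p in powers])
```

For a fixed effect size the power rises with n, and for a fixed n a smaller effect gives lower power, which is why smaller differences need larger samples.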

ANOVA (Analysis of Variance) A generalization of the hypothesis testing of the difference of two population means Good for analyzing more than two populations ANOVA tests if any of the population means differ from the other population means

ANOVA (Analysis of Variance): steps
1. Find the mean for each of the groups.
2. Find the overall mean (the mean of the groups combined).
3. Find the Within Group Variation: the total (squared) deviation of each member's score from its Group Mean.
4. Find the Between Group Variation: the (squared) deviation of each Group Mean from the Overall Mean.
5. Find the F statistic, the ratio of Between Group Variation to Within Group Variation (each divided by its degrees of freedom), and look up F critical.
6. If F statistic < F critical, do not reject H0; else reject H0 and accept Ha.
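The ANOVA steps can be sketched as a short function. This is an illustration rather than the slides' own procedure: the mean-square form with k − 1 and N − k degrees of freedom is the standard one-way ANOVA convention, which the slides only summarise as a ratio, and the function name and sample data are made up.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    overall_mean = sum(sum(g) for g in groups) / n_total

    # Between-group variation: group means against the overall mean.
    ss_between = sum(len(g) * (mean(g) - overall_mean) ** 2 for g in groups)
    # Within-group variation: each score against its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)        # df between = k - 1
    ms_within = ss_within / (n_total - k)    # df within = N - k
    return ms_between / ms_within


# Hypothetical scores for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

The resulting F statistic is then compared with the tabulated F critical value for (k − 1, N − k) degrees of freedom.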

Advanced Analytical Theory and Methods Clustering- Overview, K means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

Overview of Clustering Clustering is the use of unsupervised techniques for grouping similar objects Supervised methods use labeled objects Unsupervised methods use unlabeled objects Clustering looks for hidden structure in the data, similarities based on attributes Often used for exploratory analysis No predictions are made

General Applications of Clustering Pattern Recognition Spatial Data Analysis create thematic maps in GIS by clustering feature spaces detect spatial clusters and explain them in spatial data mining Image Processing Economic Science (especially market research) WWW Document classification Cluster Weblog data to discover groups of similar access patterns July 2, 2019 Data Mining: Concepts and Techniques

Examples of Clustering Applications. Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs. Land use: Identification of areas of similar land use in an earth observation database. Insurance: Identifying groups of motor insurance policy holders with a high average claim cost. City-planning: Identifying groups of houses according to their house type, value, and geographical location. Earthquake studies: Observed earthquake epicenters should be clustered along continental faults.

CLUSTERING Cluster: a collection of data objects similar to one another within the same cluster Dissimilar to the objects in other clusters The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it Data can be clustered on different attributes Clustering differs from classification Unsupervised learning No predefined classes (no a priori knowledge)

Cluster analysis: Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Clustering Methods: Centroid and Medoid. Given a cluster $K_m$ of N points $\{t_{m1}, t_{m2}, \ldots, t_{mN}\}$, the centroid or middle of the cluster, computed as $C_m = \frac{1}{N}\sum_{i=1}^{N} t_{mi}$, is considered the representative of the cluster (there may not be any corresponding object). Some algorithms instead use as representative a centrally located object called the medoid.
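The difference between the two representatives can be shown on a tiny cluster. In this sketch the function names and points are hypothetical, and the medoid is taken to be the member with the smallest total Euclidean distance to the other members, which is one common definition.

```python
def centroid(points):
    """Coordinate-wise mean of the points; need not be one of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """Member of the cluster with the smallest total distance to the rest."""
    def total_distance(p):
        return sum(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in points
        )
    return min(points, key=total_distance)


# Hypothetical 2-D cluster with one far-away point.
cluster = [(1, 1), (2, 2), (3, 1), (10, 10)]
print(centroid(cluster))   # pulled toward the outlier, not an actual member
print(medoid(cluster))     # always an actual member of the cluster
```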

Advanced Analytical Theory and Methods Clustering- Overview, K means- Use cases, Overview of methods, determining number of clusters, diagnostics, reasons to choose and cautions.

K-means Algorithm: Given a collection of objects, each with n measurable attributes, and a chosen value k that is the number of clusters, the algorithm identifies the k clusters of objects based on the objects' proximity to the centers of the k groups. The algorithm is iterative: on each pass, every center is adjusted to the mean of its cluster's n-dimensional attribute vectors.

Use Cases: Clustering is often used as a lead-in to classification, where labels are applied to the identified clusters. Some applications: Image processing: with security images, successive frames are examined for change. Medical: patients can be grouped to identify naturally occurring clusters. Customer segmentation: marketing and sales groups identify customers having similar behaviors and spending patterns.

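K-Means Example: Given {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2, randomly assign the initial means m1 = 3, m2 = 4.
Iteration 1: K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
Iteration 2: K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
Iteration 3: K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
Iteration 4: K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
Stop: reassigning the points with these means produces the same clusters.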
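The iterations of this example can be reproduced with a short script. It is a minimal sketch for one-dimensional data, not a general implementation: the function name is made up, ties are assumed to go to the lower-numbered cluster, and no cluster is assumed to become empty.

```python
def kmeans_1d(values, means):
    """Alternate assignment and mean update until the clusters stop changing."""
    previous = None
    while True:
        clusters = [[] for _ in means]
        for v in values:
            # Index of the nearest mean; ties go to the lower index.
            nearest = min(range(len(means)), key=lambda i: abs(v - means[i]))
            clusters[nearest].append(v)
        if clusters == previous:
            return clusters, means
        # Assumes every cluster is non-empty, which holds for this data.
        means = [sum(c) / len(c) for c in clusters]
        print(clusters, means)
        previous = clusters

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(data, [3, 4])
```

Each printed line corresponds to one iteration of the example; the members appear in input order rather than sorted.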

K-means Method: Four Steps
1. Choose the value of k and the initial guesses for the centroids.
2. Compute the distance from each data point to each centroid, and assign each point to the closest centroid.
3. Compute the centroid of each newly defined cluster from step 2.
4. Repeat steps 2 and 3 until the algorithm converges (no changes occur).
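The four steps map almost directly onto code. The following is a minimal sketch under some assumptions: points are tuples of numbers, the initial centroids are supplied by the caller (step 1), and an empty cluster simply keeps its previous centroid. The function name and the sample data are illustrative.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def kmeans(points, initial_centroids, max_iter=100):
    """Plain k-means starting from the given initial centroids (step 1)."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Hypothetical 2-D data with two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = kmeans(points, [(1, 1), (8, 8)])
print(centroids)
print(assignment)
```

The `max_iter` cap is a safeguard added for the sketch, so that the loop always terminates.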

K-means Method in Two Dimensions: Example, Step 1. Choose the value of k and the k initial guesses for the centroids. In this example, k = 3, and the initial centroids are indicated by the points shaded in red, green, and blue.

K-means Method in Two Dimensions: Example, Step 2. Points are assigned to the closest centroid. In two dimensions, the distance d between any two points $(x_1, y_1)$ and $(x_2, y_2)$ is given by the Euclidean distance measure: $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
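As a small illustration of this step, the snippet below computes the Euclidean distance in two dimensions and picks the closest of three centroids for one point. The centroid coordinates and the point are invented for the sketch; only the red, green, and blue labels come from the example.

```python
from math import hypot

def distance(p, q):
    """Euclidean distance between two 2-D points given as (x, y)."""
    return hypot(p[0] - q[0], p[1] - q[1])

# Hypothetical centroid positions for the three colors.
centroids = {"red": (1.0, 1.0), "green": (5.0, 5.0), "blue": (9.0, 1.0)}
point = (4.0, 5.0)

# Assign the point to the centroid with the smallest distance.
closest = min(centroids, key=lambda name: distance(point, centroids[name]))
print(closest)
```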

K-means Method in Two Dimensions: Example, Step 3. Compute the centroids of the new clusters. In two dimensions, the centroid $(X_c, Y_c)$ of the m points $(X_i, Y_i)$ in a cluster is calculated as $(X_c, Y_c) = \left( \frac{\sum_{i=1}^{m} X_i}{m}, \frac{\sum_{i=1}^{m} Y_i}{m} \right)$
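A quick worked instance may help. Assume, purely for illustration, a cluster of m = 3 points (1, 2), (3, 4), and (5, 9):

```latex
(X_c, Y_c) = \left( \frac{1 + 3 + 5}{3}, \frac{2 + 4 + 9}{3} \right) = (3, 5)
```

The resulting centroid (3, 5) is not one of the three points, which is typical.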

K-means Method in Two Dimensions: Example, Step 4. Repeat steps 2 and 3 until convergence. The algorithm is considered to have converged when the centroids no longer change or when they oscillate back and forth. Oscillation can occur when one or more points are equidistant from two centroids. Videos: http://www.youtube.com/watch?v=aiJ8II94qck https://class.coursera.org/ml-003/lecture/78

K-means in n Dimensions: To generalize the prior algorithm to n dimensions, suppose there are M objects, where each object is described by n attributes or property values $(p_1, p_2, \ldots, p_n)$. Object i is then described by $(p_{i1}, p_{i2}, \ldots, p_{in})$ for $i = 1, 2, \ldots, M$. For a given point $p_i$ at $(p_{i1}, p_{i2}, \ldots, p_{in})$ and a centroid q located at $(q_1, q_2, \ldots, q_n)$, the distance d between $p_i$ and q is $d(p_i, q) = \sqrt{\sum_{j=1}^{n} (p_{ij} - q_j)^2}$. The centroid q of a cluster of m points $(p_{i1}, p_{i2}, \ldots, p_{in})$, $i = 1, \ldots, m$, is calculated as $(q_1, q_2, \ldots, q_n) = \left( \frac{\sum_{i=1}^{m} p_{i1}}{m}, \frac{\sum_{i=1}^{m} p_{i2}}{m}, \ldots, \frac{\sum_{i=1}^{m} p_{in}}{m} \right)$
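Both n-dimensional formulas translate into one-liners. The sketch below assumes objects are stored as equal-length tuples; the four-attribute sample data is made up. For the distance, Python's built-in `math.dist` computes the same quantity.

```python
from math import sqrt

def distance(p, q):
    """Square root of the summed squared differences over all n attributes."""
    return sqrt(sum((pj - qj) ** 2 for pj, qj in zip(p, q)))

def centroid(points):
    """Attribute-by-attribute mean of the m points."""
    m = len(points)
    return tuple(sum(column) / m for column in zip(*points))

# Hypothetical objects with n = 4 attributes each.
objects = [(1.0, 2.0, 3.0, 4.0), (3.0, 2.0, 1.0, 0.0), (2.0, 5.0, 2.0, 2.0)]
q = centroid(objects)
d = distance(objects[0], q)
print(q, d)
```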

Determining the Number of Clusters: k clusters can be identified in a given dataset, but what value of k should be selected? The value of k can be chosen based on a reasonable guess or some predefined requirement. Even then, how can one tell whether k clusters is better or worse than k - 1 or k + 1 clusters? Solution: use a heuristic, e.g., the Within Sum of Squares (WSS). The WSS metric is the sum of the squares of the distances between each data point and its closest centroid. The process of identifying the appropriate value of k is referred to as finding the "elbow" of the WSS curve.

Determining the Number of Clusters (WSS Method): Run the clustering algorithm (e.g., k-means clustering) for different values of k, for instance by varying k from 1 to 10 clusters. For each k, calculate the total within-cluster sum of squares (WSS). Plot the curve of WSS against the number of clusters k. The location of a bend (knee) in the plot is generally considered an indicator of the appropriate number of clusters. $WSS = \sum_{k} \sum_{x_i \in C_k} (x_i - \mu_k)^2$, where the outer sum runs over the clusters, $x_i$ is a data point belonging to the cluster $C_k$, and $\mu_k$ is the mean value of the points assigned to the cluster $C_k$.
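The procedure can be sketched as follows. This is an illustrative, standard-library-only version: the data points are invented, the starting centroids are drawn at random from the data with a fixed seed, and the values are printed rather than plotted. A real analysis would normally plot the curve and rerun each k from several starts.

```python
import random
from math import dist

def kmeans(points, k, rng, max_iter=100):
    """Basic k-means; the initial centroids are k points drawn at random."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

def wss(points, centroids):
    """Sum of squared distances from each point to its closest centroid."""
    return sum(min(dist(p, c) ** 2 for c in centroids) for p in points)

# Hypothetical data: three loose groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
rng = random.Random(0)
curve = {k: wss(points, kmeans(points, k, rng)) for k in range(1, 7)}
for k, value in curve.items():
    print(k, round(value, 2))
```

The elbow is the value of k after which the printed WSS stops dropping sharply.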

Determining Number of Clusters Example of WSS vs #Clusters curve The elbow of the curve appears to occur at k = 3.

Diagnostics: When the number of clusters is small, plotting the data helps refine the choice of k. The following questions should be considered: Are the clusters well separated from each other? Do any of the clusters have only a few points? Do any of the centroids appear to be too close to each other?

Diagnostics Example of distinct clusters

Diagnostics Example of less obvious clusters

Diagnostics Six clusters from points of previous figure

Reasons to Choose and Cautions: Decisions the practitioner must make: What object attributes should be included in the analysis? What unit of measure should be used for each attribute? Do the attributes need to be rescaled? What other considerations might apply?

Reasons to Choose and Cautions: Object Attributes. It is important to understand which attributes will be known at the time a new object is assigned to a cluster. E.g., information on existing customers' satisfaction or purchase frequency may be available, but such information may not be available for potential customers. Similarly, the age and income of existing customers may be known but unavailable for new customers. It is best to reduce the number of attributes when possible: too many attributes minimize the impact of key variables. Identify highly correlated attributes for reduction. Combine several attributes into one, e.g., debt/asset ratio.

Reasons to Choose and Cautions Object attributes: scatterplot matrix for seven attributes

Reasons to Choose and Cautions Units of Measure K-means algorithm will identify different clusters depending on the units of measure k = 2

Reasons to Choose and Cautions Units of Measure Age dominates k = 2

Reasons to Choose and Cautions: Rescaling. Rescaling can reduce the domination effect, e.g., by dividing each variable by the appropriate standard deviation. Figure: rescaled attributes, k = 2.
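A minimal sketch of this rescaling, assuming each attribute is divided by its own sample standard deviation and that no attribute is constant (a constant column would cause a division by zero). The customer records and the choice of attributes are hypothetical.

```python
from statistics import stdev

def rescale(rows):
    """Divide every attribute (column) by its sample standard deviation."""
    columns = list(zip(*rows))
    scales = [stdev(col) for col in columns]  # assumes no column is constant
    return [tuple(v / s for v, s in zip(row, scales)) for row in rows]

# Hypothetical records: (age in years, height in metres).
# Unscaled, age differences dwarf height differences in any distance.
rows = [(25, 1.60), (40, 1.75), (55, 1.90), (32, 1.68)]
scaled = rescale(rows)
for row in scaled:
    print(tuple(round(v, 3) for v in row))
```

After rescaling, every attribute has a standard deviation of 1, so no single attribute dominates the distance calculation merely because of its units.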

Reasons to Choose and Cautions: Additional Considerations. K-means is sensitive to the starting seeds, so it is important to rerun it with several seeds (R has the nstart option). Distance metrics other than Euclidean could be explored, e.g., Manhattan, Mahalanobis, etc. K-means is easily applied to numeric data but does not work well with nominal attributes, e.g., color.
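The effect of rerunning from several starts can be sketched in plain Python: run k-means from different random starting centroids and keep the run with the lowest WSS, which is roughly the idea behind a multiple-start option such as nstart. The function names, seed handling, and data below are assumptions made for the illustration.

```python
import random
from math import dist

def kmeans_wss(points, k, rng, max_iter=100):
    """One k-means run from random starting points; returns (WSS, centroids)."""
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    total = sum(min(dist(p, c) ** 2 for c in centroids) for p in points)
    return total, centroids

def best_of(points, k, n_starts, seed=0):
    """Run k-means n_starts times and keep the run with the lowest WSS."""
    rng = random.Random(seed)
    runs = [kmeans_wss(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[0])

# Hypothetical data: three loose groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 8), (2, 9), (1, 9)]
single_wss, _ = kmeans_wss(points, 3, random.Random(0))
best_wss, best_centroids = best_of(points, 3, n_starts=10, seed=0)
print(single_wss, best_wss)
```

Because the first of the ten runs uses the same seed as the single run, the best-of-ten result can never be worse than the single run.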

Additional Algorithms: K-modes clustering, kmod(). Partitioning Around Medoids (PAM), pam(). Hierarchical agglomerative clustering, hclust().

Summary: Clustering analysis groups similar objects based on the objects' attributes. To use k-means properly, it is important to: properly scale the attribute values to avoid domination; ensure that the concept of distance between the assigned values of an attribute is meaningful; carefully choose the number of clusters, k. Once the clusters are identified, it is often useful to label them in a descriptive way.
