 # STATISTIC & INFORMATION THEORY (CSNB134) MODULE 2 NUMERICAL DATA REPRESENTATION.



Recap: Module 1

In Module 1, we learned several techniques for describing data with graphs and charts. Is that effective? Graphs and charts are effective at giving an overall view of a situation or a population. However, they cannot give precise information for inferential purposes (to infer is to draw conclusions). In fact, graphs and charts may not suit every case (e.g. how would you describe a student's results for this semester?).

Describing Data with Numerical Measures

Numerical measures can be created for both populations and samples.

- A parameter is a numerical descriptive measure calculated for a population.
- A statistic is a numerical descriptive measure calculated for a sample.

It is best to describe data by using both numerical and graphical representations whenever possible.

Arithmetic Mean or Average

The mean of a set of measurements is the sum of the measurements divided by the total number of measurements (i.e. the average):

$$\bar{x} = \frac{\sum x_i}{n}$$

where $n$ is the number of measurements and $\sum x_i$ is the sum of all measurements.

Example

- The set: 2, 9, 11, 5, 6. Mean = (2 + 9 + 11 + 5 + 6) / 5 = 33 / 5 = 6.6.
- When do you often use the mean? When the measurements of the overall population follow a normal distribution, e.g. height, weight, income etc.
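The definition of the mean can be sketched in a few lines of Python. The function name `mean` and the empty-input check are illustrative choices, not part of the module.

```python
def mean(values):
    """Arithmetic mean: sum of the measurements divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [2, 9, 11, 5, 6]  # the example set from the slide
print(mean(data))        # 33 / 5 = 6.6
```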

Median

- The median of a set of measurements is the middle measurement when the measurements are ranked from smallest to largest.
- Once the measurements have been ordered, the position of the median is 0.5(n + 1).

Example

- The set: 2, 4, 9, 8, 6, 5, 3 (n = 7)
  - Sort: 2, 3, 4, 5, 6, 8, 9
  - Position: 0.5(n + 1) = 0.5(7 + 1) = 4th
  - Median = 5 (i.e. the 4th ordered measurement)
- The set: 2, 4, 9, 8, 6, 5 (n = 6)
  - Sort: 2, 4, 5, 6, 8, 9
  - Position: 0.5(n + 1) = 0.5(6 + 1) = 3.5th
  - Median = (5 + 6)/2 = 5.5, the average of the 3rd and 4th measurements
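The position rule 0.5(n + 1) translates fairly directly into code. The following is a minimal sketch; the function name `median` and the handling of a fractional position (averaging the two neighbouring measurements) follow the worked example above rather than any particular library.

```python
def median(values):
    """Median via the 0.5(n + 1) position rule on the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    position = 0.5 * (n + 1)            # 1-based position in the sorted list
    lower = ordered[int(position) - 1]  # measurement at or just below the position
    if position == int(position):       # odd n: the position is a whole number
        return lower
    upper = ordered[int(position)]      # even n: average the two middle values
    return (lower + upper) / 2


print(median([2, 4, 9, 8, 6, 5, 3]))  # 5
print(median([2, 4, 9, 8, 6, 5]))     # 5.5
```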

Mode

The mode is the measurement that occurs most frequently.

- The set: 2, 4, 9, 8, 8, 5, 3. The mode is 8, which occurs twice.
- The set: 2, 2, 9, 8, 8, 5, 3. There are two modes, 8 and 2 (bimodal).
- The set: 2, 4, 9, 8, 5, 3. There is no mode (each value is unique).
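A small sketch of mode-finding that covers the three cases above (one mode, two modes, no mode). The function name `modes` is hypothetical, and returning an empty list when every value is unique is an assumption chosen to match the "no mode" convention used here.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:  # every value is unique, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```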

Example

Calculate the mean, median and mode for the number of quarts of milk purchased by the following 25 households:

0 0 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 4 4 4 5

Mean? Median? Mode? (The mode is the highest peak of the distribution.)
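One way to check an answer to this exercise is with Python's standard `statistics` module. This is only a sketch for verification; the variable names are illustrative.

```python
import statistics

# Quarts of milk purchased by the 25 households
quarts = [0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,
          3, 3, 3, 3, 3, 4, 4, 4, 5]

milk_mean = statistics.mean(quarts)      # 55 / 25
milk_median = statistics.median(quarts)  # the 13th ordered value
milk_mode = statistics.mode(quarts)      # the most frequent value

print(len(quarts), milk_mean, milk_median, milk_mode)
```

With 25 values the median sits at position 0.5(25 + 1) = 13, and the value 2 occurs nine times, more than any other.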

Measures of Center

A measure along the horizontal axis of the data distribution that locates the center of the distribution. What do you use as a measure of center? (a) Mean? (b) Median? (c) Mode?

Extreme Values

- The mean is more easily affected by extremely large or small values than the median.
- The median is often used as a measure of center when the distribution is skewed.

Extreme Values (cont.)

- Skewed left: Mean < Median
- Skewed right: Mean > Median
- Symmetric: Mean = Median

Skewed Right (Positively Skewed)

Skewed right: long tail to the right. A few high numbers pull the mean above the median.

The set:

| Num. | Frequency |
|------|-----------|
| 1    | 3         |
| 2    | 5         |
| 3    | 3         |
| 4    | 1         |

Mean = [1(3) + 2(5) + 3(3) + 4(1)] / 12 = 2.17

Median = 2

Mean > Median

Skewed Left (Negatively Skewed)

Skewed left: long tail to the left. A few low numbers pull the mean below the median.

The set:

| Num. | Frequency |
|------|-----------|
| 1    | 1         |
| 2    | 3         |
| 3    | 5         |
| 4    | 3         |

Mean = [1(1) + 2(3) + 3(5) + 4(3)] / 12 = 2.83

Median = 3

Mean < Median
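The two frequency tables can be checked with a short sketch that expands each table into raw measurements and compares the mean with the median. The helper names `expand` and `describe_skew` are hypothetical, and the mean-versus-median comparison is the rule of thumb from these slides, not a formal test of skewness.

```python
from statistics import mean, median


def expand(freq_table):
    """Turn {value: frequency} into a sorted list of raw measurements."""
    return [value
            for value, count in sorted(freq_table.items())
            for _ in range(count)]


def describe_skew(freq_table):
    """Return (mean, median, label) using the mean-vs-median rule of thumb."""
    data = expand(freq_table)
    m, med = mean(data), median(data)
    if m > med:
        label = "skewed right"
    elif m < med:
        label = "skewed left"
    else:
        label = "symmetric"
    return m, med, label


right = {1: 3, 2: 5, 3: 3, 4: 1}  # the skewed-right table
left = {1: 1, 2: 3, 3: 5, 4: 3}   # the skewed-left table

for table in (right, left):
    m, med, label = describe_skew(table)
    print(round(m, 2), med, label)
```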

Measures of Centre vs. Variability

"I was told that the average height of plants here is only 1 foot. But this tree is 10 feet high!"

Often, a measure of centre alone does not give the true picture. We also need a measure of variability about the centre.

Measures of Variability A measure along the horizontal axis of the data distribution that describes the spread of the distribution from the center.

The Range

- The range, R, of a set of n measurements is the difference between the largest and smallest measurements.
- Example: A botanist records the number of petals on 5 flowers: 5, 12, 6, 8, 14
- The range is R = 14 - 5 = 9
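In code the range is a one-line computation. The function name `value_range` is an illustrative choice (it avoids shadowing Python's built-in `range`).

```python
def value_range(values):
    """Range R: largest measurement minus smallest measurement."""
    return max(values) - min(values)


petals = [5, 12, 6, 8, 14]  # the botanist's petal counts
print(value_range(petals))  # 14 - 5 = 9
```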

The Variance

The variance is a measure of variability that uses all the measurements (as opposed to the range R, which uses only two: the maximum and the minimum). It measures the average squared deviation of the measurements from their mean.

Flower petals: 5, 12, 6, 8, 14

[Figure: the petal counts plotted on an axis from 4 to 14]

The Variance

- The variance of a population of N measurements is the average of the squared deviations of the measurements about their mean μ:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

- The variance of a sample of n measurements is the sum of the squared deviations of the measurements about their mean, divided by (n - 1):

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

The Standard Deviation

In calculating the variance, we squared all of the deviations, and in doing so changed the scale of the measurements. To return this measure of variability to the original units of measure, we calculate the standard deviation, the positive square root of the variance:

- Population standard deviation: $\sigma = \sqrt{\sigma^2}$
- Sample standard deviation: $s = \sqrt{s^2}$
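The population and sample versions differ only in the divisor, which a short sketch makes explicit. The function names are hypothetical; the data are the flower petal counts used in this module, treated first as a whole population and then as a sample purely for illustration.

```python
import math


def population_variance(values):
    """Average squared deviation about the mean (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two measurements")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


petals = [5, 12, 6, 8, 14]
print(population_variance(petals))         # 60 / 5 = 12.0
print(sample_variance(petals))             # 60 / 4 = 15.0
print(math.sqrt(sample_variance(petals)))  # standard deviation, in petals
```

Taking the square root brings the result back to the original unit of measure, here a number of petals.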

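2 Ways to Calculate the Sample Variance

Use the definition formula:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

| $x_i$ | $x_i - \bar{x}$ | $(x_i - \bar{x})^2$ |
|-------|-----------------|---------------------|
| 5     | -4              | 16                  |
| 12    | 3               | 9                   |
| 6     | -3              | 9                   |
| 8     | -1              | 1                   |
| 14    | 5               | 25                  |
| Sum: 45 | 0             | 60                  |

With $\bar{x} = 45/5 = 9$, this gives $s^2 = 60/4 = 15$ and $s = \sqrt{15} \approx 3.87$.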

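2 Ways to Calculate the Sample Variance

Use the calculational formula:

$$s^2 = \frac{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}}{n - 1}$$

| $x_i$ | $x_i^2$ |
|-------|---------|
| 5     | 25      |
| 12    | 144     |
| 6     | 36      |
| 8     | 64      |
| 14    | 196     |
| Sum: 45 | 465   |

This gives $s^2 = (465 - 45^2/5)/4 = (465 - 405)/4 = 15$ and $s = \sqrt{15} \approx 3.87$.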
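Both routes to the sample variance can be compared in code. This is a minimal sketch with hypothetical function names; `statistics.variance` from the Python standard library is used only as an independent check.

```python
import math
from statistics import variance


def variance_by_definition(values):
    """Sum of squared deviations about the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def variance_by_shortcut(values):
    """Calculational formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


petals = [5, 12, 6, 8, 14]
print(variance_by_definition(petals))  # 60 / 4
print(variance_by_shortcut(petals))    # (465 - 405) / 4
print(math.isclose(variance_by_shortcut(petals), variance(petals)))
```

The shortcut needs only the running sums of x and x², which is convenient by hand, though in floating-point arithmetic it can lose precision when the mean is large relative to the spread.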

Some Notes

- The value of s is never negative (it is zero only when all the measurements are equal).
- The larger the value of s² or s, the larger the variability of the data set.
- Why divide by n - 1? The sample standard deviation s is often used to estimate the population standard deviation σ. Dividing by n - 1 gives us a better estimate of σ.

Exercise 1

1. Question: Find the mean, median and mode of: 5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25

   Solution: First, arrange the data: 3, 4, 5, 5, 5, 6, 6, 6, 7, 8, 25. Median = 6; mean = 80/11 = 7.27; modes = 5 and 6.

2. Question: Eliminate the last observation, x = 25, and then find the mean, median and mode. How do these values compare with those found using the full data set?

   Solution: Median = 5.5; mean = 55/10 = 5.5; modes = 5 and 6. The mean is smaller.

3. Question: How do possible outliers (such as 25) affect these values?

   Solution: The mean is strongly affected by the outlier, while the median and mode are affected much less.
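The effect of the outlier in Exercise 1 can be reproduced with the standard `statistics` module. This is a sketch for checking the solutions; the helper name `summarize` is hypothetical.

```python
from statistics import mean, median, multimode


def summarize(values):
    """Return (mean, median, sorted modes) for a list of measurements."""
    return mean(values), median(values), sorted(multimode(values))


full = [5, 7, 3, 5, 6, 8, 5, 6, 4, 6, 25]
trimmed = [x for x in full if x != 25]  # drop the outlier

for label, data in (("with 25", full), ("without 25", trimmed)):
    m, med, modes_found = summarize(data)
    print(label, round(m, 2), med, modes_found)
```

Removing the single value 25 moves the mean from about 7.27 to 5.5, while the median shifts only from 6 to 5.5 and the modes stay at 5 and 6.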

Exercise 2

Given the observations 7, 9, 10, 6, 8, 7, 8, 9, 8, calculate:

1. The range. Solution: R = 10 - 6 = 4
2. The mean. Solution: Mean = 72 / 9 = 8
3. The variance. Solution: Variance = [588 - (72² / 9)] / 8 = 12 / 8 = 1.5
4. The standard deviation. Solution: Standard deviation = √1.5 = 1.225
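The four answers to Exercise 2 can be verified with a few standard-library calls. This is only a checking sketch; the variable names are illustrative, and `statistics.variance` and `statistics.stdev` use the sample (n - 1) divisor, matching the solution above.

```python
from statistics import mean, stdev, variance

observations = [7, 9, 10, 6, 8, 7, 8, 9, 8]

obs_range = max(observations) - min(observations)  # 10 - 6
obs_mean = mean(observations)                      # 72 / 9
obs_variance = variance(observations)              # sample variance, divisor n - 1
obs_stdev = stdev(observations)                    # square root of the variance

print(obs_range, obs_mean, obs_variance, round(obs_stdev, 3))
```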
