Exploratory Data Analysis


1 Exploratory Data Analysis
Chapter 3: Exploratory Data Analysis
3.1 Graphical Displays of Data
3.2 Measures of Central Tendency
3.3 Measures of Dispersion

2 3.1 Graphical Displays of Data
Most of the statistical information in newspapers, magazines, company reports and other publications consists of data that are summarized and presented in a form that is easy for the reader to understand.

3 3.1 Graphical Displays of Data
Presentation of Qualitative Data
A graphic display can reveal at a glance the main characteristics of a data set. The choice of display depends on the nature of the data: quantitative (e.g. income and CGPA) or qualitative (e.g. gender and ethnic group). Three types of graphs are used to display qualitative data:
- bar graph / column chart
- pie chart
- line chart

4 3.1 Graphical Displays of Data
Presentation of Qualitative Data

5 3.1 Graphical Displays of Data
Bar Chart: A bar chart is used to display a frequency distribution in graphical form. It consists of two orthogonal axes: one axis represents the observations, while the other represents the frequency of the observations. The frequency of each observation is represented by a bar.

6 3.1 Graphical Displays of Data
Pie Chart: A pie chart is used to display a frequency distribution as proportions of the whole. It is a circle divided into sectors. Each sector represents an observation, and the area of the sector represents the proportion of the frequency of that observation.
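As a minimal sketch of the numbers behind a bar chart and a pie chart, the Python snippet below counts how often each category occurs and converts the counts to proportions. The sample values in `ethnic_group` and the helper name `frequency_table` are hypothetical, made up for illustration only.

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (frequency, proportion)} for qualitative data."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {category: (count, count / total) for category, count in counts.items()}

# Hypothetical sample, not taken from the slides.
ethnic_group = ["A", "B", "A", "C", "A", "B", "A", "C"]
table = frequency_table(ethnic_group)

for category, (freq, prop) in sorted(table.items()):
    # freq is the height of the bar; prop is the share of the pie,
    # so the sector angle is prop * 360 degrees.
    print(category, freq, f"{prop:.0%}", f"{prop * 360:.0f} degrees")
```

The frequencies feed the bar chart directly, while the proportions (which sum to 1) determine the size of each pie sector.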

7 3.1 Graphical Displays of Data
Line Chart: A line chart is used to display the trend of observations. It consists of two orthogonal axes: one axis represents the observations, while the other represents the frequency of the observations. The frequencies of the observations are joined by lines. Example: The table below shows the number of sandpipers recorded from January 1989 to December 1989.

8 3.1 Graphical Displays of Data
Presentation of Quantitative Data
Several graphs are available for the graphical presentation of quantitative data:
- Frequency polygon
- Histogram
- Ogive
- Boxplot (will be our focus in this chapter)

9 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Histogram
A histogram looks like a bar chart, except that the horizontal axis represents data that are quantitative in nature. There are no gaps between the bars.

10 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Frequency Polygon
A frequency polygon looks like a line chart, except that the horizontal axis represents the class marks of data that are quantitative in nature.

11 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Ogive
An ogive is a line graph in which the horizontal axis represents the upper limits of the class intervals and the vertical axis represents the cumulative frequencies.
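The histogram, frequency polygon, and ogive can all be drawn from one grouped frequency table. The sketch below, in Python, builds such a table under stated assumptions: equal-width classes of the form [lower, upper), a hypothetical list `marks`, and a hypothetical helper `grouped_frequencies`. None of these come from the slides.

```python
from itertools import accumulate

def grouped_frequencies(data, lower, width, n_classes):
    """Tabulate data into n_classes equal-width classes starting at `lower`.

    Each class is the half-open interval [lo, hi); values outside the
    covered span are ignored.
    """
    freqs = [0] * n_classes
    for x in data:
        idx = int((x - lower) // width)
        if 0 <= idx < n_classes:
            freqs[idx] += 1
    rows = []
    for i, (freq, cum) in enumerate(zip(freqs, accumulate(freqs))):
        lo = lower + i * width
        hi = lo + width
        rows.append({
            "class": (lo, hi),
            "mark": (lo + hi) / 2,  # class mark: x-value for the frequency polygon
            "freq": freq,           # bar height for the histogram
            "cum_freq": cum,        # plotted against the upper limit `hi` for the ogive
        })
    return rows

# Hypothetical data, not taken from the slides.
marks = [52, 67, 71, 48, 85, 90, 63, 77, 58, 66]
rows = grouped_frequencies(marks, lower=40, width=10, n_classes=6)
for row in rows:
    print(row["class"], row["mark"], row["freq"], row["cum_freq"])
```

The last cumulative frequency equals the number of observations that fell inside the classes, which is a quick check that the table is complete.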

12 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Boxplot
The box plot (a.k.a. box-and-whisker diagram) is a standardized way of displaying the distribution of data based on the five-number summary: minimum, first quartile, median, third quartile, and maximum.

13 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Boxplot
A boxplot divides the data set into fourths, or four equal parts.

14 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Boxplot
How to obtain quartiles?
- Q2: the median
- Q1: the median of the values between the lowest value and Q2
- Q3: the median of the values between Q2 and the largest value
Examples
Odd set of numbers: (1, 2, 5, 6, 7), 9, (12, 15, 18, 19, 27)
- Q1 = 5, Q2 = 9, Q3 = 18
Even set of numbers: (3, 5, 7, 8, 9), (11, 15, 16, 20, 21)
- Q1 = 7, Q2 = (9 + 11)/2 = 10, Q3 = 16
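The median-of-halves rule above can be sketched in Python as follows. The function name `quartiles` is hypothetical, and the sketch assumes that for an odd number of values the middle value is left out of both halves, as in the odd example. Other textbooks and software use different quartile conventions and may return slightly different values.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd count, the middle value is Q2 and belongs to neither half.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    upper = values[half + 1:] if n % 2 else values[half:]
    return median(lower), median(values), median(upper)

# The two example sets from the slide.
print(quartiles([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27]))  # (5, 9, 18)
print(quartiles([3, 5, 7, 8, 9, 11, 15, 16, 20, 21]))     # (7, 10.0, 16)
```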

15 3.1 Graphical Displays of Data
Presentation of Quantitative Data Boxplot

16 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Boxplot
Outlier: an extreme observation. Outliers can occur because of errors in measuring a variable, errors during data entry, or errors in sampling.

17 3.1 Graphical Displays of Data
Presentation of Quantitative Data: Boxplot
Outlier: checking for outliers by using quartiles
- Step 1: Determine the first and third quartiles of the data.
- Step 2: Compute the interquartile range (IQR = Q3 - Q1).
- Step 3: Determine the fences. Fences serve as cut-off points for determining outliers.
- Step 4: If a data value is less than the lower fence or greater than the upper fence, it is considered an outlier.
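The four steps can be sketched in Python as below. The slides do not say how the fences are computed, so this sketch assumes the common convention of lower fence = Q1 - 1.5 × IQR and upper fence = Q3 + 1.5 × IQR. The function name `find_outliers`, the multiplier `k`, and the added value 60 in the sample are all hypothetical.

```python
from statistics import median

def find_outliers(data, k=1.5):
    """Return ((lower_fence, upper_fence), outliers) using quartile fences.

    Assumes fences at Q1 - k*IQR and Q3 + k*IQR, with k = 1.5 by default.
    """
    values = sorted(data)
    n = len(values)
    half = n // 2
    # Step 1: first and third quartiles (median of each half).
    q1 = median(values[:half])
    q3 = median(values[half + 1:] if n % 2 else values[half:])
    # Step 2: interquartile range.
    iqr = q3 - q1
    # Step 3: fences.
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Step 4: values beyond either fence are flagged as outliers.
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return (lower_fence, upper_fence), outliers

# The odd example set from the quartile slide plus one hypothetical extreme value.
fences, outliers = find_outliers([1, 2, 5, 6, 7, 9, 12, 15, 18, 19, 27, 60])
print(fences, outliers)  # (-14.0, 38.0) [60]
```

A flagged value is only a candidate for closer inspection; whether it is a genuine error still has to be judged from how the data were collected.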

18 How to: Boxplot Interpretation
Chapter 3.1
How to: Boxplot Interpretation

19 The Basics
A boxplot splits the data set into quartiles. It consists of a minimum value, a box running from the first quartile (Q1) to the third quartile (Q3) with the median marked inside, and a maximum value. Outliers are plotted separately as points on the chart.

20 Interpreting Boxplot
Things that can be described on a boxplot:
- The five-number summary
- Range of the boxplot
- The IQR
- Shape of the data
More than one boxplot:
- Compare their shape and position

21 Interpreting Boxplot The five-number summary
- Minimum = -25
- First Quartile = 300
- Second Quartile / Median = 400
- Third Quartile = 600
- Maximum = 1000

22 Interpreting Boxplot Range
In the boxplot above, data values ranged from about -700 (the smallest outlier) to 1700 (the largest outlier), so the range is about 1700 - (-700) = 2400. If you ignore outliers, the range is illustrated by the distance between the opposite ends of the whiskers: about 1000 in the boxplot above.

23 Interpreting Boxplot Interquartile Range (IQR)
In the boxplot above, the range between the quartiles is equal to 600 - 300, or about 300.
- Based on Q1, we know that 25% of the data has a value below 300 and 75% of the data has a value above 300.
- Based on Q2, we know that half of the data has a value less than 400.
- Based on Q3, we know that 75% of the data has a value below 600 and 25% of the data has a value above 600.
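The range and IQR arithmetic for this boxplot can be checked with a few lines of Python. The numbers are the approximate values quoted on these slides for the five-number summary and the two extreme outliers; the variable names are made up for this sketch.

```python
# Approximate values read off the boxplot, as quoted on the slides.
minimum, q1, q2, q3, maximum = -25, 300, 400, 600, 1000
smallest_outlier, largest_outlier = -700, 1700

full_range = largest_outlier - smallest_outlier  # range including outliers
whisker_range = maximum - minimum                # range ignoring outliers
iqr = q3 - q1                                    # spread of the middle 50% of the data

print(full_range, whisker_range, iqr)  # 2400 1025 300
```

The whisker-to-whisker distance of 1025 is what the slide rounds to "about 1000".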

24 Interpreting Boxplot Shape of the data
Boxplots often provide information about the shape of a data set. The examples below show some common patterns

25 Interpreting Boxplot Shape of the data
For our case, the boxplot is skewed to the right

26 Interpreting More Than One Boxplot
The second boxplot is comparatively short. This suggests that the data in the second boxplot have a small variance (most of the data have similar values).

27 Interpreting More Than One Boxplot
The first and third boxplots are comparatively tall. This suggests that the variance for these boxplots is high (most of the data do not have similar values).

28 Interpreting More Than One Boxplot
The third boxplot is much higher than the fourth boxplot. This could suggest a difference in values between the groups. As can be seen, almost 75% of the data in the third boxplot have higher values than the data in the fourth boxplot.

29 Interpreting More Than One Boxplot
There are obvious differences in variance between the first and second boxplots, and between the second and third boxplots.

30 Interpreting More Than One Boxplot
Same median, different distribution. Look at the first, second and third boxplots. Their medians are all at the same place. We know that for each of the three boxplots, about half of the data falls below Q2, which sits at the same value in all three. However, they show differences in variance.

31 Exercise
Describe each boxplot.
Compare the boxplots. What can you say?

32 3.2 Measures of Central Tendency
Fig 3.6.5
A measure of central tendency is a summary statistic used to summarize a set of observations. The common measures of central tendency are:
- Mean
- Median
- Mode

33 3.2 Measures of Central Tendency
Mean: The mean of a sample is the sum of the measurements divided by the number of measurements in the set. The sample mean is denoted by $\bar{x}$ and, for $n$ measurements $x_1, \dots, x_n$, is defined by
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
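A minimal Python sketch of the sample mean follows. The function name `sample_mean` is hypothetical, and the data are borrowed, as an assumption, from the median example in this section; they are not the data of the mean example.

```python
def sample_mean(measurements):
    """Sum of the measurements divided by the number of measurements."""
    if not measurements:
        raise ValueError("the mean of an empty sample is undefined")
    return sum(measurements) / len(measurements)

# Data borrowed from the median example (an assumption for illustration).
data = [4, 6, 3, 1, 2, 5, 7, 3]
print(sample_mean(data))  # 3.875
```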

34 3.2 Measures of Central Tendency
Example The mean for this case is

35 3.2 Measures of Central Tendency
Median: The median is the middle value of a set of observations arranged in order of magnitude, and is normally denoted by a symbol such as $\tilde{x}$. The median depends on the number of observations in the data, $n$.
- If $n$ is odd, then the median is the $\frac{n+1}{2}$th observation of the ordered observations.
- If $n$ is even, then the median is the arithmetic mean of the $\frac{n}{2}$th observation and the $\left(\frac{n}{2}+1\right)$th observation.

36 3.2 Measures of Central Tendency
Example: The median of the data (4, 6, 3, 1, 2, 5, 7, 3) is 3.5. Arranged in order of magnitude, the data become 1, 2, 3, 3, 4, 5, 6, 7. As $n = 8$ (even), the median is the mean of the 4th and 5th observations, that is (3 + 4)/2 = 3.5.
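The odd/even rule for the median can be sketched in Python as follows; the function name `sample_median` is hypothetical. Python's standard `statistics.median` applies the same rule and could be used instead.

```python
def sample_median(observations):
    """Middle value of the ordered observations (mean of the two middle values if n is even)."""
    ordered = sorted(observations)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty sample is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th observation, which is index mid in 0-based indexing.
        return ordered[mid]
    # n even: mean of the n/2-th and (n/2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([4, 6, 3, 1, 2, 5, 7, 3]))  # 3.5
```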

37 3.2 Measures of Central Tendency
Mode: The mode of a set of observations is the observation with the highest frequency, and is usually denoted by a symbol such as $\hat{x}$. The mode can also be used to describe qualitative data. The mode has the advantage that it is easy to calculate and is not affected by extreme values. However, a mode may not exist, and even if it does exist, it may not be unique.

38 3.2 Measures of Central Tendency
Mode: If two measurements in a data set share the highest frequency, both are taken as modes and the data are called bimodal. If more than two measurements share the highest frequency, the data are treated as having no mode. Example: The mode of the observations 4, 6, 3, 1, 2, 5, 7, 3 is 3.
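The convention above (one mode, two modes for bimodal data, otherwise no mode) can be sketched in Python as follows. The function name `sample_modes` is hypothetical, and returning an empty list for "no mode" is a choice made for this sketch.

```python
from collections import Counter

def sample_modes(observations):
    """Return the mode(s) as a sorted list, or [] when the data have no mode.

    Follows the convention on the slide: more than two values tied for the
    highest frequency is treated as "no mode".
    """
    if not observations:
        return []
    counts = Counter(observations)
    highest = max(counts.values())
    modes = sorted(value for value, count in counts.items() if count == highest)
    return modes if len(modes) <= 2 else []

print(sample_modes([4, 6, 3, 1, 2, 5, 7, 3]))  # [3]
print(sample_modes([1, 1, 2, 2, 3]))           # [1, 2]  (bimodal)
print(sample_modes([1, 2, 3]))                 # []      (no mode)
```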

41 3.3 Measures of Dispersion
The measure of dispersion, or spread, is the degree to which a set of data tends to spread around the average value. It shows whether a data set is concentrated around the mean or scattered. The common measures of dispersion are the variance and the standard deviation. The standard deviation is the square root of the variance. The sample variance is denoted by $s^2$ and the sample standard deviation is denoted by $s$.

42 3.3 Measures of Dispersion
Range: The range is the simplest measure of dispersion to calculate.
Range = Largest value – Smallest value
Example: Range = 267,277 – 49,651 = 217,626 square miles.

43 3.3 Measures of Dispersion
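Variance: The variance of a sample (also known as the mean square) for raw (ungrouped) data is denoted by $s^2$ and defined by:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
where $\bar{x}$ is the sample mean and $n$ is the number of observations.
Example (using the previous data from Range): Range = 267,277 – 49,651 = 217,626 square miles.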

44 3.3 Measures of Dispersion
Standard deviation: It is simply the square root of the variance, $s = \sqrt{s^2}$.
Example
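A minimal Python sketch of the sample variance and standard deviation follows. It assumes the usual sample definition with divisor n - 1, and it borrows the observations from the median example, because the data behind the Range example are not listed in the text. The function names are hypothetical.

```python
from math import sqrt

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def sample_std(data):
    """Square root of the sample variance."""
    return sqrt(sample_variance(data))

# Data borrowed from the median example (an assumption for illustration).
data = [4, 6, 3, 1, 2, 5, 7, 3]
s2 = sample_variance(data)
s = sample_std(data)
print(s2, round(s, 3))  # 4.125 2.031
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n - 1 divisor and can serve as a cross-check.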

45 How to interpret Standard deviation

51 Dogs with standard height
Too High Too Short

