1 DATA DESCRIPTION

2 Units
- Unit: the entity we are studying; called a subject if it is a human being.
- Each unit/subject has certain parameters; e.g., a student (subject) has an age, weight, height, home address, number of units taken, and so on.

3 Variables
- These parameters are called variables.
- In statistics, variables are stored in columns, each variable occupying one column.

4 Cross-sectional and time-series analyses
- In a cross-sectional analysis, a unit/subject is the entity you are studying. For example, if you study the housing market in San Diego, a unit is a house, and the variables are the price, size, age, etc., of a house.
- In a time-series analysis, the unit is a unit of time, say, an hour, a day, a month, etc.

5 Data Types
- Nominal data: male/female, colors
- Ordinal data: excellent/good/bad
- Interval data: temperature, GMAT scores
- Ratio data: distance to school, price

6 Two forms
- GRAPHICAL form
- NUMERICAL SUMMARY form

7 Graphical forms
- Sequence plots
- Histograms (frequency distributions)
- Scatter plots

8 Sequence plots
- Used to describe a time series.
- The horizontal axis is always related to the sequence in which the data were collected.
- The vertical axis is the value of the variable.

9 Example: sequence plot

10 Histograms I
- A histogram (frequency distribution) shows how many values are in a certain range.
- It is used for cross-sectional analysis.
- The potential observation values are divided into groups (called classes).
- The number of observations falling into each class is called the frequency.
- When we say an observation falls into a class, we mean its value is greater than or equal to the lower bound but less than the upper bound of the class.

11 Example: histogram
A commercial bank is studying the time a customer spends in line. They recorded the waiting times (in minutes) of 28 customers:
5.9  7.6  5.3  9.7  1.6  3.5  7.4
4.0  1.6  7.3  8.2  8.4  6.5  8.9
1.1  8.6  4.3  1.2  3.3  2.1  8.4
1.1  6.7  5.0  4.5  9.4  6.3  6.4
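The class frequencies for this example can be counted with a short sketch like the one below. The class boundaries 1, 3, 5, 7, 9 and 11 are an assumption inferred from the cut-off values used in the cumulative distribution example; the names `waiting_times`, `class_bounds` and `class_frequencies` are illustrative.

```python
# Waiting times (minutes) of the 28 customers from the example.
waiting_times = [
    5.9, 7.6, 5.3, 9.7, 1.6, 3.5, 7.4, 4.0, 1.6, 7.3, 8.2, 8.4, 6.5, 8.9,
    1.1, 8.6, 4.3, 1.2, 3.3, 2.1, 8.4, 1.1, 6.7, 5.0, 4.5, 9.4, 6.3, 6.4,
]

# Assumed class boundaries: [1, 3), [3, 5), [5, 7), [7, 9), [9, 11).
class_bounds = [1, 3, 5, 7, 9, 11]


def class_frequencies(values, bounds):
    """Count the values falling in each class [lower, upper)."""
    counts = []
    for lower, upper in zip(bounds[:-1], bounds[1:]):
        # A value belongs to a class if lower <= value < upper.
        counts.append(sum(1 for v in values if lower <= v < upper))
    return counts


frequencies = class_frequencies(waiting_times, class_bounds)
print(frequencies)  # [6, 5, 7, 8, 2]
```

The half-open comparison `lower <= v < upper` mirrors the rule that an observation equal to a class's lower bound belongs to that class, so the value 5.0 is counted in the class from 5 to 7.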

12 Example: histogram

13 Histogram II
- The relative frequency distribution depicts the ratio of the frequency to the total number of observations.
- The cumulative distribution depicts the percentage of observations that are less than a specific value.

14 Example: relative frequency distribution
- A "relative frequency" distribution plots the fraction (or percentage) of observations in each class instead of the actual number. For this problem, the relative frequency of the first class is 6/28 = 0.214. The remaining relative frequencies are 0.179, 0.250, 0.286 and 0.071. A graph similar to the histogram above can then be plotted.
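As a minimal sketch, the relative frequencies can be obtained by dividing each class count by the total. The counts 6, 5, 7, 8 and 2 are the ones implied by the relative frequencies quoted above for 28 observations; the variable names are illustrative.

```python
# Class counts implied by the example (28 observations in five classes).
frequencies = [6, 5, 7, 8, 2]

total = sum(frequencies)

# Relative frequency = class frequency / total number of observations.
relative_frequencies = [f / total for f in frequencies]

print([round(r, 3) for r in relative_frequencies])
# [0.214, 0.179, 0.25, 0.286, 0.071]
```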

15 Example: cumulative distribution
- In the previous example, the fraction of observations that are less than 3 minutes is 0.214; less than 5 minutes, 0.214 + 0.179 = 0.393; less than 7 minutes, 0.214 + 0.179 + 0.25 = 0.643; less than 9 minutes, 0.214 + 0.179 + 0.25 + 0.286 = 0.929; and less than 11 minutes, 1.0.
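The running sums in this example can be reproduced with a short sketch. It assumes the five classes have upper bounds 3, 5, 7, 9 and 11 minutes, as the example suggests, and uses exact fractions of 28 rather than the rounded values; the variable names are illustrative.

```python
from itertools import accumulate

# Relative frequencies of the five classes, as exact fractions of 28.
relative_frequencies = [6 / 28, 5 / 28, 7 / 28, 8 / 28, 2 / 28]

# Assumed upper bound (in minutes) of each class.
upper_bounds = [3, 5, 7, 9, 11]

# Cumulative distribution: running sum of the relative frequencies.
cumulative = list(accumulate(relative_frequencies))

for upper, share in zip(upper_bounds, cumulative):
    print(f"less than {upper} minutes: {share:.3f}")
```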

16 Example: cumulative distribution

17 Histogram III
- The sum of all the relative frequencies is always 1.
- The cumulative distribution is non-decreasing.
- The last value of the cumulative distribution is always 1.
- A cumulative distribution can be derived from the corresponding relative frequency distribution, and vice versa.
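The "vice versa" direction can be sketched as follows: differencing consecutive cumulative values recovers the relative frequencies. The cumulative values are the rounded ones from the waiting-time example; the variable names are illustrative.

```python
# Cumulative distribution from the waiting-time example (rounded values).
cumulative = [0.214, 0.393, 0.643, 0.929, 1.0]

# The first relative frequency equals the first cumulative value; each later
# one is the difference between consecutive cumulative values.
relative = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

rounded = [round(r, 3) for r in relative]
print(rounded)  # [0.214, 0.179, 0.25, 0.286, 0.071]
```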

18 Probability
- A random variable is a variable whose values cannot be predetermined but are governed by some random mechanism.
- Although we cannot predict the value of a random variable precisely, we might be able to tell the probability of the random variable being in a certain interval.
- The relative frequency can also be read as (an estimate of) the probability of a random variable falling in the corresponding class.
- In the same sense, the relative frequency distribution is also the probability distribution.

19 Scatter plots
- A scatter plot shows the relationship between two variables.

20 Example: scatter plot
The following are the height and foot size measurements of 8 men arbitrarily selected from students in the cafeteria. Heights and foot sizes are in centimeters.

Man      1      2      3      4      5      6      7      8
Height   155    160    149    175    182    145    177    164
Foot     23.3   21.8   22.1   26.3   28.0   20.7   25.3   24.9

21 Example: scatter plot

22 Numerical Summary Forms
- Central locations: mean, median, and mode.
- Dispersion: standard deviation and variance.
- Correlation.

23 Mean
- The mean (average) is the sum of the observations divided by the number of observations.
  Observations: 27 22 26 24 27 20 23 24 18 32
- Sum = (27 + 22 + 26 + 24 + 27 + 20 + 23 + 24 + 18 + 32) = 243
- Mean = 243/10 = 24.3

24 Median
- The median is the value of the central observation (the one in the middle) when the observations are listed in ascending or descending order.
- When there is an even number of values, the median is the average of the middle two values.
- When there is an odd number of values, the median is the middle value.

25 Example: median
18 20 22 23 24 24 26 27 27 32
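A minimal sketch of the median rule applied to these ten sorted values follows. The function name `median` and the variable `observations` are illustrative; with an even number of values, the two middle ones (24 and 24) are averaged.

```python
def median(values):
    """Return the middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle observation.
        return ordered[mid]
    # Even count: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


observations = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
print(median(observations))  # 24.0
```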

26 Compare mean and median
- The median is less sensitive to outliers than the mean. Check the mean and median for the following two data sets:
  18 20 22 23 24 24 26 27 27 32
  18 20 22 23 24 24 26 27 27 320
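The check suggested here can be done with Python's standard `statistics` module. This is only a sketch; the variable names `original` and `with_outlier` are illustrative.

```python
from statistics import mean, median

original = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]

# Replacing 32 by the outlier 320 moves the mean a lot, the median not at all.
print(mean(original), median(original))          # 24.3 24.0
print(mean(with_outlier), median(with_outlier))  # 53.1 24.0
```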

27 Mode
- The mode is the most frequently occurring value (or values).
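As an illustration, the sample used in the mean example has two modes. The sketch below uses `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest count; applying it to this particular sample is a choice made here, not part of the slides.

```python
from statistics import multimode

# Sample from the mean example; 27 and 24 each occur twice.
sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

# multimode returns all most-frequent values, in order of first appearance.
modes = multimode(sample)
print(modes)  # [27, 24]
```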

28 Symmetry and skew
- A frequency distribution in which the area to the left of the mean is a mirror image of the area to the right is called a symmetrical distribution.
- A distribution that has a longer tail on the right-hand side than on the left is called positively skewed, or skewed to the right. A distribution that has a longer tail on the left is called negatively skewed.
- If a distribution is positively skewed, the mean typically exceeds the median. For a negatively skewed distribution, the mean is typically less than the median.

29 Range
- The range is the difference between the maximum and minimum values of the observations.

30 Standard deviation and variance
- The standard deviation is used to describe the dispersion of the data.
- The variance is the square of the standard deviation.

31 Calculation of S.D.
- Calculate the mean.
- Calculate the deviations of the observations from the mean.
- Calculate the squares of the deviations and sum them up.
- Divide the sum by n-1 and take the square root.

32 Example: S.D.

Sample        27     22     26     24     27     20     23     24     18     32
Deviation     2.7   -2.3    1.7   -0.3    2.7   -4.3   -1.3   -0.3   -6.3    7.7
Sq. of dev.   7.29   5.29   2.89   0.09   7.29   18.5   1.69   0.09   39.7   59.3

Sum of squares = 7.29 + 5.29 + ... + 59.3 = 142.1
Std. Dev. = sqrt(142.1 / (10 - 1)) ≈ 3.97
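The calculation steps for the sample standard deviation can be traced with a minimal sketch like this one, which follows the mean, deviations, squared deviations and n-1 division in order. Variable names are illustrative.

```python
from math import sqrt

sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
n = len(sample)

# Step 1: the mean.
sample_mean = sum(sample) / n

# Step 2: deviations of each observation from the mean.
deviations = [x - sample_mean for x in sample]

# Step 3: sum of the squared deviations.
sum_of_squares = sum(d * d for d in deviations)

# Step 4: divide by n - 1 (sample variance) and take the square root.
variance = sum_of_squares / (n - 1)
std_dev = sqrt(variance)

print(round(sum_of_squares, 1), round(variance, 2), round(std_dev, 2))
# 142.1 15.79 3.97
```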

34 Empirical rules
If the distribution is symmetrical and bell-shaped:
- Approximately 68% of the observations will be within plus or minus one standard deviation of the mean.
- Approximately 95% of the observations will be within two standard deviations of the mean.
- Approximately 99.7% of the observations will be within three standard deviations of the mean.
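These percentages can be checked empirically on simulated bell-shaped data. The sketch below is an illustration only: the mean of 50, standard deviation of 10, sample size and random seed are arbitrary assumptions, and the helper `share_within` is a hypothetical name.

```python
import random
from statistics import mean, stdev

# Simulated bell-shaped data; the parameters are arbitrary choices.
random.seed(0)
data = [random.gauss(50, 10) for _ in range(20_000)]

center = mean(data)
spread = stdev(data)


def share_within(values, center, spread, k):
    """Fraction of values within k standard deviations of the mean."""
    inside = sum(1 for v in values if abs(v - center) <= k * spread)
    return inside / len(values)


# Shares for one, two and three standard deviations.
shares = {k: share_within(data, center, spread, k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.3f}")
```

The printed shares should come out close to 0.68, 0.95 and 0.997, with small differences due to sampling.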

35 Percentiles
- The 75th percentile is the value such that 75% of the numbers are less than or equal to this value and the remaining 25% are larger than this value.
- The k-th percentile is the value such that k% of the numbers are less than or equal to this value and the remaining (100-k)% are larger than this value.
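One simple way to turn this definition into a computation is the nearest-rank method, sketched below. This is one of several percentile conventions in use and is an assumption here, not necessarily the one intended by the slides; the function name `percentile` is illustrative.

```python
from math import ceil


def percentile(values, k):
    """Nearest-rank k-th percentile: the smallest observation such that
    at least k% of the observations are less than or equal to it."""
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    ordered = sorted(values)
    # Rank (1-based) of the observation that covers k% of the data.
    rank = ceil(k * len(ordered) / 100)
    return ordered[rank - 1]


observations = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
print(percentile(observations, 75))  # 27
print(percentile(observations, 50))  # 24
```

With small data sets and tied values, an exact k% / (100-k)% split is usually not attainable, which is why practical methods settle for "at least k%".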

36 Correlation coefficient
- The correlation coefficient measures how closely two variables are (linearly) related to each other. It has a value between -1 and +1.
- A positive value indicates a positive linear relationship; a negative value indicates a negative linear relationship.
- If two variables are not linearly related, the correlation coefficient will be close to zero; if they are closely related, the correlation coefficient will be close to 1 or -1.
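As an illustration, the sketch below computes the Pearson correlation coefficient, the usual linear correlation measure (assumed here, since the slides give no formula), for the height and foot-size data from the scatter plot example. The function name `correlation` is illustrative.

```python
from math import sqrt

# Height and foot size (cm) of the 8 men from the scatter plot example.
heights = [155, 160, 149, 175, 182, 145, 177, 164]
foot_sizes = [23.3, 21.8, 22.1, 26.3, 28.0, 20.7, 25.3, 24.9]


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(heights, foot_sizes)
print(round(r, 2))  # 0.93
```

A value this close to +1 indicates a strong positive linear relationship between height and foot size in this small sample.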

