Instructor: Prof. Ken Tsang T.A. : Ms Lisa Liu Office: E409 Tel: 362 0606(office) 362 0630(T.A.)


1 Instructor: Prof. Ken Tsang T.A.: Ms Lisa Liu Office: E409 Tel: 362 0606 (office), 362 0630 (T.A.) Email: kentsang@uic.edu.hk (Instructor), songfengfu@uic.edu.hk (TA) Speaking of Statistics (13)

2 What is Statistics all about?
- The subject of statistics involves the study of how to collect, summarize, analyze and interpret data.
- Data are numerical facts and figures from which conclusions can be drawn. Such conclusions are important to the decision-making processes of many professions and organizations.

3 Some sources of data are:
- Data distributed by an organization or an individual
- A designed experiment
- A survey
- An observational study
- Web, telephone
Data can be:
- Curves, figures
- Sounds
- Papers, books
- Web, telephone

4 Distinguished Statisticians in History! Sir R. A. Fisher 1890-1962 Karl Pearson 1857-1936

5 W. Edwards Deming --The Father of the Quality Evolution 1900-1993

6 Data Scientist: The Sexiest Job of the 21st Century

7

8 Final Evaluation Proportion

9 CRA (Criterion-Referenced Assessment)
- Adoption of Criterion-Referenced Assessment (CRA) for evaluating students' performance
- OBTL syllabus: the CRA model is directly compatible with the OBTL philosophy.

10 UIC Regulations on CRA
We will use rubrics for the following assessments:
1. Oral Presentation and Group Presentation (20%)
2. Final Examination (40%)

11 Rubric for Assessment of Oral and Group Presentation (1)
Performance levels: Excellent (4), Good (3), Satisfactory (2), Marginal Pass (1), Fail (0)
Criteria for assessment:
- Content of presentation: Organization (20% weighting); Accuracy and Depth (40% weighting)
- Presentation techniques: Oral English (10% weighting); Body language and facial expressions (10% weighting); Time management (10% weighting)
- Question and Answer Performance: Responsiveness (10% weighting)

12 Oral and Group Presentation (1)
- Choose your teammates: 4-5 members in one team
- Submit your group form before November
- Study the rubric for oral presentation (1)

13 Oral and Group Presentation (continued)
- Choose a topic for your team
- Prepare your PPT
- The oral presentation will be given (roughly) in the 12th week (Nov-Dec 2014)

14 Suggested Grade Distribution
Assessment grade system:
- A (not more than 5%)
- A and A- (not more than 15%)
- A and B grades, that is A, A-, B+, B and B- (not more than 75%)
- Below C, not including C (no limit)
Letter Grade: Academic Performance
- A, A-: Excellent
- B+, B, B-: Good
- C+, C: Satisfactory
- D: Marginal Pass
- F: Fail

15 Some notices on this Course
- Assignments must be handed in before the deadline. Assignments will not be accepted after the deadline!
- For the mid-term test and final examination, you cannot bring anything except stationery and water! Mobile phones are not allowed.
- For the final examination, we cannot tell you the score before the AR announces the official results. If you have any questions about the score, you can check the marked sheet via the AR.

16 General Information
Textbook: Essentials of Business Statistics, Bowerman/O'Connell/Murphree/Orris, McGraw Hill, International Edition, ISBN 978-0-07-131471-8
Advantages:
- Unified textbook for all the year one students
- More applications

17 General Information
References:
- Basic Statistics for Business & Economics, Fifth Edition. D.A. Lind, W.G. Marchal and S.A. Wathen, 2006, McGraw Hill, International Edition
- Business Statistics: A First Course, Fourth Ed. D.M. Levine, T.C. Krehbiel and M.L. Berenson, 2006, Pearson Prentice Hall, New Jersey
- Statistics for Business and Economics, Ninth Ed. J.T. McClave, P.G. Benson and T. Sincich, 2005, Pearson Prentice Hall, New Jersey
- Modern Elementary Statistics, 11th Ed. J.E. Freund, 2004, Prentice Hall

18 Statistics for the Behavioral Sciences, Frederick J Gravetter and Larry B. Wallnau, Wadsworth Publishing; 8th edition (December 10, 2008)

19 Chapter 1 An Introduction to Business Statistics
- Populations and Samples
- Sampling a Population of Existing Units
- Sampling a Process
- An Introduction to Survey Sampling

20 Section 1.1 Populations (总体) and Samples (样本)
Population: A set of existing units (people, objects, or events). Examples:
1. All of last year's graduates of Dartmouth College's Master of Business Administration program.
2. All Lincoln Town Cars that were produced last year.
3. All accounts receivable invoices accumulated last year by The Procter & Gamble Company.
4. All fires reported last month to the Tulsa, Oklahoma, fire department.

21 Variable (变量): A measurable characteristic of the population. We carry out a measurement to assign a value of a variable to each population unit.
- The variable is said to be quantitative (定量的) when its measurements represent quantities (for example, "how much" or "how many"). For example, annual starting salary is quantitative; age and number of children are also quantitative.
- The variable is said to be qualitative (定性的) or categorical (属性的) when its value is a descriptive category to which a population unit belongs. For example, a person's gender, the make of an automobile, and whether a person who purchases a product is satisfied with the product are qualitative.

22 There are two types of qualitative variables:
Nominative (无顺序分类的):
- Identifier or name
- Unranked categorization
- Example: gender, car color
Ordinal (顺序的):
- All characteristics of nominative plus...
- Rank-order categories
- Ranks are relative to each other
- Example: Low (1), moderate (2), or high (3) risk

23 Census (普查): An examination of the entire population of measurements.
Note: A census is usually too expensive, too time consuming, and too much effort for a large population.
Sample: A selected subset of the units of a population.
[Diagram labels: Population, Sample]

24 For example, a university graduated 8,742 students.
a. This is too large for a census.
b. So, we select a sample of these graduates and learn their annual starting salaries.
Sample of measurements: Measured values of the variable of interest for the sample units.
- For example, the actual annual starting salaries of the sampled graduates.

25 Descriptive statistics: The science of describing the important aspects of a set of measurements.
For example, for a set of annual starting salaries, we want to know:
- How much to expect
- What is a high versus low salary
- How much the salaries differ from each other
If the population is small enough, we could take a census and not have to sample and make any statistical inferences.
But if the population is too large, then ...

26 Statistical Inference (统计推断): The science of using a sample of measurements to make generalizations about the important aspects of a population of measurements.
- For example, use a sample of starting salaries to estimate the important aspects of the population of starting salaries.
There is a criterion for choosing a sample: the information contained in the sample must accurately reflect the population under study.

27 The Lady Tasting Tea: Tea tastes different depending on whether the tea was poured into the milk or the milk was poured into the tea. Let us test the proposition!

28 Observational Study: Smoking is harmful to health

29 Section 1.2 Sampling a Population of Existing Units
Random sample (随机样本): A random sample is a sample selected from a population so that:
- Each population unit has the same chance of being selected as every other unit
- Each possible sample (of the same size) has the same chance of being selected
For example, randomly pick two different people from a group of 15:
- Number the people from 1 to 15, and write their numbers on 15 different slips of paper
- Thoroughly mix the papers and randomly pick two of them
- The numbers on the slips identify the people for the sample

30 Sample with replacement (有放回抽样): Replace each sampled unit before picking the next unit
- The unit is placed back into the population for possible reselection
- However, the same unit in the sample does not contribute new information
Sample without replacement (无放回抽样): A sampled unit is withheld from possibly being selected again in the same sample
- Guarantees a sample of different units
- Each sampled unit contributes different information
- Sampling without replacement is the usual and customary sampling method
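The difference between the two sampling schemes can be tried out in a few lines. This is only a sketch: the population of 15 numbered people mirrors the slip-of-paper example, while the seed and the variable names are arbitrary choices made for illustration.

```python
import random

# Population: 15 people numbered 1 to 15, as in the slip-of-paper example.
population = list(range(1, 16))

rng = random.Random(2014)  # fixed seed (arbitrary) so the sketch is repeatable

# Sampling without replacement: a selected unit cannot be picked again.
without_replacement = rng.sample(population, k=2)

# Sampling with replacement: every pick is made from the full population,
# so the same person may appear more than once.
with_replacement = rng.choices(population, k=2)

print("without replacement:", without_replacement)
print("with replacement:   ", with_replacement)
```

With `rng.sample` the two selected numbers are always different; with `rng.choices` a repeat is possible, in which case the second pick adds no new information.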

31 Example 1.1 The Cell Phone Case: Estimating Cell Phone Costs
The bank has 2,136 employees on a 500-minute-per-month plan with a monthly cost of $50. The bank will estimate its cellular cost per minute for this plan by examining the number of minutes used last month by each of 100 randomly selected employees on this 500-minute plan. According to the cellular management service, if the cellular cost per minute for the random sample of 100 employees is over 18 cents per minute, the bank should benefit from automated cellular management of its calling plans.

32 Selecting the sample
- In order to randomly select the sample of 100 cell phone users, the bank will make a numbered list of the 2,136 users on the 500-minute plan. This list is called a frame (设计框架).
- The bank can use a random number table, such as Table 1.1(a), or a computer software package, as in Table 1.1(b), to select the needed sample.
- The 100 cellular-usage figures are given in Table 1.2.
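As a rough illustration of the software route, the frame can be represented as the numbers 1 to 2,136 and the sample drawn from it. The seed and names below are assumptions for the sketch; the selected numbers will not match Table 1.1.

```python
import random

FRAME_SIZE = 2136   # employees on the 500-minute plan
SAMPLE_SIZE = 100   # employees to be examined

# The frame: a numbered list of all users (numbers stand in for employee records).
frame = list(range(1, FRAME_SIZE + 1))

rng = random.Random(1)  # arbitrary seed, for repeatability only

# Draw 100 different frame numbers; every subset of size 100 is equally likely.
sampled_ids = sorted(rng.sample(frame, SAMPLE_SIZE))

print(sampled_ids[:10])  # first few selected frame numbers
```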

33

34

35 Approximately Random Samples
Sometimes it is not possible to list and thus number all the units in a population. In such a situation we often select a systematic sample, which approximates a random sample.
A Systematic Sample (系统抽样): Randomly enter the population and systematically sample every kth unit.

36 Example 1.2 The Marketing Research Case: Rating a New Bottle Design
To study consumer reaction to a new design, the brand group will use the “mall intercept method,” in which shoppers at a large metropolitan shopping mall are intercepted and asked to participate in a consumer survey. The questionnaire is shown in Figure 1.1. Each shopper will be exposed to the new bottle design and asked to rate the bottle image using a 7-point “Likert scale.”
We select a systematic sample. To do this, every 100th shopper passing a specified location in the mall will be invited to participate in the survey. During a Tuesday afternoon and evening, a sample of 60 shoppers is selected by using the systematic sampling process. The 60 composite scores are given in Table 1.3. From this table, we can estimate that 95 percent of the shoppers would give the bottle design a composite score of at least 25.

37

38 Another Sampling Method
Voluntary response sample: Participants select themselves to be in the sample
- Participants “self-select”
- For example, calling in to vote on American Idol
- Commonly referred to as a “non-scientific” sample
Usually not representative of the population
- Over-represents individuals with strong opinions
- Usually, but not always, negative opinions

39 Section 1.3 Sampling a Process
Process (过程): A sequence of operations that takes inputs (labor, raw materials, methods, machines, and so on) and turns them into outputs (products, services, and the like)
[Diagram: Inputs, Process, Outputs]

40 Processes produce output over time
- The “population” from a process is all output produced in the past, present, and the yet-to-occur future.
- For example, all automobiles of a particular make and model, for instance, the Lincoln Town Car
- Cars will continue to be made over time

41 Example 1.3 The Coffee Temperature Case: Monitoring Coffee Temperatures
This case concerns coffee temperatures at a fast-food restaurant. To monitor them, the restaurant personnel measure the temperature of the coffee being dispensed (in degrees F) at half-hour intervals from 10 A.M. to 9:30 P.M. on a given day. The data are listed in Table 1.7.
- A process is in statistical control if it does not exhibit any unusual process variations.
- To determine whether a process is in control, sample the process often enough to detect unusual variations.
- A runs plot is a graph of individual process measurements over time. Figure 1.3 shows a runs plot of the temperature data.

42

43 Figure 1.3 Runs Plot of Coffee Temperatures: The Process is in Statistical Control.

44 Results
Over time, temperatures appear to have a fairly constant amount of variation around a fairly constant level
- The temperature is expected to be at the constant level shown by the horizontal blue line
- Sometimes the temperature is higher and sometimes lower than the constant level
- About the same amount of spread of the values (data points) around the constant level
- The points are as far above the line as below it
- The data points appear to form a horizontal band
So, the process is in statistical control
- The coffee-making process is operating “consistently”

45 Remark
Because the coffee temperature has been and is presently in control, it will likely stay in control in the future
- If the coffee-making process stays in control, then coffee temperature is predicted to be between 152°F and 170°F
In general, if the process appears from the runs plot to be in control, then it will probably remain in control in the future
- The sample of measurements was approximately random
- Future process performance is predictable

46 Section 1.4 An Introduction to Survey Sampling
We already know some sampling methods (also called sampling designs). They are:
- Random sampling (the focus of this book)
- Systematic sampling
- Voluntary response sampling
But there are other sample designs:
- Stratified random sampling (分层随机抽样)
- Cluster sampling (分块抽样)

47 Stratified Random Sample
- Divide the population into non-overlapping groups, called strata, of similar units
- Separately, select a random sample from each and every stratum
- Combine the random samples from each stratum to make the full sample
Appropriate when the population consists of two or more different groups so that:
- The groups differ from each other with respect to the variable of interest
- Units within a group are similar to each other
For example, divide the population into strata by age, gender, income, etc.
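The three steps of stratified random sampling can be sketched as follows. The strata, their sizes, and the helper name `stratified_sample` are hypothetical and chosen only for illustration; how many units to draw from each stratum is a design decision the slides do not specify.

```python
import random

def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample from every stratum and combine them.

    strata: dict mapping stratum name -> list of units
    sizes:  dict mapping stratum name -> number of units to draw
    """
    rng = random.Random(seed)
    sample = []
    for name, units in strata.items():
        # Separate random sample (without replacement) inside this stratum.
        sample.extend(rng.sample(units, sizes[name]))
    return sample

# Hypothetical population split into non-overlapping age strata.
strata = {
    "under_30": [f"U{i}" for i in range(50)],
    "30_to_50": [f"M{i}" for i in range(30)],
    "over_50": [f"O{i}" for i in range(20)],
}
# Assumed allocation: 10% of each stratum.
sizes = {"under_30": 5, "30_to_50": 3, "over_50": 2}

full_sample = stratified_sample(strata, sizes, seed=0)
print(full_sample)
```

Every stratum is guaranteed to be represented, which a single simple random sample of 10 units would not guarantee.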

48 Cluster Sampling
- “Cluster” or group a population into subpopulations
- Cluster by geography, time, and so on...
- Each cluster is a representative small-scale version of the population (i.e. heterogeneous group)
- A simple random sample is chosen from each cluster
- Combine the random samples from each cluster to make the full sample
Appropriate for populations spread over a large geographic area so that...
- There are different sections or regions in the area with respect to the variable of interest
- A random sample of the cluster

49 More on Systematic Sampling
- Want a sample containing n units from a population containing N units
- Take the ratio N/n and round down to the nearest whole number
- Call the rounded result k
- Randomly select one of the first k elements from the population list
- Step through the population from the first chosen unit and select every kth unit
- This method has the properties of a simple random sample, especially if the list of the population elements is a random ordering
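The steps above can be sketched in Python. The function name and seed are illustrative, and trimming the result to exactly n units is an assumption of this sketch (stepping through the whole list can yield a few extra units when N/n is not a whole number).

```python
import random

def systematic_sample(population, n, seed=None):
    """Select every kth unit after a random start among the first k units."""
    rng = random.Random(seed)
    N = len(population)
    k = N // n                    # ratio N/n rounded down
    start = rng.randrange(k)      # index of one of the first k elements
    # Every kth unit from the random start; keep only the first n (assumption).
    return population[start::k][:n]

# Illustration with a frame of 2,136 numbered units and n = 100, so k = 21.
frame = list(range(1, 2137))
sample = systematic_sample(frame, 100, seed=3)
print(sample[:5])
```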

50 Sampling Problem
- Random sampling should eliminate bias
- But even a random sample may not be representative because of:
  - Under-coverage: Too few sampled units or some of the population was excluded
  - Non-response: When a sampled unit cannot be contacted or refuses to participate
  - Response bias: Responses of selected units are not truthful

51 Chapter 2 Descriptive Statistics
- Describing the Shape of a Distribution
- Describing Central Tendency
- Measures of Variation
- Percentiles, Quartiles, and Box-and-Whiskers Displays
- Describing Qualitative Data
- Weighted Means

52 Section 2.1 Describing the Shape of a Distribution
- To know what the population looks like, find the “shape” of its distribution
- Picture the distribution graphically by any of the following methods:
  - Stem-and-leaf display (茎叶图)
  - Frequency distributions (頻率分布表)
  - Histogram (直方图)
  - Dot plot (点图)

53 Stem-and-leaf display
The purpose of a stem-and-leaf display is to see the overall pattern of the data, by grouping the data into classes
- To see:
  - the variation from class to class
  - the amount of data in each class
  - the distribution of the data within each class
- Best for small to moderately sized data distributions

54 Example 2.1 The Car Mileage Case
- In this case study, we consider a tax credit offered by the federal government to automakers for improving the fuel economy of midsize cars.
- To find the combined city and highway mileage estimate for a particular car model, the EPA tests a sample of cars.
- Table 2.1 presents the sample of 49 gas mileages that have been obtained by the new midsize model.

55 Table 2.1 A sample of 49 mileages
30.8 30.9 32.0 32.3 32.6
31.7 30.4 31.4 32.7 31.4
30.1 32.5 30.8 31.2 31.8
31.6 30.3 32.8 30.6 31.9
32.1 31.3 32.0 31.7 32.8
33.3 32.1 31.5 31.4 31.5
31.3 32.5 32.4 32.2 31.6
31.0 31.8 31.0 31.5 30.6
32.0 30.4 29.8 31.7 32.2
32.4 30.5 31.1 30.6

56 The stem-and-leaf display of car mileages:
29 | 8
30 | 13445666889
31 | 00123344455566777889
32 | 0001122344556788
33 | 3
Reading the display: 29 + 0.8 = 29.8 and 33 + 0.3 = 33.3

57 Another display of the same data using more classes
- Starred classes (*) extend from 0.0 to 0.4
- Unstarred classes extend from 0.5 to 0.9
29  | 8
30* | 1344
30  | 5666889
31* | 001233444
31  | 55566777889
32* | 0001122344
32  | 556788
33* | 3

58 Shape of the car mileage distribution
- Looking at the last stem-and-leaf display, the distribution appears almost “symmetrical”
- The upper portion of the display (stems 29, 30*, 30, and 31*) ...
- ... is almost a mirror image of the lower portion of the display (stems 31, 32*, 32, and 33*)
- But not exactly a mirror reflection: maybe slightly more data in the lower portion than in the upper portion
- Later, we will call this a slightly “left-skewed” distribution

59 Constructing a Stem-and-Leaf Display
1. Decide what units will be used for the stems and the leaves. As a general rule, choose units for the stems so that there will be somewhere between 5 and 20 stems.
2. Place the stems in a column with the smallest stem at the top of the column and the largest stem at the bottom.
3. Enter the leaf for each measurement into the row corresponding to the proper stem. The leaves should be single-digit numbers (rounded values).
4. If desired, rearrange the leaves so that they are in increasing order from left to right.
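The four steps can be followed mechanically. The sketch below applies them to the 49 mileages of Table 2.1, using whole numbers as stems and tenths as leaves; the function name `stem_and_leaf` is an illustrative choice, not something from the textbook.

```python
from collections import defaultdict

# The 49 mileages from Table 2.1.
mileages = [
    30.8, 30.9, 32.0, 32.3, 32.6, 31.7, 30.4, 31.4, 32.7, 31.4,
    30.1, 32.5, 30.8, 31.2, 31.8, 31.6, 30.3, 32.8, 30.6, 31.9,
    32.1, 31.3, 32.0, 31.7, 32.8, 33.3, 32.1, 31.5, 31.4, 31.5,
    31.3, 32.5, 32.4, 32.2, 31.6, 31.0, 31.8, 31.0, 31.5, 30.6,
    32.0, 30.4, 29.8, 31.7, 32.2, 32.4, 30.5, 31.1, 30.6,
]

def stem_and_leaf(values):
    """Return {stem: sorted leaves}; stems are whole numbers, leaves are tenths."""
    rows = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # 30.8 -> 308, avoids floating-point surprises
        stem, leaf = divmod(tenths, 10)   # 308 -> stem 30, leaf 8
        rows[stem].append(leaf)
    # Smallest stem first, leaves in increasing order (steps 2 and 4).
    return {stem: sorted(rows[stem]) for stem in sorted(rows)}

display = stem_and_leaf(mileages)
for stem, leaves in display.items():
    print(stem, "|", "".join(str(leaf) for leaf in leaves))
```

The printed rows match the five-stem display of the car mileages shown on slide 56.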

60 Example 2.2 The Payment Time Case: Reducing Payment Times
In order to assess the effectiveness of the system, the consulting firm will study the payment times for invoices processed during the first three months of the system’s operation. During this period, 7,823 invoices are processed using the new system. To study the payment times of these invoices, the consulting firm numbers the invoices from 0001 to 7823 and uses random numbers to select a random sample of 65 invoices. The resulting 65 payment times are given in Table 2.2.

61 Table 2.2 A Sample of Payment Times (in Days) for 65 Randomly Selected Invoices
22 29 16 15 18 17 12 13 17 16 15
19 17 10 21 15 14 17 18 12 20 14
16 15 16 20 22 14 25 19 23 15 19
18 23 22 16 16 19 13 18 24 24 26
13 18 17 15 24 15 17 14 18 17 21
16 21 25 19 20 27 16 17 16 21

62 Stem-and-leaf display of the payment times (columns: cumulative count, stem, leaves)
  1  10  0
  1  11
  3  12  00
  6  13  000
 10  14  0000
 17  15  0000000
 26  16  000000000
 (8) 17  00000000
 30  18  000000
 24  19  00000
 19  20  000
 16  21  000
 13  22  000
 10  23  00
  8  24  000
  5  25  00
  3  26  0
  2  27  0
  1  28
  1  29  0
The shorter tail is at the top of the display (small payment times) and the longer tail is at the bottom (large payment times).
- The leftmost column gives cumulative counts: the number of values in that stem and all stems between it and the nearer end of the display
- The number 8 in parentheses indicates that there are 8 payments in the stem for 17 days
- The number 26 (no parentheses) indicates that there are 26 payments made in 16 or fewer days

63 The Payment Times: Results
- Looking at this display, we see that all of the sampled payment times are substantially less than the 39-day typical payment time of the former billing system.
- The stem-and-leaf display does not appear symmetrical. The “tail” of the distribution consisting of the higher payment times is longer than the “tail” of the distribution consisting of the smaller payment times.
- We say that the distribution is skewed with a tail to the right.

64 Frequency Distribution and Histogram
A frequency distribution is a list of data classes with the count or “frequency” of values that belong to each class
- “Classify and count”
- The frequency distribution is a table
Show the frequency distribution in a histogram
- The histogram is a picture of the frequency distribution
See Example 2.2, The Payment Time Case

65 Constructing the frequency distribution
Steps in making a frequency distribution:
1. Determine the number of classes K
2. Determine the class length
3. Set the starting value for the classes, that is, the distribution “floor”
4. Calculate the class limits
5. Set up all the classes
Then tally the data into the K classes and record the frequencies

66 The number of classes K
- Group all of the n data into K number of classes
- K is the smallest whole number for which 2^K > n
- In Example 2.2, n = 65
- For K = 6, 2^6 = 64 < n
- For K = 7, 2^7 = 128 > n
- So use K = 7 classes

67 Class Length L
- Class length L is the step size from one class to the next
- In Example 2.2, The Payment Time Case, the largest value is 29 days and the smallest value is 10 days, so L = (29 - 10)/7 = 2.7143 days/class
- Arbitrarily round the class length up to 3 days/class

68 Starting the classes
- The classes start on the smallest data value
- This is the lower limit of the first class
- The upper limit of the first class is smallest value + (L – 1)
- In the example, the first class starts at 10 days and goes up to 12 days
- The second class starts at the upper limit of the first class + 1 and goes up (L – 1) more
- The second class starts at 13 days and goes up to 15 days
- And so on
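The first steps of the procedure (number of classes, class length, class limits) can be sketched for the payment time data. The strict inequality 2^K > n and the helper name `number_of_classes` are assumptions of this sketch; the class limits assume whole-day measurements, as in Example 2.2.

```python
import math

def number_of_classes(n):
    """Smallest whole number K with 2**K > n (strict inequality assumed)."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

n, smallest, largest = 65, 10, 29        # values from the Payment Time Case

K = number_of_classes(n)                 # number of classes
raw_length = (largest - smallest) / K    # (29 - 10) / 7
L = math.ceil(raw_length)                # round the class length up to whole days

# Each class runs from its lower limit to lower limit + (L - 1).
classes = [(smallest + i * L, smallest + i * L + (L - 1)) for i in range(K)]

print(K, round(raw_length, 4), L)
print(classes)
```

For these inputs the sketch gives K = 7, a raw class length of 2.7143 rounded up to 3, and the classes 10 to 12, 13 to 15, and so on up to 28 to 30.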

69 Tallies and Frequencies: Example 2.2
Classes (days)   Frequency
10 to 12          3
13 to 15         14
16 to 18         23
19 to 21         12
22 to 24          8
25 to 27          4
28 to 30          1
Total            65
Check: All frequencies must sum to n

70 Relative Frequency (相对频率)
- The relative frequency of a class is the proportion or fraction of data that is contained in that class
- Calculated by dividing the class frequency by the total number of data values
- Relative frequency may be expressed as either a decimal or percent
- A relative frequency distribution is a list of all the data classes and their associated relative frequencies

71 Relative Frequency: Example 2.2
Classes (days)   Frequency   Relative Frequency
10 to 12          3          3/65 = 0.0462
13 to 15         14          14/65 = 0.2154
16 to 18         23          0.3538
19 to 21         12          0.1846
22 to 24          8          0.1231
25 to 27          4          0.0615
28 to 30          1          0.0154
Total            65          1.0000
Check: All relative frequencies must sum to 1
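The tallying and the relative frequencies can be reproduced from the raw data of Table 2.2. This is a minimal sketch; the variable names are illustrative, and the class limits are the ones derived for Example 2.2.

```python
# The 65 payment times (in days) from Table 2.2.
payment_times = [
    22, 29, 16, 15, 18, 17, 12, 13, 17, 16, 15,
    19, 17, 10, 21, 15, 14, 17, 18, 12, 20, 14,
    16, 15, 16, 20, 22, 14, 25, 19, 23, 15, 19,
    18, 23, 22, 16, 16, 19, 13, 18, 24, 24, 26,
    13, 18, 17, 15, 24, 15, 17, 14, 18, 17, 21,
    16, 21, 25, 19, 20, 27, 16, 17, 16, 21,
]
classes = [(10, 12), (13, 15), (16, 18), (19, 21), (22, 24), (25, 27), (28, 30)]

n = len(payment_times)

# "Classify and count": tally the values falling inside each class.
frequencies = [sum(lo <= t <= hi for t in payment_times) for lo, hi in classes]

# Relative frequency = class frequency / total number of data values.
relative = [f / n for f in frequencies]

for (lo, hi), f, r in zip(classes, frequencies, relative):
    print(f"{lo} to {hi}  {f:>3}  {r:.4f}")
print("total", sum(frequencies), round(sum(relative), 4))
```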

72 Class boundaries and midpoints: Example 2.2
Classes    Frequency   Relative Frequency   Boundaries    Midpoint
10 to 12    3          0.0462               9.5, 12.5     11
13 to 15   14          0.2154               12.5, 15.5    14
16 to 18   23          0.3538               15.5, 18.5    17
19 to 21   12          0.1846               18.5, 21.5    20
22 to 24    8          0.1231               21.5, 24.5    23
25 to 27    4          0.0615               24.5, 27.5    26
28 to 30    1          0.0154               27.5, 30.5    29
Total      65          1.0000

73 Histogram
- A graph in which rectangles represent the classes
- The base of the rectangle represents the class length
- The height of the rectangle represents
  - the frequency in a frequency histogram, or
  - the relative frequency in a relative frequency histogram

74 Example 2.2: The Payment Times Case
[Figures: Frequency Histogram; Relative Frequency Histogram]
As with the earlier stem-and-leaf display, the tail on the right appears to be longer than the tail on the left.

75 Example 2.1 The Car Mileage Case
We should use K = 6 classes. The largest and smallest mileages in Table 2.1 are 33.3 and 29.8, so we find the class length by computing (33.3 - 29.8)/6 = 0.5833. To obtain a more convenient class length, we round this value up to 0.6. To form the first class, we start with the smallest mileage, 29.8, and add 0.5 (the class length less one measurement unit of 0.1) to obtain the class 29.8-30.3. Continuing in the same way, we obtain all the classes.
Remark: Although we have given a procedure for determining the number of classes, it is often desirable to let the nature of the problem determine the classes.

76 Table: A Frequency Distribution and a Relative Frequency Distribution of the 49 Mileages
Classes      Freq.   Relative Freq.   Boundaries      Midpoint
29.8-30.3     3      0.0612           29.75, 30.35    30.05
30.4-30.9     9      0.1837           30.35, 30.95    30.65
31.0-31.5    12      0.2449           30.95, 31.55    31.25
31.6-32.1    13      0.2653           31.55, 32.15    31.85
32.2-32.7     9      0.1837           32.15, 32.75    32.45
32.8-33.3     3      0.0612           32.75, 33.35    33.05

77

78 Back-to-Back Histogram Display: Comparing Two Distributions with a Back-to-Back Histogram

79 The Normal Curve (正态曲线)
- Symmetrical and bell-shaped curve for a normally distributed population
- The height of the normal curve over any point represents the relative proportion of values near that point
Example 2.1, The Car Mileage Case

80 Normal distribution in nature
The bean machine is a device invented by Sir Francis Galton to demonstrate how the normal distribution appears in nature. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.

81 Normal distribution in nature
The distribution of the heights (in inches) of 1052 women fits the normal distribution, with a goodness-of-fit p value of 0.75.

82 Histogram of daily percentage changes in the S&P 500 index

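83 So what is a normal distribution? Put simply, it means “many in the middle, few at the two ends.” Take people’s heights: giants and dwarfs each make up a very small proportion of the population, while people of medium height make up the largest proportion. In statistical terms, if height is treated as a random variable, this pattern says that a person’s height is most likely to be near the mean, and the further a height deviates from the mean, the smaller its probability. In natural and social phenomena, a great many random variables follow, or approximately follow, a normal distribution. Writing a for the mean and b for the standard deviation, since P{a-b < X ≤ a+b} = 0.6826, P{a-2b < X ≤ a+2b} = 0.9544, and P{a-3b < X ≤ a+3b} = 0.9974, we can see that a normally distributed random variable X almost certainly takes a value between a-3b and a+3b. This is the so-called “3b rule.”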
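The three probabilities quoted on this slide can be checked numerically with Python's standard library. The sketch uses the standard normal distribution (mean 0, standard deviation 1), which is enough because the probabilities do not depend on the particular mean and standard deviation. Small differences in the fourth decimal place from the slide's figures are likely due to rounding in printed normal tables.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Probability of falling within k standard deviations of the mean.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, prob in within.items():
    print(f"within {k} standard deviation(s) of the mean: {prob:.4f}")
```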

84 Skewness (偏度)
- Skewed distributions are not symmetrical about their center. Rather, they are lop-sided with a longer tail on one side or the other.
- A population is distributed according to its relative frequency curve
- The skew is the side with the longer tail
[Figures: Right Skewed; Left Skewed; Symmetric]

85 Section 2.2 Describing Central Tendency
Population Parameters (总体参数)
- A population parameter is a number calculated from all the population measurements that describes some aspect of the population
- The population mean, denoted μ, is a population parameter and is the average of the population measurements

86 Point Estimates and Sample Statistics
- A point estimate (点估计) is a one-number estimate of the value of a population parameter
- A sample statistic is a number calculated using sample measurements that describes some aspect of the sample
- Use sample statistics as point estimates of the population parameters
- The sample mean, denoted x̄, is a sample statistic and is the average of the sample measurements
- The sample mean is a point estimate of the population mean

87 Measures of Central Tendency
- Mean, μ: The average or expected value
- Median, Md: The value of the middle point of the ordered measurements
- Mode, Mo: The most frequent value

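88 The Mean (均值)
Population: $X_1, X_2, \ldots, X_N$, with population mean $\mu = \frac{1}{N}\sum_{i=1}^{N} X_i$
Sample: $x_1, x_2, \ldots, x_n$, with sample mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$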

89 The Sample Mean (样本均值)
For a sample of size n, the sample mean is defined as $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}$ and is a point estimate of the population mean μ
- It is the value to expect, on average and in the long run

90 Mean as the balance point for a distribution
Data: 2, 2, 6, 10 mean=(2+2+6+10)/4=5
What will happen to the mean if we add one more number to the data?

91 Example: Car Mileage Case
Sample mean for the first five car mileages from Table 2.1: 30.8, 31.7, 30.1, 31.6, 32.1
mean=(30.8+31.7+30.1+31.6+32.1)/5=156.3/5=31.26

92 Example: Car Mileage Case Continued
Sample mean for all the car mileages from Table 2.1: mean=1546.1/49=31.5531
Based on this calculated sample mean, the point estimate of the mean mileage of all cars is 31.5531 mpg
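Both sample means in the Car Mileage Case can be recomputed from the data in Table 2.1. This is a short sketch using the standard library; the variable names are illustrative.

```python
from statistics import mean

# The 49 mileages from Table 2.1.
mileages = [
    30.8, 30.9, 32.0, 32.3, 32.6, 31.7, 30.4, 31.4, 32.7, 31.4,
    30.1, 32.5, 30.8, 31.2, 31.8, 31.6, 30.3, 32.8, 30.6, 31.9,
    32.1, 31.3, 32.0, 31.7, 32.8, 33.3, 32.1, 31.5, 31.4, 31.5,
    31.3, 32.5, 32.4, 32.2, 31.6, 31.0, 31.8, 31.0, 31.5, 30.6,
    32.0, 30.4, 29.8, 31.7, 32.2, 32.4, 30.5, 31.1, 30.6,
]
first_five = [30.8, 31.7, 30.1, 31.6, 32.1]

mean_first_five = round(mean(first_five), 2)   # sum of the five values divided by 5
mean_all = round(mean(mileages), 4)            # sum of all 49 values divided by 49

print(mean_first_five)  # 31.26
print(mean_all)         # 31.5531
```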

93 The Median (中位数)
The population or sample median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it
The median Md is found as follows:
1. If the number of measurements is odd, the median is the middlemost measurement in the ordered values
2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordered values

94 Data: 3, 5, 8, 10, 11 median=8

95 Data: 3, 3, 4, 5, 7, 8 median=(4+5)/2=4.5
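The odd/even rule for the median translates directly into code. The function below is a hand-written sketch of that rule (the standard library's `statistics.median` does the same job); it is applied to the two small data sets from the slides.

```python
def median(values):
    """Median by the odd/even rule: middle value, or average of the two middle values."""
    ordered = sorted(values)          # arrange in numerical order first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd n: the middlemost measurement
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average of the two middlemost

odd_case = median([3, 5, 8, 10, 11])
even_case = median([3, 3, 4, 5, 7, 8])

print(odd_case)   # 8
print(even_case)  # 4.5
```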

96 Example 2.3 Sample Median
Internist’s Yearly Salaries (x$1000): 127 132 138 141 144 146 152 154 165 171 177 192 241
Because n = 13 (odd), the median is the middlemost or 7th value of the ordered data, so Md = 152
- An annual salary of $180,000 is in the high end, well above the median salary of $152,000
- In fact, $180,000 is a very high and competitive salary

97 Data: 2, 2, 2, 3, 3, 12 mean=4 median=(2+3)/2=2.5

98 The Mode (众数)
The mode Mo of a population or sample of measurements is the measurement that occurs most frequently
- Modes are the values that are observed “most typically”
- Sometimes there are higher frequencies at two or more values
- If there are two modes, the data is bimodal
- If there are more than two modes, the data is multimodal
- When data are in classes, the class with the highest frequency is the modal class (the tallest box in the histogram)

99 Example 2.4: DVD Recorder Satisfaction. Satisfaction ratings on a scale of 1 (not satisfied) to 10 (extremely satisfied), arranged in increasing order: 1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10. Because n = 20 (even), the median is the average of the two middlemost ratings; these are the 10th and 11th values. Both of these are 8, so \(M_d = 8\). Because the rating 8 occurs with the highest frequency, \(M_o = 8\).
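One way to confirm the median and mode of the 20 ratings is with Python's standard library. This is only a sketch; `multimode` requires Python 3.8 or later and returns every value tied for the highest frequency, so it also reveals bimodal or multimodal data.

```python
from collections import Counter
from statistics import median, multimode

ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]

frequencies = Counter(ratings)   # how often each rating occurs
modes = multimode(ratings)       # all values with the highest frequency
middle = median(ratings)         # average of the 10th and 11th ordered values

print(frequencies[8], modes, middle)
```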

100

101 Relationships Among Mean, Median and Mode

102 Comparing Mean, Median & Mode. Bell-shaped distribution: Mean = Median = Mode. Right-skewed distribution: Mean > Median > Mode. Left-skewed distribution: Mean < Median < Mode. Also: The median is not affected by extreme values ("extreme values" are values much larger or much smaller than most of the data); the median is resistant to extreme values. The mean is strongly affected by extreme values; the mean is sensitive to extreme values.

103 Selecting a Measure of Central Tendency. Usually the mean is a good measure, because it uses every score in the distribution. There are some extreme cases in which the mean is not representative (or calculable). Then the mode and the median are used.

104 Mean = (10 + 11×4 + 12×3 + 13 + 100)/10 = 20.3; Median = (11+12)/2 = 11.5; Mode = 11
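The sensitivity of the mean to one extreme score can be illustrated as follows. The list `scores` is reconstructed from the mean formula on this slide (one 10, four 11s, three 12s, one 13, one 100), which is an assumption about the underlying data.

```python
from statistics import mean, median, mode

# Scores reconstructed from the slide's mean formula (an assumption)
scores = [10] + [11] * 4 + [12] * 3 + [13, 100]

mean_all = mean(scores)
median_all = median(scores)
mode_all = mode(scores)

# Drop the extreme score to see how much each measure moves
without_extreme = [x for x in scores if x != 100]
mean_trimmed = mean(without_extreme)
median_trimmed = median(without_extreme)

print(mean_all, median_all, mode_all)
print(round(mean_trimmed, 2), median_trimmed)
```

Removing the single score of 100 moves the mean from 20.3 to about 11.44, while the median only moves from 11.5 to 11.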

105 Mean: not computable. Median = (12+13)/2 = 12.5. Mode: not meaningful. Open-ended distributions: a distribution is said to be open-ended when there is no upper limit (or lower limit) for one of the categories.

106 Payment Time Case. Mean = 18.108 days; Median = 17.000 days; Mode = 16.000 days. So: expect the mean payment time to be 18.108 days; a long payment time would be > 17 days and a short payment time would be < 17 days; the typical payment time is 16 days.

107 Section 2.3 Measures of Variation( 变异数 ). Figure 2.31: 20 Repair Times for Personal Computers at Two Service Centers. Figure 2.31 indicates that we need measures of variation to express how the two distributions differ.

108 Range( 全距 ): the largest minus the smallest measurement. The Population Variance \(\sigma^2\) (pronounced sigma squared) ( 总体方差 ): the average of the squared deviations of all the population measurements from the population mean. Standard Deviation \(\sigma\) (pronounced sigma) ( 标准差 ): the square root of the variance.

109 The Range Range = largest measurement - smallest measurement The range measures the interval spanned by all the data Example 2.3: Internist’s Salaries (in thousands of dollars) 127 132 138 141 144 146 152 154 165 171 177 192 241 Range = 241 - 127 = 114 ($114,000)

110 The Variance. Population: \(X_1, X_2, \ldots, X_N\), with population variance \(\sigma^2\). Sample: \(x_1, x_2, \ldots, x_n\), with sample variance \(s^2\).

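111 The Variance. For a population of size N, the population variance is defined as \(\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}\). For a sample of size n, the sample variance \(s^2\) is defined as \(s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}\) and is a point estimate for \(\sigma^2\).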

112 Sample variability tends to underestimate the population value, which is why the sample variance divides by n - 1 rather than n.

113 The Standard Deviation( 标准差 ). Population standard deviation: \(\sigma = \sqrt{\sigma^2}\). Sample standard deviation: \(s = \sqrt{s^2}\).

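114 Example 2.5: Consider the population of profit margins for five of the best big companies in America as rated by Forbes magazine on its website on March 16, 2005. These profit margins are 8%, 10%, 15%, 12% and 5%. Population mean: \(\mu = (8 + 10 + 15 + 12 + 5)/5 = 10\). Population variance: \(\sigma^2 = [(8-10)^2 + (10-10)^2 + (15-10)^2 + (12-10)^2 + (5-10)^2]/5 = 58/5 = 11.6\). Population standard deviation: \(\sigma = \sqrt{11.6} \approx 3.406\).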
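The population formulas can be applied to the five profit margins directly. This is a minimal sketch with arbitrary variable names; it divides by N because the five companies are treated as the whole population.

```python
from math import sqrt

margins = [8, 10, 15, 12, 5]   # profit margins in percent
N = len(margins)

mu = sum(margins) / N                                # population mean
sigma_sq = sum((x - mu) ** 2 for x in margins) / N   # population variance
sigma = sqrt(sigma_sq)                               # population standard deviation

print(mu, round(sigma_sq, 1), round(sigma, 4))
```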

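115 Example 2.6: The Car Mileage Case. Sample variance and standard deviation for the first five car mileages from Table 2.1 (30.8, 31.7, 30.1, 31.6, 32.1), with \(\bar{x} = 31.26\): \(s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = 2.572/4 = 0.643\) and \(s = \sqrt{0.643} \approx 0.8019\) mpg.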
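For the sample version, Python's `statistics.variance` and `statistics.stdev` use the n - 1 divisor. The sketch below also computes the n-divisor version (named `divide_by_n` here purely for illustration) to show that it comes out smaller.

```python
from statistics import stdev, variance

mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
n = len(mileages)
xbar = sum(mileages) / n

s_sq = variance(mileages)   # divides the squared deviations by n - 1
s = stdev(mileages)

# Dividing by n instead gives a smaller value
divide_by_n = sum((x - xbar) ** 2 for x in mileages) / n

print(round(s_sq, 3), round(s, 4), round(divide_by_n, 4))
```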

116 Sample variance and standard deviation for all car mileages from Table 2.1. The point estimate of the variance of all cars is 0.638793 mpg² and the point estimate of the standard deviation of all cars is 0.7992 mpg.

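117 The computational formula for the sample variance: \(s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\). Example 2.7: The Payment Time Case. Consider the sample of 65 payment times in Table 2.2. Applying the formula with n = 65 gives \(s^2\), and its square root gives \(s\) in days.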
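Since Table 2.2 is not reproduced in this transcript, the sketch below checks the computational formula against the definitional one on the five car mileages instead. Using that data set is a substitution made here for illustration.

```python
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
n = len(mileages)
xbar = sum(mileages) / n

# Definitional formula: squared deviations from the sample mean
definitional = sum((x - xbar) ** 2 for x in mileages) / (n - 1)

# Computational formula: only needs the sum and the sum of squares
sum_x = sum(mileages)
sum_x_sq = sum(x ** 2 for x in mileages)
computational = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

print(round(definitional, 3), round(computational, 3))
```

The two results agree up to floating-point rounding. The computational form is convenient by hand, but it subtracts two large, nearly equal numbers, so the definitional form is usually safer in software.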

118 The Empirical Rule( 经验准则 ) for Normal Populations. If a population has mean \(\mu\) and standard deviation \(\sigma\) and is described by a normal curve, then: 1. 68.26% of the population measurements lie within one standard deviation of the mean: \([\mu - \sigma, \mu + \sigma]\). 2. 95.44% of the population measurements lie within two standard deviations of the mean: \([\mu - 2\sigma, \mu + 2\sigma]\). 3. 99.73% of the population measurements lie within three standard deviations of the mean: \([\mu - 3\sigma, \mu + 3\sigma]\).

119 Tolerance Intervals( 容许区间 ). An interval that contains a specified percentage of the individual measurements in a population is called a tolerance interval. The one-, two-, and three-standard-deviation intervals around \(\mu\) given in (1), (2) and (3) are tolerance intervals containing, respectively, 68.26 percent, 95.44 percent and 99.73 percent of the measurements in a normally distributed population. The three-sigma interval \([\mu \pm 3\sigma]\) is often taken to be a tolerance interval that contains almost all of the measurements in a normally distributed population.
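The three percentages can be reproduced from the normal distribution itself: the share of a normal population within k standard deviations of the mean is erf(k/√2). The helper name `coverage` below is hypothetical. The computed values differ from the slide's 68.26 and 95.44 in the last digit, presumably because the slide's figures come from a rounded normal table.

```python
from math import erf, sqrt

def coverage(k):
    """Share of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

one_sigma = round(100 * coverage(1), 2)
two_sigma = round(100 * coverage(2), 2)
three_sigma = round(100 * coverage(3), 2)

print(one_sigma, two_sigma, three_sigma)
```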

120 Figure 2.32 The Empirical Rule and Tolerance Intervals

121 Example 2.8: The Car Mileage Case. Using \(\bar{x} = 31.5531\) and \(s = 0.7992\) as point estimates of \(\mu\) and \(\sigma\): 68.26% of all individual cars will have mileages in the range \([\bar{x} \pm s] \approx [30.8, 32.4]\) mpg; 95.44% of all individual cars will have mileages in the range \([\bar{x} \pm 2s] \approx [30.0, 33.2]\) mpg; 99.73% of all individual cars will have mileages in the range \([\bar{x} \pm 3s] \approx [29.2, 34.0]\) mpg.
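A short sketch of how such estimated tolerance intervals can be computed from the point estimates reported on slides 92 and 116 (sample mean 31.5531 mpg, sample standard deviation 0.7992 mpg). The function name `tolerance_interval` is an assumption introduced for illustration.

```python
xbar = 31.5531   # sample mean of all car mileages (slide 92)
s = 0.7992       # sample standard deviation (slide 116)

def tolerance_interval(center, spread, k):
    """Estimated interval [center - k*spread, center + k*spread], rounded to 0.1 mpg."""
    return (round(center - k * spread, 1), round(center + k * spread, 1))

intervals = {k: tolerance_interval(xbar, s, k) for k in (1, 2, 3)}
print(intervals)
```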

122 Skewness and the Empirical Rule. The Empirical Rule holds for normally distributed populations. This rule also approximately holds for populations having mound-shaped (single-peaked) distributions that are not very skewed to the right or left. For example, recall the distribution of the 65 payment times, which indicates that the Empirical Rule holds.

123 Section 2.4 Percentiles, Quartiles( 四分之一分位点 ) and Box-and-Whiskers Display. For a set of measurements arranged in increasing order, the pth percentile( 百分位点 ) is a value such that p percent of the measurements fall at or below the value and (100 - p) percent of the measurements fall at or above the value. The first quartile \(Q_1\) is the 25th percentile. The second quartile (or median) \(M_d\) is the 50th percentile. The third quartile \(Q_3\) is the 75th percentile. The interquartile range IQR( 四分位距 ) is \(Q_3 - Q_1\).

124 Calculating the pth percentile. Calculate the index i = (p/100) × n. If i is not an integer, the next integer greater than i denotes the position of the pth percentile in the ordered arrangement. If i is an integer, then the pth percentile is the average of the measurements in positions i and i + 1 in the ordered arrangement.

125 Figure 2.33 Using stem-and-leaf displays to find percentiles. (a) The 75th percentile of the 65 payment times, and a five-number summary. (b) The 5th percentile of the 60 bottle design ratings, and a five-number summary.

126 Example 2.10: DVD Recorder Satisfaction. 20 customer satisfaction ratings: 1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10. \(Q_1 = (7+8)/2 = 7.5\); \(M_d = (8+8)/2 = 8\); \(Q_3 = (9+9)/2 = 9\); \(IQR = Q_3 - Q_1 = 9 - 7.5 = 1.5\).
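The index rule from slide 124 can be sketched as a small function and checked against this example. The name `percentile_by_rule` is hypothetical, and the sketch assumes p is a whole number strictly between 0 and 100. Note that library routines such as `numpy.percentile` use different interpolation conventions by default and may return slightly different values.

```python
def percentile_by_rule(values, p):
    """pth percentile using the index rule i = (p/100) * n.
    Assumes p is an integer with 0 < p < 100."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)   # integer arithmetic avoids float surprises
    if remainder:
        # i is not an integer: take the next position up (1-based position whole + 1)
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1 (1-based)
    return (ordered[whole - 1] + ordered[whole]) / 2

ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]
q1 = percentile_by_rule(ratings, 25)
md = percentile_by_rule(ratings, 50)
q3 = percentile_by_rule(ratings, 75)
iqr = q3 - q1

print(q1, md, q3, iqr)
```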

127 Five-Number Summary in descriptive statistics: 1. The smallest measurement. 2. The first quartile, \(Q_1\). 3. The median, \(M_d\). 4. The third quartile, \(Q_3\). 5. The largest measurement. Displayed visually using a box-and-whiskers plot.

128 Box-and-whisker plots. A box-and-whisker plot (sometimes called a boxplot) is a graph that presents information from a five-number summary. It does not show a distribution in as much detail as a stem-and-leaf plot or histogram does, but is especially useful for indicating whether a distribution is skewed and whether there are potential unusual observations (outliers) in the data set.

129 The Box-and-Whiskers Plots( 盒型图 ). The box plots the: first quartile, \(Q_1\); median, \(M_d\); third quartile, \(Q_3\); inner fences, located 1.5 × IQR away from the quartiles: \(Q_1 - (1.5 \times IQR)\) and \(Q_3 + (1.5 \times IQR)\); outer fences, located 3 × IQR away from the quartiles: \(Q_1 - (3 \times IQR)\) and \(Q_3 + (3 \times IQR)\).

130 The "whiskers" are dashed lines that plot the range of the data: a dashed line drawn from the box below \(Q_1\) down to the smallest measurement, and another dashed line drawn from the box above \(Q_3\) up to the largest measurement. Note: \(Q_1\), \(M_d\), \(Q_3\), the smallest value, and the largest value are sometimes referred to as the five-number summary.

131 Outliers( 异常值 ). Outliers are measurements that are very different from most of the other measurements, because they are either very much larger or very much smaller than most of the other measurements. Outliers lie beyond the fences of the box-and-whiskers plot: measurements between the inner and outer fences are mild outliers; measurements beyond the outer fences are severe outliers.
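The fence rules can be tried on the DVD recorder ratings, taking the quartiles from Example 2.10 as given. This is a sketch only: how to treat a value lying exactly on a fence is not stated on the slides, so the code assumes such a value is not beyond that fence.

```python
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]
q1, q3 = 7.5, 9          # quartiles from Example 2.10
iqr = q3 - q1            # 1.5

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # (5.25, 11.25)
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # (3.0, 13.5)

# Severe outliers lie strictly beyond the outer fences
severe = [x for x in ratings if x < outer[0] or x > outer[1]]

# Mild outliers lie beyond the inner fences but not beyond the outer fences
mild = [x for x in ratings
        if (x < inner[0] or x > inner[1]) and not (x < outer[0] or x > outer[1])]

print(inner, outer, mild, severe)
```

Under these assumptions the rating 1 is a severe outlier, and the ratings 3, 5 and 5 are mild outliers.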

132

133 Section 2.5 Describing Qualitative Data Pie charts( 饼图 ) of the proportion (as percent) of all cars sold in the United States by different manufacturers, 1970 versus 1997

134 Bar Chart( 柱状图 ) Percentage of Automobiles Sold by Manufacturer, 1970 versus 1997

135 Pie Chart Percentage of Automobiles Sold by Manufacturer,1997

136 A Bar Chart of U.S. Automobile Sales in 1997

137 Misleading Graphs and Charts: Scale Break Break the vertical scale to exaggerate effect Mean Salaries at a Major University, 2002 - 2005

138 Misleading Graphs and Charts: Scale Effects Compress vs. stretch the vertical axis to exaggerate or minimize the effect Mean Salary Increases at a Major University, 2002 - 2005

139 You can use simple mathematical operations (like averages) to create nonsensical "facts" that can drive whatever agenda you'd like. Example: the average wealth of the citizens of a particular town is $100,000, therefore they don't need any government assistance. (The town consists of 1 stingy millionaire and 9 homeless people.)

140 Weighted Means( 加权均值 ). Sometimes, some measurements are more important than others. Assign numerical "weights" to the data; weights measure the relative importance of each value. Calculate the weighted mean as \(\frac{\sum w_i x_i}{\sum w_i}\), where \(w_i\) is the weight assigned to the ith measurement \(x_i\).

141 Example 2.12: June 2001 unemployment rates in the U.S. by region. Want the mean unemployment rate for the U.S.

142 Calculate it as a weighted mean, so that the bigger the region, the more heavily it counts in the mean. The data values are the regional unemployment rates. The weights are the sizes of the regional labor forces. Note that the unweighted mean is 4.55%, which underestimates the true rate by 0.03%. That is, 0.0003 × 144.7 million = 43,410 workers.
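The regional figures behind Example 2.12 are not included in this transcript, so the sketch below uses two made-up regions purely to show the mechanics of a weighted mean. Both the rates and the labor-force sizes are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical data: two regions, NOT the figures from Example 2.12
rates = [4.0, 5.0]            # unemployment rates in percent
labor_force = [30.0, 10.0]    # labor force sizes in millions

weighted = weighted_mean(rates, labor_force)
unweighted = sum(rates) / len(rates)

print(weighted, unweighted)
```

Because the larger region has the lower rate in this made-up case, the weighted mean (4.25) falls below the unweighted mean (4.5).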

143 Population and Sample Proportions. Population: \(X_1, X_2, \ldots, X_N\), with population proportion \(p\). Sample: \(x_1, x_2, \ldots, x_n\), with sample proportion \(\hat{p}\). \(\hat{p}\) is the point estimate of \(p\). X is a qualitative variable.

144 Example 2.11: The Marketing Ethics Case. 117 out of 205 marketing researchers disapproved of the action taken in a hypothetical scenario. X = 117, number of researchers who disapprove; n = 205, number of researchers surveyed. Sample proportion: \(\hat{p} = X/n = 117/205 \approx 0.57\).

145 Scatter Diagrams. Scatter diagrams are used to examine possible relationships between two numerical variables. In a scatter diagram, one variable is measured on the vertical axis and the other variable is measured on the horizontal axis.

146 Scatter Plots( 散点图 ) Restaurant Ratings: Mean Preference vs. Mean Taste Visualize the data to see patterns, especially “trends”

147 A Scatter Plot Showing a Positive Linear Relationship

148 A Scatter Plot Showing Little or No Linear Relationship

149 A Scatter Plot Showing a Negative Linear Relationship

