Introduction STATISTICS Introduction Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National Taiwan University.


1 STATISTICS: Introduction. Professor Ke-Sheng Cheng, Department of Bioenvironmental Systems Engineering, National Taiwan University

2 Lecture notes will be posted on the class website
– https://www.space.ntu.edu.tw/navigate/s/E2DA955C12764B48B9C04F3 6492F48D1QQY
– Digital reference book: A modern introduction to probability and statistics / Dekking et al. [Electronic book]
Grades
– Homework (40%)
– Midterm (30%), Final (30%)
The R language will be used for data analysis.
A tutorial session is arranged on Thursday (6:00 – 7:00 pm). Attendance at the tutorial session is voluntary.
Class attendance rule
– If you are more than 15 minutes late for class, please do NOT enter the classroom until the next class session.
10/25/2015 Lab for Remote Sensing Hydrology and Spatial Modeling Department of Bioenvironmental Systems Engineering, National Taiwan University

4 What is “statistics”? Statistics is a science of “reasoning” from data: a body of principles and methods for extracting useful information from data, for assessing the reliability of that information, for measuring and managing risk, and for making decisions in the face of uncertainty.

5 The major difference between statistics and mathematics is that statistics always needs “observed” data, while mathematics does not. An important feature of statistical methods is the “uncertainty” involved in analysis.

6 Statistics is the discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty. As these issues are crucial throughout the sciences and engineering, statistics is an inherently interdisciplinary science.

7 Practical Applications of Statistics

8 Iris recognition
– An iris code consists of 2048 bits.
– The iris code of the same person may change at different times and different places. Thus one has to allow for a certain percentage of mismatching bits when identifying a person.
– Of the 2048 bits, 266 may be considered as uncorrelated.
The Hamming distance is defined as the fraction of mismatches between two iris codes.
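As a minimal sketch of the Hamming distance described on this slide, the Python snippet below counts mismatching bits between two equal-length codes and divides by the code length. The function name `hamming_fraction` and the randomly generated 2048-bit codes are illustrative assumptions, not real iris data.

```python
import random

def hamming_fraction(code_a, code_b):
    """Fraction of positions at which two equal-length bit codes differ."""
    if len(code_a) != len(code_b) or len(code_a) == 0:
        raise ValueError("codes must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(code_a, code_b))
    return mismatches / len(code_a)

# Hypothetical data: a random 2048-bit code and a copy with 205 bits flipped,
# standing in for the same iris recorded at a different time or place.
rng = random.Random(0)
reference = [rng.getrandbits(1) for _ in range(2048)]
noisy = [1 - bit if i < 205 else bit for i, bit in enumerate(reference)]

print(hamming_fraction(reference, noisy))  # 205 / 2048
```

A recognition rule would compare this fraction with a tolerance; the slide does not give one, so none is assumed here.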

9 A modern introduction to probability and statistics: understanding why and how / Dekking et al.

10 Killer Football

11 27.2 deaths, the average over the 5 days preceding and following the match. 41

12 Poisson process modeling
– Occurrences of rare events
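As a rough illustration of how a Poisson model can flag an unusually high count, the Python sketch below treats the 27.2 average from the football example as the Poisson rate. It reads the figure 41 as the count observed on the match day, which is an assumption because the transcript does not label it. The helper names are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def poisson_tail(k, lam):
    """P(X >= k), computed as 1 - P(X <= k - 1)."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))

rate = 27.2     # average daily deaths quoted on the slide
observed = 41   # assumed to be the match-day count

tail = poisson_tail(observed, rate)
print(f"P(X >= {observed}) under Poisson({rate}) = {tail:.4f}")
```

A small tail probability suggests that a count this large would be rare if the daily rate were unchanged.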

13 Economic Warfare Analysis During World War II
– In order to obtain more reliable estimates of German war production, experts from the Economic Warfare Division of the American Embassy and the British Ministry of Economic Warfare started to analyze markings and serial numbers obtained from captured German equipment.
– Each piece of enemy equipment was labeled with markings, which included all or some portion of the following information: (a) the name and location of the maker; (b) the date of manufacture; (c) a serial number; and (d) miscellaneous markings such as trademarks, mold numbers, casting numbers, etc.
A modern introduction to probability and statistics: understanding why and how / Dekking et al.

14 – The first products to be analyzed were tires taken from German aircraft shot down over Britain and from supply dumps of aircraft and motor vehicle tires captured in North Africa. The marking on each tire contained the maker’s name, a serial number, and a two-letter code for the date of manufacture.
– The first step in analyzing the tire markings involved breaking the two-letter date code. It was conjectured that one letter represented the month and the other the year of manufacture, and that there should be 12 letter variations for the month code and 3 to 6 for the year code. This, indeed, turned out to be true. The following table presents examples of the 12 letter variations used by four different manufacturers.

15 – For each month, the serial numbers could be recoded to numbers running from 1 to some unknown largest number N.
– The observed (recoded) serial numbers could be seen as a subset of these numbers.
– The objective was to estimate N for each month and each manufacturer separately by means of the observed (recoded) serial numbers.
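The slides do not say which estimator of N was used. As a hedged sketch, the Python code below shows two common choices for a sample of n distinct serial numbers drawn from 1, ..., N: one based on the sample mean (2·mean − 1) and one based on the sample maximum (max·(n + 1)/n − 1). The true N and the sample are hypothetical.

```python
import random

def estimate_from_mean(serials):
    """Mean-based estimate: the mean of 1..N is (N + 1) / 2, so N is about 2 * mean - 1."""
    return 2 * sum(serials) / len(serials) - 1

def estimate_from_max(serials):
    """Maximum-based estimate: max * (n + 1) / n - 1."""
    n = len(serials)
    return max(serials) * (n + 1) / n - 1

# Hypothetical month: true production N = 1000, 20 recoded serial numbers observed.
rng = random.Random(1)
true_n = 1000
sample = rng.sample(range(1, true_n + 1), 20)

print("mean-based estimate:", estimate_from_mean(sample))
print("max-based estimate: ", estimate_from_max(sample))
```

The maximum-based estimate can never fall below the largest observed serial number, whereas the mean-based estimate can.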

16 – With a sample of about 1400 tires from five producers, individual monthly output figures were obtained for almost all months over a period from 1939 to mid-1943.
– The following table compares the accuracy of estimates of the average monthly production of all manufacturers of the first quarter of 1943 with the statistics of the Speer Ministry that became available after the war.

17 – The accuracy of the estimates can be appreciated even more if we compare them with the figures obtained by Allied intelligence agencies. Using other methods, they estimated production at between 900 000 and 1 200 000 per month!
A modern introduction to probability and statistics: understanding why and how / Dekking et al.

18 The Monty Hall Problem

19 Standard assumptions
– The host must always open a door that was not picked by the contestant.
– The host must always open a door to reveal a goat and never the car.
– The host must always offer the chance to switch between the originally chosen door and the remaining closed door.

20 Assuming the car is worth one million NTD and the goat 5,000 NTD, the expected awards are
– 668,333.33 NTD for the choice of switching
– 336,666.67 NTD for the choice of not switching.
Simulation of the Monty Hall Problem using R.
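The slide runs this simulation in R; the code itself is not in the transcript. The Python sketch below is one possible version under the standard assumptions listed above. It simulates both strategies and also computes the expected awards from the exact winning probabilities (2/3 when switching, 1/3 when staying). All function and variable names are illustrative.

```python
import random
from fractions import Fraction

CAR_VALUE = 1_000_000   # NTD, from the slide
GOAT_VALUE = 5_000      # NTD, from the slide

def expected_award(p_car):
    """Expected award when the car is won with probability p_car."""
    return p_car * CAR_VALUE + (1 - p_car) * GOAT_VALUE

def play(switch, rng):
    """Play one game; return True if the contestant ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    """Simulated fraction of games won under the given strategy."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(trials)) / trials

exact_switch = float(expected_award(Fraction(2, 3)))
exact_stay = float(expected_award(Fraction(1, 3)))

print("simulated win rate, switch:", win_rate(True))
print("simulated win rate, stay:  ", win_rate(False))
print("expected award, switch:", round(exact_switch, 2))
print("expected award, stay:  ", round(exact_stay, 2))
```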

21 Ebola Outbreak in West Africa (as of Aug. 26, 2014)

22 2014 West Africa Ebola: total cases since the beginning of the 2014 outbreak

23 2014 West Africa Ebola: total death counts since the beginning of the 2014 outbreak

24 2014 West Africa Ebola: death rate since the beginning of the 2014 outbreak

25 Spatial & Temporal Rainfall Analysis

26 Taiwan Disaster Prevention Map (臺灣防災地圖) | Google Crisis Map http://www.google.org/crisismap/taiwan

27 Stochastic Modeling & Simulation
Building probability models for real-world phenomena.
– No matter how sophisticated a model is, it only represents our understanding of the complicated natural systems.
Generating a large number of possible realizations.
Making decisions or assessing risks based on simulation results.
Conducted by computers.

28 Simulation of a two-dimensional random walk. Possible applications?
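The slide shows only the result of the simulation. A minimal Python sketch of one way to generate such a walk follows, assuming unit steps on a square lattice with the four directions equally likely. The transcript does not specify the step rule, so this is an illustrative choice.

```python
import random

# Assumed step rule: one unit east, west, north, or south, each with probability 1/4.
STEPS = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def random_walk_2d(n_steps, seed=None):
    """Return the list of (x, y) positions visited, starting at the origin."""
    rng = random.Random(seed)
    x, y = 0, 0
    path = [(x, y)]
    for _ in range(n_steps):
        dx, dy = rng.choice(STEPS)
        x, y = x + dx, y + dy
        path.append((x, y))
    return path

path = random_walk_2d(1000, seed=42)
end_x, end_y = path[-1]
print("end point:", (end_x, end_y))
print("distance from origin:", (end_x ** 2 + end_y ** 2) ** 0.5)
```

Repeating the walk with many seeds gives a set of realizations from which quantities such as the typical distance from the origin can be estimated.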

29 Exploratory Data Analysis
Features of data distributions
– Histograms
– Center: mean, median
– Spread: variance, standard deviation, range
– Shape: skewness, kurtosis
– Order statistics and sample quantiles
– Clusters
– Extreme observations: outliers

30 Histogram: frequencies and relative frequencies
– A sample data set X
104.838935  265.018615  205.279506  146.938446   12.577133
 22.371870  129.538575   37.587841  231.608794   60.397366
 24.762863  275.440477   70.721022  100.717110   33.918756
 82.708815  149.905426  113.442704  131.144892    9.539663
 82.535199  150.761192  134.931864  174.200632  130.360126
115.387515  102.460651   16.480639    9.961515   53.449806
 64.158533  133.663194  139.201204  112.180103  105.368124
 72.895810  107.569047   81.266071  101.351639   16.652365
 85.553281   96.920012   34.202372   45.472935  149.996985
102.347372   19.277535  134.484317  121.101643   10.382787
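To make the difference between frequencies and relative frequencies concrete, the Python sketch below bins the 50 values of X. The bin width of 50 over the interval [0, 300) is an assumption; the transcript does not say which bins the slides use.

```python
X = [
    104.838935, 265.018615, 205.279506, 146.938446, 12.577133,
    22.371870, 129.538575, 37.587841, 231.608794, 60.397366,
    24.762863, 275.440477, 70.721022, 100.717110, 33.918756,
    82.708815, 149.905426, 113.442704, 131.144892, 9.539663,
    82.535199, 150.761192, 134.931864, 174.200632, 130.360126,
    115.387515, 102.460651, 16.480639, 9.961515, 53.449806,
    64.158533, 133.663194, 139.201204, 112.180103, 105.368124,
    72.895810, 107.569047, 81.266071, 101.351639, 16.652365,
    85.553281, 96.920012, 34.202372, 45.472935, 149.996985,
    102.347372, 19.277535, 134.484317, 121.101643, 10.382787,
]

BIN_WIDTH = 50.0   # assumed bin width
N_BINS = 6         # bins [0, 50), [50, 100), ..., [250, 300)

# Frequency: number of observations falling in each bin.
counts = [0] * N_BINS
for x in X:
    counts[int(x // BIN_WIDTH)] += 1

# Relative frequency: frequency divided by the sample size.
relative = [c / len(X) for c in counts]

for i, (c, r) in enumerate(zip(counts, relative)):
    low = i * BIN_WIDTH
    print(f"[{low:5.0f}, {low + BIN_WIDTH:5.0f}): frequency {c:2d}, relative frequency {r:.2f}")
```

The two histograms have the same shape; only the vertical scale differs, since the relative frequencies sum to 1.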

31 Frequency histogram

32 Relative frequency histogram

33 Measures of center
– Sample mean
– Sample median
Sample mean = 98.26067
Sample median = 101.8495

34 – One desirable property of the sample median is that it is resistant to extreme observations, in the sense that its value depends only on the values of the middle observations and is quite unaffected by the actual values of the outer observations in the ordered list. The same cannot be said for the sample mean. Any significant change in the magnitude of an observation results in a corresponding change in the value of the mean. Hence, the sample mean is said to be sensitive to extreme observations.
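A small made-up example shows this resistance. In the Python sketch below, the largest of five hypothetical observations is replaced by an extreme value: the median stays the same while the mean changes drastically.

```python
import statistics

# Hypothetical observations, not from the course data set.
original = [2, 4, 6, 8, 10]
with_outlier = [2, 4, 6, 8, 1000]   # largest value replaced by an extreme one

mean_before = statistics.mean(original)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(original)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```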

35 Measures of spread
– Sample variance and sample standard deviation
– Range: the difference between the largest and smallest values
Sample variance = 4039.931
Sample standard deviation = 63.56045
Range = 265.9008 (275.440477 – 9.539663)

36 Measures of shape
– Sample skewness
– Sample kurtosis
Sample skewness = 0.7110874
Sample kurtosis = 0.533141 as excess kurtosis (or 3.533141 in R, without subtracting 3)
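The summary values for X can be checked with a short Python sketch. The variance uses the n − 1 divisor, as R's `var` does. The skewness and kurtosis use moment ratios with central moments divided by n; this convention is an assumption, since the transcript does not show the formulas, and other conventions give slightly different values.

```python
import statistics

X = [
    104.838935, 265.018615, 205.279506, 146.938446, 12.577133,
    22.371870, 129.538575, 37.587841, 231.608794, 60.397366,
    24.762863, 275.440477, 70.721022, 100.717110, 33.918756,
    82.708815, 149.905426, 113.442704, 131.144892, 9.539663,
    82.535199, 150.761192, 134.931864, 174.200632, 130.360126,
    115.387515, 102.460651, 16.480639, 9.961515, 53.449806,
    64.158533, 133.663194, 139.201204, 112.180103, 105.368124,
    72.895810, 107.569047, 81.266071, 101.351639, 16.652365,
    85.553281, 96.920012, 34.202372, 45.472935, 149.996985,
    102.347372, 19.277535, 134.484317, 121.101643, 10.382787,
]
n = len(X)

# Measures of center
mean = statistics.mean(X)
median = statistics.median(X)

# Measures of spread (variance and standard deviation divide by n - 1)
variance = statistics.variance(X)
std_dev = statistics.stdev(X)
data_range = max(X) - min(X)

# Measures of shape (assumed convention: central moments divide by n)
m2 = sum((x - mean) ** 2 for x in X) / n
m3 = sum((x - mean) ** 3 for x in X) / n
m4 = sum((x - mean) ** 4 for x in X) / n
skewness = m3 / m2 ** 1.5
kurtosis = m4 / m2 ** 2
excess_kurtosis = kurtosis - 3

print("mean", mean, "median", median)
print("variance", variance, "sd", std_dev, "range", data_range)
print("skewness", skewness, "kurtosis", kurtosis, "excess kurtosis", excess_kurtosis)
```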

37 Order statistics
Sample quantiles: linear interpolation

38 Box-and-whisker plot (or box plot)
– A box-and-whisker plot includes two major parts: the box and the whiskers.
– A parameter, range, determines how far the plot whiskers extend out from the box. If range is positive, the whiskers extend to the most extreme data point which is no more than range times the interquartile range (IQR) from the box. A value of zero causes the whiskers to extend to the data extremes.
– Outliers are marked by points which fall beyond the whiskers.
– Hinges and the five-number summary

40 – In R, a boxplot is essentially a graphical representation determined by the five-number summary (5NS).
Not “linear interpolation”.
The summary function in R yields six numbers: the minimum, first quartile, median, mean, third quartile, and maximum.
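The difference between quantiles obtained by linear interpolation and the hinges of the five-number summary can be sketched in Python. The interpolation rule below (position (n − 1)p in the sorted data) and the hinge rule are written to mirror what R's default `quantile` and `fivenum` are understood to do; treat that correspondence, and the default coefficient 1.5, as assumptions. The data are hypothetical.

```python
import math

def quantile_linear(data, p):
    """Sample quantile by linear interpolation between order statistics."""
    xs = sorted(data)
    h = (len(xs) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

def five_number_summary(data):
    """Minimum, lower hinge, median, upper hinge, maximum."""
    xs = sorted(data)
    n = len(xs)
    n4 = math.floor((n + 3) / 2) / 2
    positions = [1, n4, (n + 1) / 2, n + 1 - n4, n]   # 1-based positions
    return [0.5 * (xs[math.floor(d) - 1] + xs[math.ceil(d) - 1]) for d in positions]

def whisker_ends(data, coef=1.5):
    """Most extreme points within coef times the hinge spread of the box."""
    if coef == 0:
        return min(data), max(data)
    _, lower, _, upper, _ = five_number_summary(data)
    step = coef * (upper - lower)
    inside = [x for x in data if lower - step <= x <= upper + step]
    return min(inside), max(inside)

sample = [1, 2, 3, 4, 5, 6, 7, 8]
print(five_number_summary(sample))     # hinges are 2.5 and 6.5
print(quantile_linear(sample, 0.25))   # interpolated first quartile is 2.75

with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 30]
print(whisker_ends(with_outlier))      # 30 lies beyond the upper whisker
```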

41 – Box-and-whisker plot of X

42 Seasonal variation of average monthly rainfalls in CDZ, Myanmar
– Boxplots are based on average monthly rainfalls of 54 rainfall stations.

43 Random Experiment and Sample Space
An experiment that can be repeated under the same (or uniform) conditions, but whose outcome cannot be predicted in advance, even when the same experiment has been performed many times, is called a random experiment.

44 Examples of random experiments
– The tossing of a coin.
– The roll of a die.
– The selection of a numbered ball (1-50) in an urn (selection with replacement).
– Occurrences of earthquakes: the time interval between the occurrences of two consecutive earthquakes higher than scale 6.
– Occurrences of typhoons: the amount of rainfall produced by typhoons in one year (yearly typhoon rainfall).

45 The following items are always associated with a random experiment:
– Sample space. The set of all possible outcomes, denoted by $\Omega$.
– Outcomes. Elements of the sample space, denoted by $\omega$. These are also referred to as sample points or realizations.
– Events. Subsets of $\Omega$ for which the probability is defined. Events are denoted by capital Latin letters (e.g., A, B, C).

46 Definition of Probability
– Classical probability
– Frequency probability
– Probability model

47 Classical (or a priori) probability
If a random experiment can result in n mutually exclusive and equally likely outcomes and if $n_A$ of these outcomes have an attribute A, then the probability of A is the fraction $n_A/n$.

48 Example 1. Compute the probability of getting two heads if a fair coin is tossed twice. (1/4)
Example 2. The probability that a card drawn from an ordinary well-shuffled deck will be an ace or a spade. (16/52)
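Both answers follow from counting equally likely outcomes. The Python sketch below enumerates the sample spaces and applies the classical definition $n_A/n$; the function name `classical_probability` and the string labels for ranks and suits are illustrative.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, has_attribute):
    """n_A / n for a finite sample space of equally likely outcomes."""
    n_a = sum(1 for outcome in sample_space if has_attribute(outcome))
    return Fraction(n_a, len(sample_space))

# Example 1: two tosses of a fair coin.
two_tosses = list(product("HT", repeat=2))
p_two_heads = classical_probability(two_tosses, lambda o: o == ("H", "H"))

# Example 2: one card from a 52-card deck; count aces or spades.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["spades", "hearts", "diamonds", "clubs"]
deck = list(product(ranks, suits))
p_ace_or_spade = classical_probability(deck, lambda c: c[0] == "A" or c[1] == "spades")

print(p_two_heads)      # 1/4
print(p_ace_or_spade)   # 4/13, which equals 16/52
```

The count of 16 is 4 aces plus 13 spades minus the ace of spades, which would otherwise be counted twice.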

49 Remarks
The probabilities determined by the classical definition are called “a priori” probabilities since they can be derived purely by deductive reasoning.

50 The “equally likely” assumption requires the experiment to be carried out in such a way that the assumption is realistic, for example by using a balanced coin, a die that is not loaded, a well-shuffled deck of cards, random sampling, and so forth. This assumption also requires that the sample space is appropriately defined.

51 Troublesome limitations in the classical definition of probability:
– If the number of possible outcomes is infinite;
– If possible outcomes are not equally likely.

52 Relative frequency (or a posteriori) probability
We observe outcomes of a random experiment which is repeated many times. We postulate a number p which is the probability of an event, and approximate p by the relative frequency f with which the repeated observations satisfy the event.

53 Suppose a random experiment is repeated n times under uniform conditions. If event A occurs $n_A$ times, then the relative frequency with which A occurs is $f_n(A) = n_A/n$. If the limit of $f_n(A)$ as n approaches infinity exists, then one can assign the probability of A by $P(A) = \lim_{n \to \infty} f_n(A)$.

54 This method requires the existence of the limit of the relative frequencies. This property is known as statistical regularity. This property will be satisfied if the trials are independent and are performed under uniform conditions.

55 Example 3. A fair coin was tossed 100 times with 54 occurrences of heads. The probability of heads for each toss is estimated to be 0.54.
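The relative frequency in Example 3 is 54/100 = 0.54. To illustrate how the relative frequency of heads settles down as the number of tosses grows, the Python sketch below simulates a fair coin. The seed and the sample sizes are arbitrary choices.

```python
import random

def relative_frequency_of_heads(n_tosses, seed=0):
    """f_n(A) = n_A / n for the event A = 'heads' in n simulated fair tosses."""
    rng = random.Random(seed)
    n_heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return n_heads / n_tosses

# Example 3 from the slide: 54 heads in 100 tosses.
example_estimate = 54 / 100

for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency_of_heads(n))
```

With only 100 tosses the estimate can easily be a few hundredths away from 0.5, as in Example 3; with many more tosses it is typically much closer.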

56 The chain of probability definition
Random experiment → Sample space → Event space → Probability space

57 Probability Model

58 Event and event space
An event is a subset of the sample space. The class of all events associated with a given random experiment is defined to be the event space.

59 Remarks

60 Probability is a mapping of sets to numbers. Probability is not a mapping of the sample space to numbers.
– The expression $P(\omega)$ is not defined. However, for a singleton event $\{\omega\}$, $P(\{\omega\})$ is defined.

61 Probability space
A probability space is the triplet $(\Omega, \mathcal{A}, P[\cdot])$, where $\Omega$ is a sample space, $\mathcal{A}$ is an event space, and $P[\cdot]$ is a probability function with domain $\mathcal{A}$. A probability space constitutes a complete probabilistic description of a random experiment.
– The sample space $\Omega$ defines all of the possible outcomes, the event space $\mathcal{A}$ defines all possible things that could be observed as a result of an experiment, and the probability P defines the degree of belief or evidential support associated with the experiment.
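For a finite experiment the three parts of a probability space can be written out explicitly. The Python sketch below does this for one roll of a fair die, taking the event space to be all subsets of the sample space and assigning equal probability to each outcome. Both choices are illustrative assumptions. It also shows that the probability function accepts events (sets), not bare outcomes.

```python
from fractions import Fraction
from itertools import combinations

# Sample space: the six faces of a die.
omega = frozenset(range(1, 7))

# Event space: here, every subset of the sample space.
event_space = [
    frozenset(subset)
    for size in range(len(omega) + 1)
    for subset in combinations(sorted(omega), size)
]

def prob(event):
    """Probability function whose domain is the event space (sets of outcomes)."""
    if not isinstance(event, (set, frozenset)) or not event <= omega:
        raise TypeError("prob is defined on events (subsets of omega), not on outcomes")
    return Fraction(len(event), len(omega))

print(len(event_space))    # 2**6 subsets
print(prob({3}))           # singleton event: defined
print(prob({2, 4, 6}))     # event 'even number'
try:
    prob(3)                # a bare outcome is not an event
except TypeError as err:
    print("error:", err)
```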

62 Conditional probability

63 Bayes’ theorem

64 Multiplication rule
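The formulas on these three slides did not survive in the transcript. The Python sketch below assumes the standard definitions: conditional probability P(A|B) = P(AB)/P(B), the multiplication rule P(AB) = P(A|B)P(B), and Bayes' theorem P(B|A) = P(A|B)P(B)/P(A). It checks them on a hypothetical two-dice example.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered results of rolling two fair dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    """Classical probability of an event (a subset of omega)."""
    return Fraction(len(event), len(omega))

def conditional(a, b):
    """P(A | B) = P(AB) / P(B), defined when P(B) > 0."""
    return prob(a & b) / prob(b)

# Hypothetical events chosen for illustration.
A = {o for o in omega if o[0] + o[1] == 8}   # the sum is 8
B = {o for o in omega if o[0] % 2 == 0}      # the first die is even

p_a_given_b = conditional(A, B)
p_b_given_a = conditional(B, A)

# Multiplication rule: P(AB) = P(A | B) P(B)
multiplication_rule_holds = prob(A & B) == p_a_given_b * prob(B)

# Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
bayes_value = p_a_given_b * prob(B) / prob(A)

print(p_a_given_b, p_b_given_a, bayes_value, multiplication_rule_holds)
```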

65 Independent events

66 The property of independence of two events A and B and the property that A and B are mutually exclusive are distinct, though related, properties. If A and B are mutually exclusive events, then $AB = \emptyset$ and therefore $P(AB) = 0$, whereas if A and B are independent events, then $P(AB) = P(A)P(B)$. Events A and B will be both mutually exclusive and independent only if $P(AB) = P(A)P(B) = 0$, that is, at least one of A or B has zero probability.

67 But if A and B are mutually exclusive events and both have nonzero probabilities, then it is impossible for them to be independent events. Likewise, if A and B are independent events and both have nonzero probabilities, then it is impossible for them to be mutually exclusive.
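A single roll of a fair die gives simple examples of both situations. In the Python sketch below, the events are hypothetical choices: "even" and "at most 2" are independent but not mutually exclusive, while "shows 1" and "shows 2" are mutually exclusive but not independent.

```python
from fractions import Fraction

omega = frozenset(range(1, 7))   # one roll of a fair die

def prob(event):
    return Fraction(len(event), len(omega))

def independent(a, b):
    """True if P(AB) = P(A) P(B)."""
    return prob(a & b) == prob(a) * prob(b)

def mutually_exclusive(a, b):
    """True if AB is the empty set."""
    return len(a & b) == 0

even = frozenset({2, 4, 6})
at_most_two = frozenset({1, 2})
one = frozenset({1})
two = frozenset({2})

# Independent, not mutually exclusive: P(AB) = 1/6 = (1/2)(1/3).
print(independent(even, at_most_two), mutually_exclusive(even, at_most_two))

# Mutually exclusive, not independent: P(AB) = 0 but P(A)P(B) = 1/36.
print(independent(one, two), mutually_exclusive(one, two))
```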

68 Reading assignments
IPSUR
– Chapter 2
– Chapter 3: 3.1.1, 3.1.3, 3.1.4; 3.3; 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7
AMIPS
– Chapter 2
– Chapter 3

