Statistical techniques 4 Credits [3-1-0] Course coordinator: Prateek Sharma.


2 Statistical techniques 4 Credits [3-1-0] Course coordinator: Prateek Sharma

3 Learning Objectives Understand the need for studying environmental statistics; become aware of a wide range of applications of statistics in environmental management & decision making; define statistics; differentiate between descriptive and inferential statistics

4 Course Domain The course is intended to help students develop a comprehensive and understandable framework for applying statistical methods to various types of environmental problems. It includes –Grasping the language of statistics through descriptive statistics –Developing a sampling design to determine "where to sample", "when to sample", "how to sample" and "how much to sample" –Understanding the theory behind the connection between the sample and the population –Making generalizations and inferences about the population from the collected samples –Developing statistical models for environmental decision making and data analysis Thus the typical uses of statistical methods are analyzing environmental monitoring data, describing the frequency distribution of exposures of the population, ascertaining the degree of compliance of environmental monitoring data with standards, and predicting the impact of pollutant source reductions on the quality of the environment.

5 Course outline Introduction Relevance of statistics, mathematical models – deterministic and stochastic; random variables; populations and samples; parameters and statistics. Review of basic concepts Measurement theory, levels of measurement; numerical measures of data; graphical presentation of data; Chebyshev’s theorem; measurement uncertainty. Probability theory Probability concepts; axioms of probability; probability distribution functions and their applications – discrete and continuous distributions.

6 Data sampling Methods for selecting sampling locations and times; types of sampling designs –probability and non-probability sampling; sampling theory, sampling distributions; parameter estimation, point and interval estimates; sample size determination for different sampling designs. Tests of hypothesis Hypothesis testing – parametric and non-parametric tests. Quality assurance and quality control Quality assurance, internal and external quality control; control charts – description and theory, application and limitations; outlier detection – different tests for outlier detection; errors, different types of errors; error propagation.

7 Data analysis Exploratory data analysis; techniques for smoothing data; correlation, serial correlation; parameter estimation using method of least squares; empirical model building by linear regression; coefficient of determination; calibration; trend analysis - detecting and estimating trend, trends and seasonality.

8 Evaluation criteria 1. Two minor exams, each of 15% weightage. Tentative dates: First Minor: February 14, 2011; Second Minor: March 28, 2011. 2. Assignment of 20% weightage. 3. Major examination of 50% weightage. Tentative date: May 09, 2011.

9 Need for studying environmental statistics Consider the following questions: What is the probability of exceedance of the NAAQS for a criteria pollutant at a given receptor location? How many soil samples should be collected in an area contaminated by a toxicant to give 95% assurance that a threshold limit is not accidentally overlooked? What is the probability that a water quality standard is violated in a given 24-hour period as a result of effluent discharged into a river? At one of the National Ambient Air Quality Monitoring Stations (NAAQMS) run by the Central Pollution Control Board (CPCB) in Delhi, the probabilities that the National Ambient Air Quality Standard (NAAQS) will be exceeded for the pollutants CO, NOx and SOx are 0.24, 0.19 and 0.09 respectively; the probability that the NAAQS is exceeded for both CO and NOx is 0.06, for both CO and SOx is 0.16, and for both NOx and SOx is 0.11; and the probability of NAAQS exceedance for CO, NOx as well as SOx is 0.04. –Determine the probability that the NAAQS is exceeded for any of the pollutants.

10 A company investing in renewable energy uses three different brands of hydropower turbines. Of its total installation, 50% are brand 1, 30% are brand 2 and 20% are brand 3. Each manufacturer offers a 1-year warranty on parts and labour. It is known that 25% of brand 1 turbines require repair within the warranty period, whereas the corresponding percentages for brands 2 and 3 are 20% and 10% respectively. If a randomly selected turbine needs repair under warranty, what is the probability that it is a brand 1 turbine? If the probability of an allergic reaction to a certain drug is 0.001, compute the chance that more than 2 out of 2000 individuals will have an allergic reaction.
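
The two questions on this slide can be sketched numerically. This is a minimal illustration rather than a model solution: the first part applies Bayes' theorem to the brand shares and repair rates given above, and the second part assumes a Poisson approximation to the binomial with mean n × p. The variable names are chosen here for illustration.

```python
from math import exp, factorial

# Turbine problem: share of each brand and P(repair | brand), as given on the slide.
prior = {"brand 1": 0.50, "brand 2": 0.30, "brand 3": 0.20}
p_repair = {"brand 1": 0.25, "brand 2": 0.20, "brand 3": 0.10}

# Total probability that a randomly selected turbine needs repair.
p_any_repair = sum(prior[b] * p_repair[b] for b in prior)

# Bayes' theorem: P(brand | repair) = P(brand) * P(repair | brand) / P(repair).
posterior = {b: prior[b] * p_repair[b] / p_any_repair for b in prior}

# Drug problem (assumption: Poisson approximation with mean n * p).
lam = 2000 * 0.001
p_more_than_2 = 1 - sum(exp(-lam) * lam**k / factorial(k) for k in range(3))

print(round(posterior["brand 1"], 4))  # 0.6098
print(round(p_more_than_2, 4))         # 0.3233
```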

11 A fragment of sandstone has been found in a streambed, and a student field party has to search for the deposit. Unfortunately, the source of the rock cannot be identified with certainty because it was found below the juncture of two dried stream tributaries. The drainage basin of the larger stream contains about 18 km², while the basin drained by the smaller stream includes only about 10 km². However, an examination of a geologic report and map of the region discloses the additional information that about 35% of the rock outcrops in the larger basin are of marine origin while almost 80% of the rock outcrops in the smaller basin are of marine origin, the remainder being of the igneous type. What is the likelihood that the fragment of sandstone came from the smaller basin?

12 Contd… In a suburban neighbourhood in Mumbai, 20% of the homes have slab foundations, and 80% do not. Research studies suggest that 75% of the homes with slab foundations in this region have indoor radon problems due to intrusion of radon gas from the soil beneath. Of the homes without slab foundations, 15% have indoor radon problems due to the remaining indoor sources of radon. –What is the probability that a house reporting a radon problem has a slab foundation? –What is the probability that a house not reporting a radon problem has a slab foundation? If 10 in 100 cars in Delhi violate the emission norms, what is the probability that a traffic police inspector, who randomly selects 4 cars for inspection, will catch –none of the cars that violate the emission norms; –one of the cars that violate the emission norms; –at least three cars that violate the emission norms.
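
The car inspection question can be sketched with the binomial distribution. The sketch assumes that each selected car violates the norms independently with probability 0.10 (reasonable when the fleet is much larger than the sample of 4); `binom_pmf` is a helper defined here for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k violators among n independently inspected cars."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 4, 0.10  # 4 cars inspected; 10 in 100 cars violate the norms

p_none = binom_pmf(0, n, p)
p_one = binom_pmf(1, n, p)
p_at_least_three = binom_pmf(3, n, p) + binom_pmf(4, n, p)

print(round(p_none, 4), round(p_one, 4), round(p_at_least_three, 4))
# 0.6561 0.2916 0.0037
```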

13 Contd… A Quality Control (QC) team from a government agency was assigned to assess the measurement process for nitrate concentration at a laboratory. The QC team randomly inserted 15 specimens having a known concentration of 10.0 mg/L into the routine work of the laboratory, over a period of one week. The work was arranged so that the observed values would be random and independent. The chemists were unaware that their performance was being assessed. The results in the order of observation, in mg/L, were: 8.8, 9.9, 10.7, 11.7, 10.1, 8.6, 11.4, 12.1, 10.7, 6.9, 7.4, 7.2, 12.7, 8.5, 10.3. –Assuming the replication process averages out the random errors, estimate the bias associated with the measurement process.
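
A minimal sketch of the bias estimate, taking the bias to be the mean of the observed values minus the known concentration, as the question's assumption suggests. The sample standard deviation is included only as a rough indication of the random error.

```python
from statistics import mean, stdev

known_concentration = 10.0  # mg/L, the known value of the inserted specimens
observed = [8.8, 9.9, 10.7, 11.7, 10.1, 8.6, 11.4, 12.1,
            10.7, 6.9, 7.4, 7.2, 12.7, 8.5, 10.3]

# Estimated bias: average observed value minus the known value.
bias = mean(observed) - known_concentration

# Spread of the individual measurements (random error), for context.
spread = stdev(observed)

print(round(bias, 2))  # -0.2
```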

14 The Department of Agriculture routinely projects future grain production, based on estimates from samples. A sample of 100 plots (50 hectares each) produced a mean of 2.16 tonnes / acre. The Department assumes a population standard deviation of 0.278 tonnes. Calculate the 95% confidence interval. It is understood that in a certain area 60% of the population have secondary sources of income other than farming, their primary source of income. A random sample of 600 persons from the site indicates that 354 (59%) have off-farm jobs. Determine whether the proportion of these farmers holding secondary jobs is truly 60%, using α = 0.01.
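
Both questions on this slide can be sketched with the standard normal distribution. The sketch assumes a known population standard deviation for the confidence interval and a two-sided z-test for the proportion; these are one reasonable reading of the problems, not the only one.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# 95% confidence interval for the mean yield (population sigma assumed known).
n, sample_mean, sigma = 100, 2.16, 0.278
z_crit = std_normal.inv_cdf(0.975)
half_width = z_crit * sigma / sqrt(n)
ci = (sample_mean - half_width, sample_mean + half_width)

# Two-sided z-test of H0: p = 0.60 against H1: p != 0.60 at alpha = 0.01.
n_persons, with_jobs, p0, alpha = 600, 354, 0.60, 0.01
p_hat = with_jobs / n_persons
z_stat = (p_hat - p0) / sqrt(p0 * (1 - p0) / n_persons)
p_value = 2 * (1 - std_normal.cdf(abs(z_stat)))
reject_h0 = p_value < alpha

print(round(ci[0], 4), round(ci[1], 4))
print(round(z_stat, 2), reject_h0)
```

Under these assumptions the z statistic is about -0.5, well inside the acceptance region, so the sample gives no evidence against the 60% figure.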

15 Average expected price of a transformer part over the next year is projected by a journal to be no more than Rs. 70. A manufacturer consults 15 market experts to obtain the mean projected price as Rs 75 with a standard deviation of Rs 5. Using α = 0.01, test the hypothesis that the population mean is Rs 70. A leading GIS company arranged a special summer training programme for students of a reputed University. The scores obtained by a random sample of 10 students are given below. Use α = 0.10 to determine whether there is a significant improvement in knowledge of the students after attending the training programme.

16 Four students (A-D) each perform an analysis in which exactly 10.00 ml of exactly 0.1 M sodium hydroxide is titrated with exactly 0.1 M hydrochloric acid. Each student performs five replicate titrations, with the results shown in the table below. Comment on the accuracy, bias, and precision of each student.
Student A  Student B  Student C  Student D
10.08      9.88       10.19      10.04
10.11      10.14      9.79       9.98
10.09      10.02      9.69       10.02
10.10      9.80       10.05      9.97
10.12      10.21      9.78       10.04
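
One way to start on this question is to summarise each student's results by the mean error (an estimate of bias) and the sample standard deviation (a measure of precision). The sketch below assumes the correct titre is 10.00 ml, which follows from the equal concentrations of acid and base; the dictionary layout is chosen here for illustration.

```python
from statistics import mean, stdev

true_volume = 10.00  # ml, assumed correct titre for equal-strength acid and base

titrations = {
    "A": [10.08, 10.11, 10.09, 10.10, 10.12],
    "B": [9.88, 10.14, 10.02, 9.80, 10.21],
    "C": [10.19, 9.79, 9.69, 10.05, 9.78],
    "D": [10.04, 9.98, 10.02, 9.97, 10.04],
}

# For each student: (estimated bias, standard deviation of the replicates).
summary = {
    student: (mean(values) - true_volume, stdev(values))
    for student, values in titrations.items()
}

for student, (bias, sd) in summary.items():
    print(f"Student {student}: bias = {bias:+.2f} ml, s = {sd:.3f} ml")
```

A small bias with a small standard deviation indicates results that are both accurate and precise; a large standard deviation indicates poor precision regardless of the bias.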

17 Contd… Following is the general assessment of the flow and concentration fluctuations for four different streams.
Stream  Concentration fluctuation  Flow fluctuation
A       Small                      Large
B       Large                      Large
C       Small                      Small
D       Large                      Small
Suggest a suitable choice of sampling method/technique for collecting a representative sample from the respective water matrix. An estimate is required for the average DO in similar streams in a region. What is the minimum number of observations required to estimate the mean DO within ± 0.5 mg/L with a given level of confidence, say 95%?

18 The average annual rainfall at a certain locality is 30.0 inches. This value has been established from a long history of weather data. In recent years, certain climatological changes seem to be affecting, among other things, the annual precipitation. It is hypothesized that the annual rainfall has in fact increased. The past 8 years have yielded the following annual precipitation (inches): 34.1, 33.7, 27.4, 31.1, 30.9, 35.2, 28.4, 32.1. Can we conclude from the above data that there is an increase in annual rainfall? The discharge permit for an industry requires the monthly average COD concentration to be less than 50 mg/L. For this, 20 measurements are taken each month. For the following 20 observations, would the industry be in compliance with the standard? 45, 63, 56, 55, 52, 49, 44, 49, 56, 71, 44, 51, 50, 49, 42, 46, 52, 59, 48, 51.
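
Both questions can be approached with a one-sample t statistic, sketched below. This assumes the observations are independent and roughly normally distributed; `one_sample_t` is a helper defined here for illustration, and the resulting statistic still has to be compared with a tabulated critical value for n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """t statistic for H0: population mean equals mu0."""
    n = len(data)
    return (mean(data) - mu0) / (stdev(data) / sqrt(n))

rainfall = [34.1, 33.7, 27.4, 31.1, 30.9, 35.2, 28.4, 32.1]
t_rain = one_sample_t(rainfall, 30.0)  # 7 degrees of freedom

cod = [45, 63, 56, 55, 52, 49, 44, 49, 56, 71,
       44, 51, 50, 49, 42, 46, 52, 59, 48, 51]
t_cod = one_sample_t(cod, 50.0)        # 19 degrees of freedom

print(round(mean(rainfall), 4), round(t_rain, 2))
print(round(mean(cod), 1), round(t_cod, 2))
```

For the rainfall data the statistic is roughly 1.67, which is below the commonly tabulated one-sided 5% critical value of 1.895 for 7 degrees of freedom, so under these assumptions the data alone would not establish an increase.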

19 A small lake is fed by streams from a watershed that has a high density of commercial land use (CLU), and a watershed that is mainly residential (RLU). The historical data below for chloride concentration (in mg/L) were collected at random intervals over a period of four years. Are the chloride concentrations of the two streams different?
CLU: 140 134 130 132 135 145 118 157
RLU: 120 114 142 100 100 92 122 97 145 130
Two atomic absorption spectrophotometers (AAS) were used to determine antimony in the atmosphere. For samples from an urban atmosphere the following results were obtained (in mg/m³):
Sample No.  1     2     3     4     5     6
AAS No. 1   22.2  19.2  15.7  20.4  19.6  15.7
AAS No. 2   25.0  19.5  21.3  20.7  16.6  16.8
Can we conclude that the two AASs have the same precision?
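
A comparison of precision is usually framed as a comparison of variances. The sketch below computes the variance-ratio (F) statistic for the two instruments; it assumes the two sets of readings can be treated as independent, normally distributed samples, which may not hold if both instruments measured the same six samples.

```python
from statistics import variance

aas_1 = [22.2, 19.2, 15.7, 20.4, 19.6, 15.7]
aas_2 = [25.0, 19.5, 21.3, 20.7, 16.6, 16.8]

var_1 = variance(aas_1)  # sample variance, n - 1 in the denominator
var_2 = variance(aas_2)

# Put the larger variance in the numerator so that F >= 1.
f_stat = max(var_1, var_2) / min(var_1, var_2)
df = (len(aas_2) - 1, len(aas_1) - 1)

print(round(var_1, 3), round(var_2, 3), round(f_stat, 2), df)
```

The statistic is then compared with a tabulated F value for (5, 5) degrees of freedom at the chosen significance level.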

20 A large portion of chromium contaminated water was divided into 20 identical samples. Five samples were sent to each of four laboratories and the following data were produced. Are the laboratories making consistent measurements?
Laboratory I  Laboratory II  Laboratory III  Laboratory IV
26.1          18.3           19.1            30.7
21.5          19.7           13.9            27.3
22.0          18.0           15.7            20.9
22.6          17.4           18.6            29.0
24.9          22.6           19.1            20.9
The numbers of glassware breakages reported by five laboratory workers in an Environmental Monitoring Laboratory over a given period are shown below. Is there any evidence that the workers differ in their reliability? 24, 17, 11, 9, 19.
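
The glassware question can be sketched as a chi-square goodness-of-fit test. The null hypothesis assumed here is that all five workers are equally reliable, so the expected count for each is the overall mean.

```python
breakages = [24, 17, 11, 9, 19]

# Under H0 (equal reliability) every worker has the same expected count.
expected = sum(breakages) / len(breakages)

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi_sq = sum((observed - expected) ** 2 / expected for observed in breakages)
df = len(breakages) - 1

print(chi_sq, df)  # 9.25 4
```

The statistic of 9.25 falls just below the commonly tabulated 5% critical value of 9.488 for 4 degrees of freedom, so under this test the evidence of a difference between workers is not quite significant at that level.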

21 Two different heating systems (natural gas and cogeneration) are offered for greenhouse use. An agricultural engineer is interested in determining if there is a difference in cost of operation between the two systems. A sample of 16 greenhouses using natural gas produces an average annual cost of Rs. 1750000 with a standard deviation of Rs. 40000. Another sample of 14 greenhouses, with space capacity equal to that of the first sample, produces an average annual cost of Rs. 1640000 with a standard deviation of Rs. 50000. Can we conclude that the average costs of the two systems are different?
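
A minimal sketch of a two-sample t statistic computed from the summary figures above. It assumes independent samples, roughly normal costs, and equal population variances (so that a pooled variance is appropriate); other choices, such as an unequal-variance test, are equally possible.

```python
from math import sqrt

# Summary statistics from the slide (costs in Rs.).
n1, mean1, sd1 = 16, 1_750_000, 40_000  # natural gas
n2, mean2, sd2 = 14, 1_640_000, 50_000  # second sample

# Pooled variance under the equal-variance assumption.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df

# Standard error of the difference in means, then the t statistic.
std_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean1 - mean2) / std_error

print(round(t_stat, 2), df)
```

The statistic is compared with the two-sided critical value of the t distribution with 28 degrees of freedom.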

22 The following table gives the logarithms of N = 60 total suspended particulate (TSP) air data that were collected on five randomly selected days each month at one of the ambient air quality stations. How can we check the quality control of these data?
n_i / Month  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
1            4.1  2.9  2.5  3.2  3.1  2.4  2.1  2.4  2.8  3.2  3.8  3.9
2            3.8  3.0  2.6  3.1  3.2  2.5  2.1  2.4  2.6  3.2  3.7  4.0
3            3.2  3.1  2.4  2.9  2.8  2.6  2.2  2.5  2.9  3.4  3.8
4            3.5  2.8  2.4  2.7  2.4  2.3  2.4  2.6  3.0  3.3  3.4  3.7
5            3.1  3.0  2.5  3.7  2.9  2.5  2.3  2.7  3.1       3.6  3.9
The following values were obtained for the nitrite concentration (mg/L) in a sample of stream water: 0.34, 0.36, 0.32, 0.35, 0.50. Can we reject the last measurement, which appears to be suspect (an outlier)?
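
One common outlier check for a small sample like the nitrite data is Dixon's Q test, sketched below. The choice of test is an assumption (the course outline only mentions "different tests for outlier detection"), and the critical value of 0.710 for n = 5 at 95% confidence is taken from commonly published Q tables and should be verified against the table used in the course.

```python
nitrite = [0.34, 0.36, 0.32, 0.35, 0.50]  # mg/L

ordered = sorted(nitrite)
suspect = ordered[-1]  # the largest value is the suspect one

# Q = gap between the suspect value and its nearest neighbour, divided by the range.
gap = suspect - ordered[-2]
data_range = ordered[-1] - ordered[0]
q_stat = gap / data_range

Q_CRIT_N5_95 = 0.710  # assumed tabulated value for n = 5, 95% confidence
reject_suspect = q_stat > Q_CRIT_N5_95

print(round(q_stat, 3), reject_suspect)
```

With Q of about 0.778 exceeding the assumed critical value, this test would reject 0.50 as an outlier.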

23 Yields of maize (in tonnes per hectare) collected from ten field plots and the amount of fertilizer applied (in kilograms per hectare) are furnished in the table below. Check whether X and Y are linearly related at a significance level of α = 0.05.
Yield Y (tonnes/hectare)   5.0  5.7  6.0  6.2  6.3  6.5  6.8  7.0  6.9  6.6
Fertilizer X (kg/hectare)  5    10   12   18   25   30   36   40   45   48
After testing generators that can run on bio-diesel produced at a plant during a demonstration run, the manager wanted to assess whether the number of defects found in the sets follows a Poisson distribution. Apply the χ² test of goodness of fit to the test results given below.
No. of defects                0     1      2  3     4    5
Observed frequency            6     13        8     4    3
Expected frequency (Poisson)  6.24  13.52     9.01  4.5  1.8
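
The maize question can be sketched by fitting a least-squares line and testing the correlation coefficient. The sketch assumes the usual t-test for H0: no linear relationship, with n - 2 degrees of freedom; all variable names are chosen here for illustration.

```python
from math import sqrt

fertilizer = [5, 10, 12, 18, 25, 30, 36, 40, 45, 48]             # X, kg/hectare
maize_yield = [5.0, 5.7, 6.0, 6.2, 6.3, 6.5, 6.8, 7.0, 6.9, 6.6]  # Y, tonnes/hectare

n = len(fertilizer)
mean_x = sum(fertilizer) / n
mean_y = sum(maize_yield) / n

# Corrected sums of squares and cross-products.
s_xx = sum((x - mean_x) ** 2 for x in fertilizer)
s_yy = sum((y - mean_y) ** 2 for y in maize_yield)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fertilizer, maize_yield))

# Least-squares line and correlation coefficient.
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x
r = s_xy / sqrt(s_xx * s_yy)

# t statistic for H0: no linear relationship (n - 2 degrees of freedom).
t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)

print(round(slope, 4), round(intercept, 3), round(r, 3), round(t_stat, 2))
```

The t statistic of about 5.66 exceeds the commonly tabulated two-sided 5% critical value of 2.306 for 8 degrees of freedom, which under these assumptions points to a significant linear relationship.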

24 Wolfer sunspot numbers are an index of activity on the solar surface. They have been investigated for their impact on terrestrial climate and for the resulting environmental effects. Twenty annual observations are listed here for the period 1770-1789: 101 82 66 35 31 7 20 92 154 125 85 68 38 23 10 24 83 132 131 118. Can we ascertain any trend from the data set?

25 The modelling approach The problem fundamental to all modelling studies in physical systems is the identification of the function “F” that would allow the prediction of the pollutant physical quantity of interest C(x, y, z, t) at any point in space (x, y, z) and time (t) if the pollutant loading and other system physical variables are given. Three different approaches have been established to identify “F”: deterministic mathematical modelling (analytical models and numerical models), statistical modelling, and physical modelling.

26 Approaches to analyse any system, phenomenon or process Any phenomenon/process or system: deterministic approach, stochastic approach, physical approach

27 Statistics - Introduction The processing of statistical information has a history that extends back to the beginning of mankind. In early biblical times nations compiled statistical data to provide descriptive information relative to all sorts of things, such as taxes, wars, agricultural crops, and even athletic events. Today, with the development of probability theory, we are able to use statistical methods that not only describe important features of the data but also allow us to proceed beyond the collected data into the area of decision making through generalisations and predictions.

28 Statistics was developed to assist in those areas where laws of cause and effect are not apparent to the observer and where an objective approach is needed. Historically, many environmental studies were qualitative rather than quantitative. However, in recent years the need to develop and use quantitative mathematical analysis has become apparent to environmental researchers and policy makers.

29 What is Statistics… Science of gathering, analyzing, interpreting, and presenting data Facts and figures Measurement taken on a sample It is the technique of drawing inferences about the population from the collected samples Statistics is the science of understanding the “order” behind a “disordered array of numbers”. It provides a tool to understand the “process of generation” of “groups of numbers” in an efficient and objective way.

30 Concept of a random variable A random process When the outcome of a phenomenon, process or experiment is dependent on several causative variables, some of which may not be known to the analyst, the process is known as a random process. Important note: If all of the causative variables were known, and the cause-effect relationships were well understood, then the process would be deterministic. In a deterministic process, the outcome is known to the analyst with certainty.

31 The outcomes of a random process, when represented in terms of numbers, will thus be variable. A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon. There are two types of random variables - discrete and continuous. A discrete random variable may assume either a finite number of values or an infinite sequence of values.

32 Examples Discrete random variable with a finite number of values: Let x = number of TV sets sold at the store in one day, where x can take on 5 values (0, 1, 2, 3, 4). Discrete random variable with an infinite sequence of values: Let x = number of customers arriving in one day, where x can take on the values 0, 1, 2, ... We can count the customers arriving, but there is no finite upper limit on the number that might arrive.

33 A continuous random variable may assume any numerical value in an interval or collection of intervals. Examples include height, weight, the amount of sugar in an orange, the time required to run a mile.

34 Key Definitions A population (universe) is the collection of things under consideration. A sample is a portion of the population selected for analysis. A parameter is a summary measure computed to describe a characteristic of the population. A statistic is a summary measure computed to describe a characteristic of the sample.

35 Population and Sample Population: use parameters to summarise features. Sample: use statistics to summarise features. Inference on the population from the sample.

36 Statistics Descriptive statistics Used for summarising the information contained in an array of numbers (data set) so that interpretations can be made about the underlying data generation process. Inferential statistics Used for drawing conclusions about the population using a representative sample.

37 Descriptive statistics Encompasses the following: –Graphical or pictorial display –Condensation of large masses of data into a form such as tables –Preparation of summary measures to give a concise description of complex information (e.g. an average figure) –Exhibition of patterns that may be found in sets of information

38 Descriptive Statistics Collect data –e.g. Survey Present data –e.g. Tables and graphs Characterize data –e.g. Sample mean = (Σ xᵢ) / n

39 Descriptive statistics Methods concerned with collecting and describing a set of data so as to yield meaningful information. Numerical descriptors (numerical summaries of data) Measures of –Central tendency –Variation (dispersion) –Position –Shape Pictorial/graphical descriptors –Graphs for single variable –Graphs for two or more variables

40 Inferential statistics Methods concerned with the analysis of a subset of data (sample) leading to predictions or inferences about the entire set of data (population). Theory of estimation Theory of hypothesis testing

41 Inferential Statistics Estimation –e.g.: Estimate the population mean weight using the sample mean weight Hypothesis testing –e.g.: Test the claim that the population mean weight is 120 pounds Drawing conclusions and/or making decisions concerning a population based on sample results.

42 Inferential Statistics Especially relates to: –Determining whether characteristics of a situation are unusual or if they have happened by chance –Estimating values of numerical quantities and determining the reliability of those estimates –Using past occurrences to attempt to predict the future

43 Data and Data Sets Data are the facts and figures that are collected, summarized, analysed, and interpreted. The data collected in a particular study are referred to as the data set.

44 Qualitative and Quantitative Data Data can be further classified as being qualitative or quantitative. The statistical analysis that is appropriate depends on whether the data for the variable are qualitative or quantitative. In general, there are more alternatives for statistical analysis when the data are quantitative.

45 Qualitative Data Qualitative data are labels or names used to identify an attribute of each element. Qualitative data use either the nominal or ordinal scale of measurement. Qualitative data can be either numeric or nonnumeric. The statistical analyses available for qualitative data are rather limited.

46 Quantitative Data Quantitative data indicate either how many or how much. –Quantitative data that measure how many are discrete. –Quantitative data that measure how much are continuous because there is no separation between the possible values for the data. Quantitative data are always numeric. Ordinary arithmetic operations are meaningful only with quantitative data.

47 Measurement To bring in objectivity in our decision making towards the solution of a problem we need to “measure” the “process”/“phenomenon”, and “objects”/ “observations” within it. Measurement is the process of assigning numbers to objects or observations. Level of Measurement has to do with precision associated with that level. It depends on the “rules” under which the numbers are assigned.

48 Measurement (contd.) Measurement is important not only in data analysis but also in the selection of the appropriate statistical/mathematical treatment to which the data can be subjected to extract meaningful information. Generally speaking, the differences between the data types affect the choice of statistical technique to be used to analyse the data.

49 Data types and measurement scales Data: non-metric or qualitative (nominal scale, ordinal scale); metric or quantitative (interval scale, ratio scale)

50 Measurement scales Nominal scale Ordinal scale Interval scale Ratio scale

51 Nominal scale Numbers or symbols are used to identify groups or classes to which various objects belong. Provides convenient ways of keeping track of people, objects and events. Used to separate samples for analysis. –e.g. mean pollutant levels from different types of vehicles can be compared to see whether they differ in the amount of lead they emit. Can also be used for frequency analysis. –e.g. examining the number of times flow exceeds the danger level at a river gauging station. Frequencies can only assume discrete (integer) values.

52 Nominal scale (contd.) Arithmetic can be performed on the frequencies but not on the group identification. The categories are mutually exclusive. Although numbers can be used to code each category, these are pure labels and have no value. Therefore, no mathematical operators can be used to extract meaning out of the data.

53 Nominal scale (contd.) Least powerful level of measurement Indicates no order or distance relationship and has no arithmetic origin. Statistical tools that can be employed on nominal data –Mode as a measure of central tendency –Chi-square test –Contingency coefficient as a measure of correlation

54 Ordinal scale Essential feature is that the relative order of the objects or classes can be identified but not quantified. Represents the next higher level of measurement precision. Variables can be ordered or ranked with ordinal scales in relation to the amount of the attribute possessed. –e.g. strength of opinion regarding a particular topic –Complexity of environment Every subclass can be compared with another in terms of a “>” or “<” relationship. Numbers utilised are non-quantitative, since they indicate only relative positions in an ordered series.

55 Ordinal scale (contd.) Ordinal data are on a scale with a defined direction. –e.g. one point can be described as larger than another The scale places events in order, but there is no attempt to make the intervals of the scale equal in terms of some rule. Statistical tools that can be employed on ordinal data –Median is used as a measure of central tendency –Percentile or quartile is used as a measure of dispersion –Non-parametric tests

56 Interval scale In addition to inequalities, the interval sizes (differences) between groups are measurable. Mathematical operators valid: –comparison (inequalities), “+” and “-” Multiplication and division are invalid because the scale has an arbitrary zero, i.e. the 0 on an interval scale does not indicate the complete absence of whatever quantity we are trying to measure. The scale is characterised by a “unit of measurement” that assigns a real number to the relationships (distances) between all pairs of objects or groups. All statistical tools/computations are applicable except those involving ratios, such as the coefficient of variation, geometric mean, harmonic mean etc.

57 Ratio scale Measurements with all the characteristics of an interval scale plus a physically definable zero-point (absolute zero). This scale must contain a zero value that indicates that nothing exists for the variable at the zero point. That is, in addition to setting up inequalities and forming differences, we can also form quotients. Mathematical operators valid: –All customary operators (comparison, “+” and “-”, “×” and “÷”) Most precise of all scales. Examples: length, mass, weight, flow, temperature measured on the Kelvin scale etc.
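
The difference between interval and ratio scales can be illustrated with temperature. The sketch below treats the Celsius scale as an interval scale (its zero is arbitrary) and the Kelvin scale as a ratio scale, as in the example above; the two temperatures are made-up values chosen for illustration.

```python
def celsius_to_kelvin(temp_c):
    """Convert a Celsius temperature to the Kelvin scale."""
    return temp_c + 273.15

t1_c, t2_c = 10.0, 20.0  # hypothetical readings in degrees Celsius
t1_k, t2_k = celsius_to_kelvin(t1_c), celsius_to_kelvin(t2_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = t2_c - t1_c
diff_k = t2_k - t1_k

# Quotients are only meaningful on the ratio (Kelvin) scale:
# 20 degrees C is not "twice as hot" as 10 degrees C.
ratio_c = t2_c / t1_c  # 2.0, an artefact of the arbitrary zero
ratio_k = t2_k / t1_k  # about 1.035, the physically meaningful ratio

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```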

58 Measurement scales – Final remark Proceeding from the nominal scale (the least precise type of scale) to the ratio scale (the most precise), increasingly relevant information is obtained. If the nature of the variables permits, the researcher should use the scale that provides the most precise description. Generally, measurements in the physical sciences are on ratio scales; however, in management and the behavioural sciences measurements are restricted to interval scales. Finally, the interval scale is the first quantitative measurement scale. The nominal scale names and counts objects or their attributes, and the ordinal scale arranges objects.

59 Types of data Primary data Refers to information obtained firsthand on the variables of interest for the specific purpose of the study. Secondary data Refers to information gathered from sources already existing.

60 Data Sources Primary: data collection (observation, experimentation, survey). Secondary: data compilation (print or electronic).

