Lecture # 2 MATHEMATICAL STATISTICS

Plan of the lecture:
1. Key concepts of mathematical statistics
2. Point estimation of the statistical characteristics of a random variable from a sample
3. Confidence interval estimation of the statistical characteristics of a random variable from a sample
4. Testing of statistical hypotheses
5. Correlation dependence between random variables; regression
6. Estimating the regression function from a sample
7. A method of estimating the correlation dependence between qualitative (non-quantitative) attributes

Mathematical statistics is the science of mathematical methods for analyzing, systematizing, and using statistical data to solve scientific and practical problems. Key concepts of mathematical statistics: 1) An assembly is a collection of objects (assembly elements) having at least one common property. 2) The assembly size is the number of elements in the assembly (denoted N or n).

Two types of assemblies: the entire assembly and the sample assembly (sample). The entire assembly (population) is the largest assembly, uniting all the elements that have at least one common property (attribute). The sample assembly (sample) is the part of the entire assembly selected for study.

For example, the students of our university have at least one common property, "student", so they form an entire assembly. Suppose we want to study their height. We randomly choose 100 of them; these students form a sample assembly.

Conclusions drawn from studying a sample are valuable if they can be applied not only to the elements studied, but also to all elements with the common property, i.e., to all elements of the entire assembly. The entire assembly usually contains a very large, often infinitely large, volume of data, which makes studying it directly impossible. Therefore, only part of the entire assembly is studied.

The results of studying the sample must reflect the properties of the entire assembly. For this, the sample must be REPRESENTATIVE, that is, it must satisfy two conditions: 1) the objects for the sample are selected randomly (in a random manner); 2) the sample size is sufficiently large.

When studying an assembly, we must not merely present all the data in a table, but give numerical characteristics that describe the properties of the assembly. Numerical parameters characterizing assemblies are known as statistical characteristics of an assembly: M(X) - mathematical expectation, D(X) - variance, σ(X) - standard deviation.

When studying a sample, it is impossible to determine exactly the values of the statistical characteristics of the entire assembly, but these values can be estimated with greater or lesser accuracy. The values obtained from a sample assembly are used instead of the true values of the statistical characteristics of the entire assembly; they are known as estimates (sample estimates) of the statistical characteristics.

For the entire assembly, the values of the statistical characteristics are called TRUE VALUES: M(X), D(X), σ(X). For the sample assembly they are called ESTIMATED VALUES: x̄ (sample mean), S²(X) (sample variance), S(X) (sample standard deviation).

Estimation of statistical characteristics from a sample can be carried out by two methods:
1. Method of point estimation of the statistical characteristic
2. Method of confidence interval estimation of the statistical characteristic

1. Method of point estimation of the statistical characteristics of a random variable from a sample

1) The optimal sample estimate of the mathematical expectation of random variable X is the sample mean x̄:

x̄ = (x1 + x2 + … + xn)/n

where X is a random variable; x1, x2, …, xn are the observed values of X; n is the sample size.

2) The optimal sample estimate of the variance of random variable X, S²(X):

S²(X) = Σ (xi − x̄)² / (n − 1),  the sum taken over i = 1, 2, …, n

3) The optimal sample estimate of the standard deviation of variable X, S(X):

S(X) = √S²(X)

4) The error in the mean (standard error), m:

m = S(X)/√n
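As an illustration, here is a minimal sketch in Python of how these point estimates could be computed; the sample values are invented for the example.

import math

# Hypothetical sample: heights (cm) of 10 randomly chosen students
sample = [172.0, 168.5, 181.2, 175.3, 169.8, 177.1, 173.4, 166.9, 179.0, 174.6]

n = len(sample)                                            # sample size n
mean = sum(sample) / n                                     # sample mean x̄
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance S²(X)
std_dev = math.sqrt(variance)                              # sample standard deviation S(X)
std_error = std_dev / math.sqrt(n)                         # error in the mean m = S(X)/√n

print(f"mean = {mean:.2f}, variance = {variance:.2f}, "
      f"std dev = {std_dev:.2f}, std error = {std_error:.2f}")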

2. Method of confidence interval estimation of the statistical characteristic of a random variable from a sample. The estimate of a statistical characteristic found from a sample is a random value, and its deviation from the true value of the characteristic can be arbitrarily large. Therefore, an interval can be determined within which the true value falls with a certain acceptable probability α. Such an interval is known as a confidence interval.

A confidence interval for a statistical characteristic is a random interval which covers the true value of the characteristic with the given probability α. The boundaries of the confidence interval are determined entirely by the results of the trials and are therefore also random values. The probability α is known as the confidence probability. The probability p = 1 − α is known as the significance level. In medical and biological studies, it is usually taken that α = 0.95, p = 1 − 0.95 = 0.05.

Confidence interval for the mathematical expectation M(X) (when the random variable has a normal distribution).

1) If the variance is estimated from the sample fairly accurately (n ≥ 30), the sample variance estimate can be accepted as the true variance, i.e., D(X) can be considered known. In this case

x̄ − t(α)·S(X)/√n < M(X) < x̄ + t(α)·S(X)/√n

where x̄ is the sample mean; n is the sample size; t(α) is the argument of the Laplace function. If α = 0.95, then t(α) = 1.96.

2) If the variance D(X) is unknown and n < 30, then

x̄ − t(α, k)·S(X)/√n < M(X) < x̄ + t(α, k)·S(X)/√n

where S(X) is the sample estimate of the standard deviation; k is the number of degrees of freedom (k = n − 1); t(α, k) is Student's coefficient (taken from the table).
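A minimal sketch of both cases in Python, using scipy.stats in place of the printed tables; the sample reuses the hypothetical heights from above.

import math
from scipy import stats

sample = [172.0, 168.5, 181.2, 175.3, 169.8, 177.1, 173.4, 166.9, 179.0, 174.6]
n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))

alpha = 0.95                               # confidence probability

# Case 1 (n >= 30): t(α) is taken from the Laplace (standard normal) table
t_norm = stats.norm.ppf((1 + alpha) / 2)   # 1.96 for α = 0.95

# Case 2 (n < 30): Student's coefficient t(α, k) with k = n − 1 degrees of freedom
t_student = stats.t.ppf((1 + alpha) / 2, df=n - 1)

# Here n = 10 < 30, so case 2 applies
half_width = t_student * s / math.sqrt(n)
print(f"M(X) lies in ({mean - half_width:.2f}, {mean + half_width:.2f}) "
      f"with probability {alpha}")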

Testing of Statistical Hypotheses. Point and interval estimation of the statistical characteristics of the assemblies studied is the most common initial stage of statistical data analysis. The next stage consists in stating and testing hypotheses about the assemblies studied. Testing of statistical hypotheses is a large and significant branch of mathematical statistics. Let us consider a simple problem of this kind: testing the hypothesis that the difference between the mean values of two sample assemblies is valid.

This problem often occurs in practice when sampling is done under different conditions. For example, one sample contains measurements of a quantity affected by a certain factor, whereas the other was not affected. It is necessary to find out whether the differences are purely random, or whether they are due to the different conditions under which the samples were taken. In other words, it is necessary to find out whether the samples studied belong to one entire assembly or to different ones. In the former case, the differences between the sample means are purely random; in the latter case, the difference of the sample means is valid.

To answer the question posed, the following procedure is applied. We compute the values

T = |x̄1 − x̄2| / √( ((n1 − 1)·S1² + (n2 − 1)·S2²)/(n1 + n2 − 2) · (1/n1 + 1/n2) ),  k = n1 + n2 − 2

where k is the number of degrees of freedom; n1 and n2 are the sizes of the samples compared; S1 and S2 are the sample estimates of the standard deviations of the first and second samples, respectively; x̄1 and x̄2 are the sample means. Then, using the table of Student's coefficients, for the computed value of k and the specified significance level p, we find the value of Student's coefficient t. If T > t, one can say that the difference of the sample means is valid: it cannot be accounted for by random factors alone, and the samples stem from different entire assemblies. If T < t, the difference is invalid.

If the sizes of both samples are large and approximately equal, a simpler formula can be used to compute T:

T = |x̄1 − x̄2| / √(m1² + m2²)

where m1 and m2 are the errors in the mean for the first and second samples, respectively. We emphasise that this method of determining the validity of the difference of two sample means is strictly valid only when the variables X1 and X2 have normal distributions.
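As a sketch, the same comparison can be done in Python with scipy's pooled two-sample t test; the two samples below are invented for illustration, and the library reports a p-value rather than comparing T with a tabulated t, which is an equivalent decision rule.

from scipy import stats

# Hypothetical samples measured under two different conditions
sample1 = [172.0, 168.5, 181.2, 175.3, 169.8, 177.1, 173.4, 166.9, 179.0, 174.6]
sample2 = [165.1, 170.2, 168.4, 163.9, 171.5, 166.8, 169.3, 164.7, 167.0, 168.9]

# Pooled two-sample t test (assumes normal distributions, as the lecture notes)
T, p_value = stats.ttest_ind(sample1, sample2)

significance_level = 0.05
if p_value < significance_level:
    print(f"T = {T:.2f}, p = {p_value:.4f}: the difference of sample means is valid")
else:
    print(f"T = {T:.2f}, p = {p_value:.4f}: the difference is invalid")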

CORRELATION DEPENDENCE BETWEEN RANDOM VARIABLES. REGRESSION. If we consider random variables, the relationship between them need not be functional. For example, children grow taller with age, i.e., there is an objective dependence between the height and age of children. At the same time, this dependence is not functional, since children of the same age have significantly different heights. Clearly, the height at a given age is a continuous random variable having a certain distribution. In this case, the relationship between the variables consists in that each value of one random variable is related to a certain distribution law of the other variable. Here we refer not to the unconditional probability density of the random variable, but to its conditional probability density.

The conditional probability density of variable Y, f(Y/x), is the probability density of Y at a given value x of variable X. If there exist conditional probability densities f(Y/x) of variable Y and φ(X/y) of variable X, then a correlation dependence is said to exist between variables X and Y.

If the probability density of random variable Y depends on the value of random variable X, then the expectation of Y also depends on this value, so we can speak of the conditional expectation M(Y/x) of random variable Y at the given value x of variable X. Hence, the conditional expectation of Y is a function of X, or in mathematical form

M(Y/x) = Ψ(x)

where the function Ψ(x) is known as the regression function of Y on X. The graph of the regression function is known as the regression line. The constant factors in the mathematical expression for Ψ(x) are known as regression coefficients. The regression function of X on Y is introduced similarly: if M(X/y) = ξ(y), then the function ξ(y) is the regression function of X on Y. In the majority of cases, the regression line of Y on X and that of X on Y are different lines.

Estimating the Regression Function from a Sample. A strict definition of the regression function involves studying the entire assembly, which is practically impossible. Therefore, an important task is to estimate the regression function from experimental data, i.e., from a sample. Let there be a sample of n elements, for each of which the values of random variables Y and X are defined, it being assumed that there is a correlation dependence between these variables. If we plot the points with coordinates (xi, yi) (i = 1, 2, …, n) on the coordinate plane XOY, we obtain the so-called correlation field. A visual study of the correlation field can serve as a basis for selecting an appropriate analytical expression for the regression function.

To define an optimal analytical expression for the regression function means to find the values of the regression coefficients. The more complex the analytical expression selected for the regression function, and the more coefficients it contains, the more involved the task of estimating these coefficients from the sample. The task of computing the regression coefficients is simplest in the case of a linear regression function. The correlation field points almost never fall exactly along a straight line. Therefore, when selecting a linear function as the regression function, the assumption of linearity of the regression function must first be substantiated.

Using experimental data, one finds the sample estimate of the correlation coefficient (the sample correlation coefficient) in the form

R = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )

where R is the sample correlation coefficient. The values of the sample correlation coefficient lie in the interval −1 ≤ R ≤ 1. If R > 0, the regression functions of Y on X and of X on Y are increasing functions; if R < 0, they are decreasing ones.
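A minimal sketch of computing the sample correlation coefficient in Python; the paired age–height observations are invented for the example.

import math

# Hypothetical paired observations: age (years) and height (cm)
ages    = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
heights = [115, 121, 128, 132, 139, 144, 151, 156, 163, 168]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(heights) / n

# R = Σ(xi − x̄)(yi − ȳ) / √(Σ(xi − x̄)² · Σ(yi − ȳ)²)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights))
var_x = sum((x - mean_x) ** 2 for x in ages)
var_y = sum((y - mean_y) ** 2 for y in heights)
R = cov / math.sqrt(var_x * var_y)

print(f"R = {R:.3f}")  # close to 1: strong (nearly linear) correlation dependence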

The closer the value of |R| to unity, the more closely the correlation field points cluster about a straight line, giving more reason to consider the regression function linear. In this case, we speak of a strong correlation dependence. The closer the value of |R| to zero, the more loosely the correlation field points cluster about a straight line, giving less reason to consider the regression function linear. At the same time, a small value of the correlation coefficient by no means implies the absence of a correlation dependence between variables Y and X; it implies only that there is no reason to consider this dependence linear. Therefore, the correlation coefficient is a measure of the linearity of the dependence between random variables, not a measure of the degree of dependence between these variables in general.

When the correlation coefficient is large, a linear function, i.e., the dependence y = ax + b, can be used to describe the regression function. To determine this function, the regression coefficients should be estimated. Optimal estimates of the regression coefficients are obtained by the least squares method. The essence of the least squares method is that the best estimates of the regression coefficients for the function y = Ψ(x) are considered to be those for which the sum

Σ (yi − Ψ(xi))²

takes the least value. For the particular case of linear regression y = ax + b, the values of the coefficients a and b are found by minimising the sum

F(a, b) = Σ (yi − a·xi − b)²

For this, the partial derivatives of this expression with respect to a and b are set to zero, and the resulting system of equations is solved. As a result, we obtain the sample estimates of the regression coefficients a and b:

a = (n·Σ xi·yi − Σ xi · Σ yi) / (n·Σ xi² − (Σ xi)²),  b = ȳ − a·x̄

The latter expression for a can be rearranged as

a = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
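A sketch of the least squares fit in Python, applying the closed-form expressions above to the same hypothetical age–height data.

# Continuing with the hypothetical age–height data
ages    = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
heights = [115, 121, 128, 132, 139, 144, 151, 156, 163, 168]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(heights) / n

# a = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²,  b = ȳ − a·x̄
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, heights)) \
    / sum((x - mean_x) ** 2 for x in ages)
b = mean_y - a * mean_x

print(f"regression of Y on X: y = {a:.2f}*x + {b:.2f}")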

In the case of regression of X on Y, the regression function has the form x = a1·y + b1, and the regression coefficients a1 and b1 are found from the formulas

a1 = Σ (xi − x̄)(yi − ȳ) / Σ (yi − ȳ)²,  b1 = x̄ − a1·ȳ

Note that the regression lines of Y on X and of X on Y coincide only if |R| = 1. In this case, there is a linear functional relation between variables Y and X.

The method described above for studying the relations between the characteristics (attributes) of a sample is applicable only to quantitative attributes. At the same time, it is often necessary to investigate the relations between characteristics of other types, whose values are not expressed quantitatively. Let us consider a method of estimating the correlation dependence when at least one of the attributes is not quantitative. One way of defining a qualitative (non-quantitative) attribute for the sample elements is to compare them according to the principle "more or less", i.e., to rank them.

In this case we compute the rank correlation coefficient (Spearman correlation coefficient) by the formula

ρ = 1 − 6·Σ (xi − yi)² / (n·(n² − 1))

where xi is the rank of the i-th element of the sample with respect to one attribute, yi is the rank of the same element with respect to the other attribute (i = 1, 2, …, n), and n is the sample size. The Spearman correlation coefficient is a measure of the correlation dependence between the attributes: the greater the modulus of the rank correlation coefficient, the closer the relationship.
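A closing sketch in Python: the formula above computed directly and checked against scipy.stats.spearmanr; the ranks are invented for the example.

from scipy import stats

# Hypothetical ranks of 8 sample elements by two attributes
rank_x = [1, 2, 3, 4, 5, 6, 7, 8]
rank_y = [2, 1, 4, 3, 6, 5, 8, 7]

n = len(rank_x)
# ρ = 1 − 6·Σ(xi − yi)² / (n·(n² − 1))
rho = 1 - 6 * sum((x - y) ** 2 for x, y in zip(rank_x, rank_y)) / (n * (n ** 2 - 1))

rho_scipy, _ = stats.spearmanr(rank_x, rank_y)
print(f"rho = {rho:.3f} (scipy: {rho_scipy:.3f})")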