Data Handling/Statistics There is no substitute for books — you need professional help! My personal favorites, from which this lecture is drawn:
The Cartoon Guide to Statistics, L. Gonick & W. Smith
Data Reduction and Error Analysis for the Physical Sciences, P. R. Bevington
Workshop Statistics, A. J. Rossman & B. L. Chance
Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky & W. T. Vetterling
Origin 6.1 User's Manual, MicroCal Corporation

Outline
- Our motto
- What those books look like
- Stuff you need to be able to look up
  - Samples & Populations
  - Mean, Standard Deviation, Standard Error
  - Probability
  - Random Variables
  - Propagation of Errors
- Stuff you must be able to do on a daily basis
  - Plot
  - Fit
  - Interpret

Our Motto That which can be taught can be learned. The “progress” of civilization relies on being able to do more and more things while thinking less and less about them. (An opposing, non-CMC IGERT viewpoint.)

What those books look like The Cartoon Guide to Statistics

The Cartoon Guide to Statistics In this example, the author provides a step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc. The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.

An Introduction to Error Analysis A very readable text, but with enough math to be rigorous. The cover says it all: the book’s emphasis is on how statistics and error analysis matter in the everyday. Author John Taylor is known as “Mr. Wizard” at the Univ. of Colorado for his popular science lectures aimed at youngsters.

Bevington Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.

“Workshop Statistics” This book has a website full of data that it tells you how to analyze. The test cases are often pretty interesting, too. Many little shadow boxes provide info.

“Numerical Recipes” A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). It includes advanced topics like digital filtering, but it is harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, or filter practically anything.

Stuff you need to be able to look up Samples vs. Populations. Samples: the world as we understand it, based on science. Populations: the world as God understands it, based on omniscience. Statistics is not art but artifice: a bridge to help us understand phenomena, based on limited observations.

Our problem Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)? Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?

Sample View vs. Population View
                         Sample view                    Population view
Average                  x̄ = (1/n) Σ xᵢ                 μ
Variance                 s² = Σ (xᵢ − x̄)² / (n − 1)     σ²
Standard deviation       s                              σ
Standard error of mean   s/√n                           σ_x̄ = σ/√n

Sample View: direct, experimental, tangible The single most important thing about this is the reduction in standard deviation or standard error of the mean according to inverse root n. Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remembered nothing else from this lecture, it would be a success!
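
A minimal Python sketch of that inverse-root-n rule (not from the original slides; the population values μ = 100 and σ = 15 are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 100.0, 15.0                    # "God's" population values (invented)

for n in (9, 81, 729):
    sample = rng.normal(mu, sigma, size=n)
    s = sample.std(ddof=1)                 # sample standard deviation
    sem = s / np.sqrt(n)                   # standard error of the mean: s / sqrt(n)
    print(f"n = {n:3d}   mean = {sample.mean():6.2f}   s = {s:5.2f}   SEM = {sem:4.2f}")
```

Each 9-fold increase in n buys only about a 3-fold drop in the standard error.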

Population View: conceptual, layered with arcana! The purple equation in the table, σ_x̄ = σ/√n, is an expression of the central limit theorem. If we measure many averages, we do not always get the same average:

Huh? It means… if you want to estimate σ, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by n^1/2. This is hard work! A lot of times, σ is approximated by s. If you wanted to estimate the population average μ, the best you can do is to measure many averages and average those. A lot of times μ is approximated by x̄. IT’S HARD TO KNOW WHAT GOD DOES. I think the σ in the purple equation should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.

You got to compromise, fool! The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student,” and his t-distribution is known as Student’s t. The Student’s t distribution helps us assign confidence in our imperfect experiments on small samples. Input: desired confidence level, estimate of the population mean (or estimated probability), estimated error of the mean (or probability). Output: ± something.
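
A sketch of that input/output in Python (the five replicate values are invented; scipy supplies the t critical value):

```python
import numpy as np
from scipy import stats

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9])        # five measly replicate runs (invented)
n, xbar = len(data), data.mean()
sem = data.std(ddof=1) / np.sqrt(n)                  # estimated error of the mean

conf = 0.95                                          # desired confidence level
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # two-sided Student's t critical value
print(f"mean = {xbar:.2f} ± {t_crit * sem:.2f} at {conf:.0%} confidence")
```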

Probability …is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is given by summing each mass times the probability of that mass. The standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) and P is a probability density. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x − dx/2 and x + dx/2.
Discrete system: μ = Σ Fᵢ xᵢ, σ² = Σ Fᵢ (xᵢ − μ)²
Continuous system: μ = ∫ x P(x) dx, σ² = ∫ (x − μ)² P(x) dx
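
The discrete formulas in a few lines of Python (the masses and mole fractions are invented):

```python
import numpy as np

m = np.array([1.0e4, 2.0e4, 5.0e4])          # molecular masses, g/mol (invented)
F = np.array([0.2, 0.5, 0.3])                # normalized frequencies (mole fractions), summing to 1

mu = np.sum(F * m)                           # mean: each mass times its probability, summed
sigma = np.sqrt(np.sum(F * (m - mu) ** 2))   # standard deviation about that mean
print(f"mu = {mu:.0f} g/mol, sigma = {sigma:.0f} g/mol")
```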

Here’s a normal probability density distribution from “Workshop…”, where you use actual data to discover the μ and σ of results.

What it means Although you don’t usually know the distribution (either μ or σ), about 68% of your measurements will fall within ±1σ of μ… if the distribution is a “normal”, bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample size, with some average x̄ and standard deviation s (inferior to μ and σ, respectively), how far away do we think the true μ is?

Details No way I could do it better than “Cartoon…” or “Workshop…” Remember…this is the part of the lecture entitled “things you must be able to look up.”

Propagation of errors Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be L and W, and your standard deviations s_L and s_W. Now, you want to buy carpet, so you need the area A = L·W. What is the uncertainty in A due to the measurement errors in L and W? Answer! There is no telling… but you have several options to estimate it.

A = L·W example Here are your measured data (table of the 30 length and width measurements omitted). You can consider “most” and “least” cases: A_most = (L + s_L)(W + s_W) and A_least = (L − s_L)(W − s_W), and take the spread between them as a crude estimate of the uncertainty in A.

Another way We can use a formula for how σ propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then the variance in y follows this rule:
σ_y² = (∂y/∂t)² σ_t² + (∂y/∂s)² σ_s²
Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.

Translation In our case, where A = L·W:
σ_A² = W² σ_L² + L² σ_W²
Problem: we don’t know W, L, σ_L, or σ_W! These are population numbers we could only get if we had the entire planet measure this particular room. We therefore assume that our measurement set is large enough (n = 30) that we can use our measured averages for W and L and our standard deviations for σ_L and σ_W.
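
A Python sketch of this translation (the averages and standard deviations are invented), with a Monte Carlo cross-check of the propagation rule:

```python
import numpy as np

L, s_L = 5.00, 0.10    # average length and its standard deviation, m (invented)
W, s_W = 4.00, 0.08    # average width and its standard deviation, m (invented)

A = L * W
s_A = np.sqrt(W**2 * s_L**2 + L**2 * s_W**2)   # sigma_A from the propagation rule above

# Cross-check: simulate many normally distributed (L, W) pairs and look at the spread of A
rng = np.random.default_rng(1)
A_sim = rng.normal(L, s_L, 100_000) * rng.normal(W, s_W, 100_000)
print(f"A = {A:.2f} ± {s_A:.2f} m²   (Monte Carlo spread: {A_sim.std():.2f})")
```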

Error propagation caveats The propagation equation assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good a priori knowledge about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then σ_N = N^1/2. Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.

Stuff you must know how to do on a daily basis Plot!!! (Plot of data with fitted line; r and r² are given on the graph.) r² = the % of the trend that can be explained by the fitted relation. Intercept = ± 45 (i.e., zero!)

The same data (How to find this file!) r = 0.444, r² = 0.20. Only 20% of the variability in the data can be explained by the line! While Γ depended on q², D_app does not!

What does the famous “r²” really tell us? Suppose you invented a new polymer that you hoped was more stable over time than its predecessor… So you check melting point as a function of time (table of time vs. melting point omitted).

(Same time vs. melting point data.) Question: What describes the data better: a simple average (meaning things aren’t really changing over time: it is stable) OR a trend (meaning melting point might be dropping over time)?

How well does the mean describe the data? The differences between the data and the mean are called ‘residuals.’ The sum of the squares of all the residuals, S_t = Σ (yᵢ − ȳ)² (= …), characterizes how well the data fit the mean.

How much better is a fit (i.e., a regression in this case)? The regression also has residuals. The sum of their squares, S_r = Σ (yᵢ − ŷᵢ)² (= …), is smaller than S_t.

The r² value simply compares the fit to the mean, by comparing the sums of the squares: r² = (S_t − S_r) / S_t. In our case, the fit was NOT a dramatic improvement, explaining only 7.9% of the variability of the data!
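
That comparison is a few lines of Python (the data here are invented; any fitted curve can supply y_fit):

```python
import numpy as np

def r_squared(y, y_fit):
    S_t = np.sum((y - y.mean()) ** 2)   # squared residuals about the mean
    S_r = np.sum((y - y_fit) ** 2)      # squared residuals about the fit
    return (S_t - S_r) / S_t            # fraction of the variability explained by the fit

# Invented melting-point-versus-time style data
x = np.arange(10.0)
y = 150.0 - 0.1 * x + np.random.default_rng(2).normal(0.0, 1.0, 10)
y_fit = np.polyval(np.polyfit(x, y, 1), x)
print(f"r² = {r_squared(y, y_fit):.3f}")
```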

Plot showing 95% confidence limits. Excel doesn’t excel at this!

Interpreting data: Life on the bleeding edge of cutting technology. Or is that bleating edge? The noise level in individual runs is much less than the run-to-run variation. That’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!

Correlation Caveat! Correlation ≠ Cause. No, Correlation = Association. 58% of life expectancy is associated with TVs. Would we save lives by sending TVs to Uganda? Excel does not automatically provide σ estimates!

Linearize it! Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do! Here, linearity is improved by plotting life expectancy vs. people per TV rather than TVs per person.

These 4 plots all have the same slopes, intercepts, and r values! Plots are pictures of science, worth thousands of words in boring tables.

From whence do those lines come? Least squares fitting.
“Linear fits”: the fitted coefficients appear linearly in the expression, e.g., y = a + bx + cx² + dx³. An analytical “best fit” exists!
“Nonlinear fits”: at least some of the fitted coefficients appear in transcendental arguments, e.g., y = a + b·e^(−cx) + d·cos(ex). Best fit found by trial & error. Beware false solutions! Try several initial guesses!
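
A sketch of both kinds of fit in Python (data invented; scipy’s curve_fit does the iterative trial-and-error and needs the initial guess p0):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 1.5 * np.exp(-0.4 * x) + rng.normal(0.0, 0.05, x.size)   # invented data

# Linear fit: coefficients enter linearly, so one matrix solve gives the analytical best fit
cubic = np.polyfit(x, y, 3)                # y = a + b x + c x^2 + d x^3

# Nonlinear fit: c sits inside the exponential, so iterate from an initial guess
def model(x, a, b, c):
    return a + b * np.exp(-c * x)

popt, pcov = curve_fit(model, x, y, p0=[1.0, 1.0, 0.1])   # try several p0 values to dodge false solutions
print("nonlinear best-fit parameters:", popt)
```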

CURVE FITTING: Fit the trend or fit the points? Earth’s mean annual temperature has natural fluctuations year to year. To capture a long-term trend, we don’t want to fit the points, so use a low-order polynomial regression.

BUT the bumps and jiggles in the U.S. population data are ‘real.’ We don’t want to lose them in a simple trend.

REGRESSION: We lost the baby boom! SINGLE POLYNOMIAL: Does funny things (see 1905). SPLINE: YES: Lots of individual polynomials give us a smooth fit (especially good for interpolation).
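
The contrast in Python, using approximate decade census counts (in millions):

```python
import numpy as np
from scipy.interpolate import CubicSpline

years = np.arange(1900, 2001, 10)
pop = np.array([76, 92, 106, 123, 132, 151, 179, 203, 227, 249, 281], float)  # approx. U.S. census, millions

t = (years - 1950.0) / 50.0      # scale x into [-1, 1] so the high-order fit stays well conditioned
poly = np.polyfit(t, pop, 9)     # one big polynomial: does funny things between the points
spline = CubicSpline(years, pop) # lots of individual cubics: smooth, and keeps the baby boom

x = 1905.0                       # compare the two interpolations at an off-census year
print(np.polyval(poly, (x - 1950.0) / 50.0), float(spline(x)))
```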

All data points are not created equal. Since that one point has so much error (or noise), should we really worry about minimizing its square? No. We should minimize “chi-squared”:
χ² = Σ [(yᵢ − f(xᵢ)) / σᵢ]²
The reduced chi-squared, χ²_ν = χ²/ν, where ν is the # of degrees of freedom (ν = n − # of parameters fitted), is a goodness-of-fit parameter that should be unity for a “fit within error.”
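
A Python sketch of that weighted figure of merit (data and σᵢ values invented):

```python
import numpy as np

def reduced_chi_squared(y, y_fit, sigma, n_params):
    chi = (y - y_fit) / sigma              # each residual weighted by its own uncertainty
    nu = len(y) - n_params                 # degrees of freedom: n minus # of fitted parameters
    return np.sum(chi**2) / nu             # should come out near 1 for a "fit within error"

# A straight-line fit where one point is far noisier than the rest (invented data)
x = np.arange(6.0)
y = np.array([0.1, 1.1, 2.0, 2.9, 9.0, 5.1])
sigma = np.array([0.1, 0.1, 0.1, 0.1, 5.0, 0.1])   # the wild point carries a huge error bar

coeffs = np.polyfit(x, y, 1, w=1.0 / sigma)        # weighted least squares downplays that point
print(reduced_chi_squared(y, np.polyval(coeffs, x), sigma, n_params=2))
```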

Why is a fit based on chi-squared so special? Based on χ (the signed residuals): these two curves fit equally well! Based on |χ| (absolute value): these three curves fit equally well! Based on max(χ): outliers exert too strong an influence!

χ² caveats Chi-squared lower than unity is meaningless… if you trust your σ² estimates in the first place. Fitting too many parameters will lower χ², but this may be just doing a better and better job of fitting the noise! A fit should go smoothly THROUGH the noise, not follow it! There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than χ². This is done when you have a priori information that the fitted line must be “smooth.”

Achtung! Warning! This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience….um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics would firm up your knowledge greatly. AND BUY THOSE BOOKS! YOU WILL NEED THEM!

Cool Excel/Origin Demo