Data Handling/Statistics There is no substitute for books: you need professional help! My personal favorites, from which this lecture is drawn: The Cartoon Guide to Statistics, L. Gonick & W. Smith; Data Reduction and Error Analysis for the Physical Sciences, P. R. Bevington; Workshop Statistics, A. J. Rossman & B. L. Chance; Numerical Recipes, W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling; Origin 6.1 Users Manual, MicroCal Corporation.

Outline Our motto; What those books look like; Stuff you need to be able to look up: Samples & Populations; Mean, Standard Deviation, Standard Error; Probability; Random Variables; Propagation of Errors. Stuff you must be able to do on a daily basis: Plot, Fit, Interpret.

Our Motto That which can be taught can be learned. The “progress” of civilization relies on being able to do more and more things while thinking less and less about them. An opposing, non-CMC IGERT viewpoint

What those books look like The Cartoon Guide to Statistics

The Cartoon Guide to Statistics In this example, the author provides step-by-step analysis of the statistics of a poll. Similar logic and style tell you how to tell two populations apart, whether your measly five replicate runs truly represent the situation, etc. The Cartoon Guide gives an enjoyable account of statistics in scientific and everyday life.

An Introduction to Error Analysis A very readable text, but with enough math to be rigorous. The cover says it all – the book’s emphasis is how statistics and error analysis are important in the everyday. Author John Taylor is known as “Mr. Wizard” at Univ. of Colorado, for his popular science lectures aimed at youngsters.

Bevington Bevington is really good at introducing basic concepts, along with simple code that really, really works. Our lab uses a lot of Bevington code, often translated from Fortran to Visual Basic.

“Workshop Statistics” This book has a website full of data that it tells you how to analyze. The test cases are often pretty interesting, too. Many little shadow boxes provide info.

“Numerical Recipes” A more modern and thicker version of Bevington. Code comes in Fortran, C, Basic (others?). Includes advanced topics like digital filtering, but harder to read on the simpler things. With this plus Bevington and a lot of time, you can fit, smooth, filter practically anything.

Stuff you need to be able to look up: Samples vs. Populations. Sample: the world as we understand it, based on science. Population: the world as God understands it, based on omniscience. Statistics is not art but artifice, a bridge to help us understand phenomena, based on limited observations.

Our problem Sitting behind the target, can we say with some specific level of confidence whether a circle drawn around this single arrow (a measurement) hits the bullseye (the population mean)? Measuring a molecular weight by one Zimm plot, can we say with any certainty that we have obtained the same answer God would have gotten?

Sample View vs. Population View, compared row by row: Average, Variance, Standard deviation, Standard error of mean.
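The comparison can be written out in standard textbook notation. This is a sketch, assuming the usual definitions (n − 1 in the sample variance, and the 1/√n relation for the standard error that the following slides discuss); the subscripted symbols are conventional choices, not necessarily the ones on the original slide.

```latex
\begin{array}{lll}
 & \text{Sample view} & \text{Population view} \\
\text{Average} & \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i & \mu \\
\text{Variance} & s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 & \sigma^2 \\
\text{Standard deviation} & s = \sqrt{s^2} & \sigma \\
\text{Standard error of mean} & s_{\bar{x}} = \frac{s}{\sqrt{n}} & \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
\end{array}
```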

Sample View: direct, experimental, tangible The single most important thing about this is the reduction in the standard deviation of the mean (the standard error of the mean) according to inverse root n. Three times better takes 9 times longer (or costs 9 times more, or takes 9 times more disk space). If you remembered nothing else from this lecture, it would be a success!
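The inverse-root-n rule can be sketched in a few lines of Python. The function names and the replicate values below are made up for illustration; only the s/√n relation itself comes from the lecture.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample standard deviation over sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

def runs_needed(current_n, improvement):
    """Data points needed to shrink the standard error by the factor `improvement`."""
    return current_n * improvement ** 2

# Hypothetical replicate measurements (illustrative numbers, not from the lecture)
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2]
print(standard_error(data))
print(runs_needed(len(data), 3))  # three times better costs nine times the data
```

The quadratic cost is the whole point: the scatter of individual measurements does not shrink with n, only the uncertainty of their average does.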

Population View: conceptual, layered with arcana! The purple equation in the table is an expression of the central limit theorem. If we measure many averages, we do not always get the same average:

Huh? It means…if you want to estimate σ, which only God really knows, you should measure many averages, each involving n data points, figure their standard deviation, and multiply by n^(1/2). This is hard work! A lot of times, σ is approximated by s. If you wanted to estimate the population average μ, the best you can do is to measure many averages and average those. A lot of times μ is approximated by x̄. IT’S HARD TO KNOW WHAT GOD DOES. I think the σ in the purple equation should be an s, but the equation only works in the limit of large n anyhow, so there is no difference.

You got to compromise, fool! The t-distribution was invented by a statistician named Gosset, who was forced by his employer (the Guinness brewery!) to publish under a pseudonym. He chose “Student” and his t-distribution is known as Student’s t. Student’s t distribution helps us assign confidence in our imperfect experiments on small samples. Input: desired confidence level, estimate of population mean (or estimated probability), estimated error of the mean (or probability). Output: ± something
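A minimal sketch of that input/output recipe, assuming a 95% confidence level and a small hard-coded table of two-sided critical t values (copied from a standard t-table; look up other levels in any of the books). The function name and the five replicate values are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def confidence_interval_95(values):
    """Return (mean, half_width); the interval is mean ± half_width."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    half_width = T_95[n - 1] * std_err  # n - 1 degrees of freedom
    return mean, half_width

# Five hypothetical replicate runs (illustrative numbers only)
runs = [4.9, 5.1, 5.0, 5.3, 4.7]
mean, half_width = confidence_interval_95(runs)
print(f"{mean:.2f} ± {half_width:.2f}")
```

With only five runs the multiplier is 2.776 rather than the familiar value near 2, which is the price of a small sample.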

Probability …is another arcane concept in the “population” category: something we would like to know but cannot. As a concept, it’s wonderful. The true mean of a distribution of mass is given as the probability of each mass times that mass, summed over all masses. The standard deviation follows a similarly simple rule. In what follows, F means a normalized frequency (think mole fraction!) and P is a probability density. P(x)dx represents the fraction of things (think molecules) with property x (think mass) between x−dx/2 and x+dx/2. The rules are written one way for a discrete system and another for a continuous system.
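In standard notation, the discrete and continuous rules can be written as follows. This is a sketch using the F and P defined above, with μ and σ for the population mean and standard deviation as elsewhere in the lecture; the index i and the integration limits are conventional assumptions.

```latex
\text{Discrete system:}\quad
\sum_i F_i = 1, \qquad
\mu = \sum_i F_i\, x_i, \qquad
\sigma^2 = \sum_i F_i\, (x_i - \mu)^2

\text{Continuous system:}\quad
\int_{-\infty}^{\infty} P(x)\, dx = 1, \qquad
\mu = \int_{-\infty}^{\infty} x\, P(x)\, dx, \qquad
\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2\, P(x)\, dx
```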

Here’s a normal probability density distribution from “Workshop…”, where you use actual data to discover it. (Figure: bell curve marked with the percentage of results falling in each region.)

What it means Although you don’t usually know the distribution (either μ or σ), about 68% of your measurements will fall within ±1σ of μ….if the distribution is a “normal”, bell-shaped curve. t-tests allow you to kinda play this backwards: given a finite sample size, with some average, x̄, and standard deviation, s (inferior to μ and σ respectively), how far away do we think the true μ is?
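The 68% figure can be checked with the standard library's `statistics.NormalDist`. This is a small sketch; the helper name `fraction_within` is made up here.

```python
from statistics import NormalDist

def fraction_within(k_sigma):
    """Fraction of a normal population within ±k_sigma standard deviations of the mean."""
    standard_normal = NormalDist(mu=0.0, sigma=1.0)
    return standard_normal.cdf(k_sigma) - standard_normal.cdf(-k_sigma)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {fraction_within(k):.4f}")
```

These fractions hold only for a truly normal population with known μ and σ; with a small sample and only x̄ and s in hand, Student's t takes over.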

Details No way I could do it better than “Cartoon…” or “Workshop…” Remember…this is the part of the lecture entitled “things you must be able to look up.”

Propagation of errors Suppose you give 30 people a ruler and ask them to measure the length and width of a room. Owing to general incompetence, otherwise known as human nature, you will get not one answer but many. Your averages will be L and W, with standard deviations s_L and s_W. Now, you want to buy carpet, so you need the area A = L·W. What is the uncertainty in A due to the measurement errors in L and W? Answer! There is no telling….but you have several options to estimate it.

A = L·W example Here are your measured data: You can consider “most” and “least” cases:

Another way We can use a formula for how σ propagates. Suppose some function y (think area) depends on two measured quantities t and s (think length & width). Then, for independent errors in t and s, the variance in y follows this rule: $\sigma_y^2 = \left(\frac{\partial y}{\partial t}\right)^2 \sigma_t^2 + \left(\frac{\partial y}{\partial s}\right)^2 \sigma_s^2$. Aren’t you glad you took partial differential equations? What??!! You didn’t? Well, sign up. PDE is the bare minimum math for scientists.

Translation in our case, where A = L·W: $\sigma_A^2 = W^2 \sigma_L^2 + L^2 \sigma_W^2$. Problem: we don’t know W, L, σ_L or σ_W! These are population numbers we could only get if we had the entire planet measure this particular room. We therefore assume that our measurement set is large enough (n=30) that we can use our measured averages for W and L and our standard deviations for σ_L and σ_W.
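Both estimates for the A = L·W case, the “most”/“least” bracket and the propagated variance, can be sketched as below. The room dimensions and standard deviations are hypothetical, since the lecture's measured data are not reproduced here, and the propagation function assumes independent errors in L and W.

```python
import math

def area_uncertainty(length, width, s_length, s_width):
    """Propagated standard deviation of A = L*W, assuming independent errors."""
    # dA/dL = W and dA/dW = L, so var(A) = W^2 * s_L^2 + L^2 * s_W^2
    return math.sqrt((width * s_length) ** 2 + (length * s_width) ** 2)

def most_and_least(length, width, s_length, s_width):
    """Crude bracket: the areas if both errors push in the same direction."""
    most = (length + s_length) * (width + s_width)
    least = (length - s_length) * (width - s_width)
    return most, least

# Hypothetical room in meters (illustrative numbers, not the lecture's data)
L, W, s_L, s_W = 5.0, 4.0, 0.1, 0.1
print(L * W, "±", area_uncertainty(L, W, s_L, s_W))
print(most_and_least(L, W, s_L, s_W))
```

The bracket is wider than the propagated estimate because it assumes both measurements err in the same direction at once, which independent errors rarely do.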

Error propagation caveats The propagation equation assumes normal behavior. Large systematic errors (for example, 3 euroguys who report their values in metric units) are not taken into consideration properly. In many cases, there will be good knowledge a priori about the uncertainty in one or more parameters: in photon counting, if N is the number of photons detected, then σ_N = N^(1/2). Systematic error is not included in this estimate, so photon folk are well advised to just repeat experiments to determine real standard deviations that do take systematic errors into account.

Stuff you must know how to do on a daily basis Plot!!! r = 0.99987, r² = 0.9997: 99.97% of the trend can be explained by the fitted relation. Intercept = 0.003 ± 45 (i.e., zero!)

The same data r = 0.444, r² = 0.20: only 20% of the variation in the data can be explained by the line! While Γ depended on q², D_app does not!

What does the famous “r²” really tell us? Suppose you invented a new polymer that you hoped was more stable over time than its predecessor… So you check.
time            2      4      8      12     16     24     36     48
melting point   110.2  110.9  108.8  109.1  109.0  108.5  110.0  109.2

Question: What describes the data better: a simple average (meaning things aren’t really changing over time: it is stable) OR a trend (meaning melting point might be dropping over time)?

How well does the mean describe the data? The deviations of the points from the mean are called ‘residuals.’ The sum of the squares of all the residuals, S_t, characterizes how well the mean describes the data (S_t = 4.6788).

How much better is a fit (i.e., a regression in this case)? The regression also has residuals. The sum of their squares, S_r, is smaller than S_t (S_r = 4.3079).

The r² value simply compares the fit to the mean, by comparing the sums of the squares: $r^2 = (S_t - S_r)/S_t = (4.6788 - 4.3079)/4.6788 = 0.079$. In our case, the fit was NOT a dramatic improvement, explaining only 7.9% of the variability of the data!
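The two sums of squares and the resulting r² can be reproduced from the melting-point data with a short straight-line least-squares calculation. This is a sketch in plain Python; the function and variable names are chosen here for illustration, with `s_t` and `s_r` standing for the sums of squared residuals about the mean and about the fitted line.

```python
# Data from the melting-point example
time_points = [2, 4, 8, 12, 16, 24, 36, 48]
melting_point = [110.2, 110.9, 108.8, 109.1, 109.0, 108.5, 110.0, 109.2]

def r_squared(x, y):
    """Return (s_t, s_r, r2) for a straight-line least-squares fit of y on x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squared residuals about the mean
    s_t = sum((yi - y_mean) ** 2 for yi in y)
    # Analytical least-squares slope and intercept
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    # Sum of squared residuals about the fitted line
    s_r = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return s_t, s_r, (s_t - s_r) / s_t

s_t, s_r, r2 = r_squared(time_points, melting_point)
print(f"S_t = {s_t:.4f}, S_r = {s_r:.4f}, r^2 = {r2:.3f}")
```

The fitted slope is slightly negative, but the line removes so little of the scatter that the simple average describes these data nearly as well.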

Plot showing 95% confidence limits. Excel doesn’t excel at this!

Interpreting data: Life on the bleeding edge of cutting technology. Or is that bleating edge? The noise level in individual runs is much less than the run-to-run variation. That’s why many runs are a good idea. More would be good here, but we are still overcoming the shock that we can do this at all!

Correlation Caveat! Correlation ≠ Cause. No, Correlation = Association. 58% of life expectancy is associated with TVs. Would we save lives by sending TVs to Uganda? Excel does not automatically provide ± estimates!

Linearize it! Observant scientists are adept at seeing curvature. Train your eye by looking for defects in wallpaper, door trim, lumber bought at Home Depot, etc. And try to straighten out your data, rather than let the computer fit a nonlinear form, which it is quite happy to do! Linearity is improved by plotting Life vs. people per TV rather than TVs per person.

These 4 plots all have the same slopes, intercepts and r values! Plots are pictures of science, worth thousands of words in boring tables.

From whence do those lines come? Least squares fitting. “Linear fits”: the fitted coefficients appear only linearly in the expression, e.g., y = a + bx + cx² + dx³. An analytical “best fit” exists! “Nonlinear fits”: at least some of the fitted coefficients appear in transcendental arguments, e.g., y = a + b·e^(−cx) + d·cos(ex). Best fit found by trial & error. Beware false solutions! Try several initial guesses!
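The difference between the two kinds of coefficient can be sketched with a one-term model, y = a·exp(−cx), chosen here only for illustration. For any fixed c the linear coefficient a has an analytical least-squares solution, while c itself has to be found by trial and error, here a simple scan over trial values. The data are synthetic and noise-free, and all names are hypothetical.

```python
import math

def best_amplitude(x, y, c):
    """For fixed c, the least-squares a in y = a*exp(-c*x) is analytical."""
    basis = [math.exp(-c * xi) for xi in x]
    return sum(yi * fi for yi, fi in zip(y, basis)) / sum(fi * fi for fi in basis)

def sum_sq_residuals(x, y, a, c):
    """Sum of squared residuals of the model a*exp(-c*x) against the data."""
    return sum((yi - a * math.exp(-c * xi)) ** 2 for xi, yi in zip(x, y))

def scan_fit(x, y, c_trials):
    """Trial and error over c; returns (sum of squares, a, c) for the best trial."""
    results = []
    for c in c_trials:
        a = best_amplitude(x, y, c)
        results.append((sum_sq_residuals(x, y, a, c), a, c))
    return min(results)

# Synthetic, noise-free data generated from a = 3, c = 0.5 (illustrative only)
x = list(range(10))
y = [3.0 * math.exp(-0.5 * xi) for xi in x]
c_trials = [i / 20 for i in range(1, 41)]  # trial values 0.05, 0.10, ..., 2.00
ssr, a_fit, c_fit = scan_fit(x, y, c_trials)
print(a_fit, c_fit, ssr)
```

A scan over the whole plausible range is one blunt way of “trying several initial guesses”; iterative fitters that start from a single guess can settle into a false minimum, especially for oscillating terms like cos(ex).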

CURVE FITTING: Fit the trend or fit the points? Earth’s mean annual temp has natural fluctuations year to year. To capture a long term trend, we don’t want to fit the points, so use a low-order polynomial regression.

BUT, The bumps and jiggles in the U.S. population data are ‘real.’ We don’t want to lose them in a simple trend.

REGRESSION: We lost the baby boom! SINGLE POLYNOMIAL: Does funny things (see 1905). SPLINE: YES: Lots of individual polynomials give us a smooth fit (especially good for interpolation).

All data points are not created equal. Since that one point has so much error (or noise) should we really worry about minimizing its square? No. We should minimize “chisquared”: $\chi^2 = \frac{1}{\nu} \sum_i \frac{(y_i - y_{\mathrm{fit},i})^2}{\sigma_i^2}$, where ν is the # of degrees of freedom, ν = n − # of parameters fitted. χ² is a goodness-of-fit parameter that should be unity for a “fit within error.”
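A minimal sketch of the weighted goodness-of-fit calculation, assuming the per-degree-of-freedom form (the one that “should be unity”). The observed values, fitted values and uncertainties are hypothetical numbers picked for illustration.

```python
def reduced_chi_squared(y_obs, y_fit, sigma, n_params):
    """Sum of ((y - fit)/sigma)^2 divided by the degrees of freedom (n - n_params)."""
    degrees_of_freedom = len(y_obs) - n_params
    chi_sq = sum(((yo - yf) / s) ** 2 for yo, yf, s in zip(y_obs, y_fit, sigma))
    return chi_sq / degrees_of_freedom

# Hypothetical data and a straight-line fit (two parameters); illustrative only
y_obs = [1.1, 1.9, 3.2, 3.9, 5.0]
y_fit = [1.0, 2.0, 3.0, 4.0, 5.0]
sigma = [0.1, 0.1, 0.2, 0.1, 0.5]  # the last point is noisy, so it counts for less
print(reduced_chi_squared(y_obs, y_fit, sigma, n_params=2))
```

Dividing each residual by its own σ is what keeps a single noisy point from dominating the fit.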

Why is a fit based on chisquared so special? Based on chi: these two curves fit equally well! Based on |chi| (absolute value): these three curves fit equally well! Based on max(chi): outliers exert too strong an influence!

χ² caveats Chi-square lower than unity is meaningless…if you trust your σ² estimates in the first place. Fitting too many parameters will lower χ² but this may be just doing a better and better job of fitting the noise! A fit should go smoothly THROUGH the noise, not follow it! There is such a thing as enforcing a “parsimonious” fit by minimizing a quantity a bit more complicated than χ². This is done when you have a-priori information that the fitted line must be “smooth”.

Achtung! Warning! This lecture is an example of a very dangerous phenomenon: “what you need to know.” Before you were born, I took a statistics course somewhere in undergraduate school. Most of this stuff I learned from experience….um… experiments. A proper math course, or a course from LSU’s Department of Experimental Statistics would firm up your knowledge greatly. AND BUY THOSE BOOKS! YOU WILL NEED THEM!

Cool Excel/Origin Demo
