Course in Statistics and Data analysis Course B, September 2009 Stephan Frickenhaus.


1 Course in Statistics and Data analysis Course B, September 2009 Stephan Frickenhaus

2 Outline: theses. My experience is: many young researchers lack knowledge of analysis tools, so producing or sampling data is not the problem, but analysing it becomes a problem right before publication. Even once appropriate tools are known (and Excel is not appropriate for analysis), knowledge of methods and concepts may still be missing. This course tries to tackle both …

3 Schedule
– Day 1: 8.9., 10:00 - 16:00, Room E4005: the probability distribution, the p-value concept, statistical tests in R
– Day 2: 9.9., 10:00 - 16:00, Room E4005: multivariate analysis, correlation tests, ANOVA, ordination with factors and environmental data, cluster analysis (maybe as start of Day 3)
– Day 3: 10.9., 10:00 - 16:00, Glaskasten F: user-driven and interactive, bring your project data and we work on it

4 Contents / Setup
Tool-based course (program „R“)
– Install „R“ from www.r-project.org
Exploring data analysis
– Graphically
– Numerically
Exploring what significance really is
– Statistical tests no longer as black boxes

5 DAY1 – Lecture part I
With each type of data we have different methods of analysis. Give examples!
Type of data, with examples:
– Numerical (metric) data: linear (length in cm), circular (angle in degrees)
– Nominal (class) data: sex, colour, species
– Ordinal (ranked) data: age group, school class, phase in cell division

6 First steps from data …
– Metric (linear: length in cm; circular: angle in degrees): plot in a co-ordinate system (scatter plot), histogram, boxplot
– Nominal (sex, colour, species): count in a table, barplot, pie chart
– Ordinal (age group, school class, phase in cell division): count in a table, with an axis, barplot
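The first steps listed on this slide could look roughly as follows in R. This is only a sketch: the vectors `length_cm`, `species` and `age_group` are invented example data, not taken from the course.

```r
# Hypothetical example data, invented for illustration only
length_cm <- c(12.1, 13.4, 11.8, 15.0, 14.2, 12.9)            # metric
species   <- factor(c("A", "B", "A", "C", "B", "A"))          # nominal
age_group <- factor(c("young", "old", "adult", "young", "adult", "young"),
                    levels = c("young", "adult", "old"),
                    ordered = TRUE)                             # ordinal

# Metric data: numerical summary, histogram, boxplot
summary(length_cm)
hist(length_cm)
boxplot(length_cm)

# Nominal data: count in a table, then barplot or pie chart
species_counts <- table(species)
barplot(species_counts)
pie(species_counts)

# Ordinal data: the table keeps the level order along the axis
age_counts <- table(age_group)
barplot(age_counts)
```

Declaring the ordinal variable as an ordered factor is what keeps "young, adult, old" in a meaningful order on the axis instead of alphabetical order.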

7 … to methods
– Metric: plot in a co-ordinate system (scatter plot), histogram, boxplot; then check for groups, trends, correlations
– Nominal: count in a table, barplot, pie chart; then check for differences, ratios
– Ordinal: count in a table, with an axis, barplot; then check for differences, ratios, relation to order

8 … to combinations of data: X-Y plots [Grid of data-type combinations: metric, nominal, ordinal against metric] X-Y plot with colors = class; class = color in scatter plot. Check for groups/clusters.

9 … towards models: multivariate data
– Organize data in tables
– Keep data of the same measurement in ONE row
– Distinguish groups in an extra column by nominal data
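In R, such a table is typically a data frame. The sketch below assumes invented column names (`length_cm`, `weight_g`, `site`) and invented values, purely to illustrate the one-row-per-measurement layout with a nominal grouping column.

```r
# Hypothetical table: one row per measured individual,
# the group is stored in an extra nominal column (a factor)
measurements <- data.frame(
  length_cm = c(12.1, 13.4, 11.8, 15.0),
  weight_g  = c(30.5, 41.0, 28.7, 52.3),
  site      = factor(c("north", "north", "south", "south"))
)

str(measurements)

# Mean length per group, using the nominal column to split the data
group_means <- tapply(measurements$length_cm, measurements$site, mean)
group_means

# X-Y plot of two metric columns, with class shown as color
plot(measurements$length_cm, measurements$weight_g,
     col = as.integer(measurements$site), pch = 19)
```

Because the group lives in its own column, the same table can be split, colored or tested by group without rearranging the data.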

10 Before discussing what we can do with such a table, let's take some first steps in the tool R!

11 Start Practice with R www.r-project.org http://ftp5.gwdg.de/pub/misc/cran/

12 Lecture part II
What if the summary of the data is not enough? E.g., we want to say whether an observed mean value is probably greater than 0.5. It is not enough to conclude „We clearly find mean(x)<mean(y)“, because this may be an outcome of small sample sizes: in reality the means may be equal, and there may be no effect at all. We must define some terms to learn how to be more quantitative about such statements, like „with 1% error we can exclude that x and y are from the same population“.

13 Some terms …
Population:
– All individuals of the kind measured
– If we measure them all, we know exactly the mean value etc., the true mean
– Sometimes we do not have it accessible
– Sometimes we think it has infinitely many individuals
Sample:
– A subset of individuals from a population
– It has, e.g., a sample mean that is not equal to the true mean (the mean of the population)
– Sample size: number of individuals picked

14 … more terms, for real-valued variables X
– Probability density function p(x): integrated over the interval [a, b], it gives the probability to pick a sample x_i from X in [a, b]
– Cumulative distribution function cdf(a): the probability to pick an x below a

15 p(x), the probability density function [Figure: p(x) plotted against x, with the interval from a to b marked] p(x) ≥ 0. Need not be symmetric! The full range of X makes 100%.

16 Cumulative distribution function [Figure: cdf(x) plotted against x, rising from 0 at min(X) to 1 at max(X)] The cdf starts from 0 at the minimal possible value of X and reaches 1 at the maximal possible value of X, where p drops to 0. The cdf is monotonically increasing, because it integrates a p ≥ 0.
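The relation between p(x) and cdf(x) on the last three slides can be checked numerically in R. The sketch below assumes a standard normal distribution and the interval limits a = -1, b = 2; both are arbitrary choices for illustration.

```r
# Assumed example: standard normal density, interval [a, b] = [-1, 2]
a <- -1
b <- 2

# Probability of picking x in [a, b]: integrate the density p(x)
area <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
via_cdf <- pnorm(b) - pnorm(a)

# The full range of X makes 100%
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

c(area = area, via_cdf = via_cdf, total = total)

# The cdf never decreases, because it integrates a density p >= 0
grid <- seq(-4, 4, by = 0.5)
all(diff(pnorm(grid)) >= 0)
```

In R, the prefix `d` gives the density and `p` the cumulative distribution function of a distribution, so `dnorm`/`pnorm` here play the roles of p(x) and cdf(x).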

17 Mean E and standard deviation S [Figure: p(x) plotted against x, with E(X) and S(X) marked] E(X) need not be at the maximum of p(x). S(X) measures the width of p(x), i.e., the scatter of x around E(X).
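A small numerical illustration of why E(X) need not sit at the maximum of p(x): the sketch below assumes a gamma distribution with shape 2 and rate 1 as an example of an asymmetric density. That choice is not from the slides.

```r
# Assumed example density: gamma with shape = 2, rate = 1 (asymmetric)
p <- function(x) dgamma(x, shape = 2, rate = 1)

# E(X): integral of x * p(x) over the full range
mean_x <- integrate(function(x) x * p(x), lower = 0, upper = Inf)$value

# S(X): square root of the integral of (x - E)^2 * p(x)
var_x <- integrate(function(x) (x - mean_x)^2 * p(x),
                   lower = 0, upper = Inf)$value
sd_x <- sqrt(var_x)

# Position of the maximum of p(x), found numerically
mode_x <- optimize(p, interval = c(0, 10), maximum = TRUE)$maximum

c(mean = mean_x, sd = sd_x, mode = mode_x)
```

For this density the maximum lies near x = 1 while the mean is 2, so the mean is pulled towards the longer right-hand side.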

18 Long-tail distributions [Figure: p(x) plotted against x] Some rare samples will have very large values of x! When we have few samples, we may pick none of these rare values.
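How likely it is to miss the rare large values can be sketched in R. The example assumes a log-normal distribution as the long-tailed population, a sample size of 10, and "rare" meaning the top 1% of values; all three are illustrative assumptions.

```r
set.seed(1)

n <- 10                       # assumed small sample size
threshold <- qlnorm(0.99)     # value exceeded by only 1% of the population

# Exact chance that a sample of size n contains no value above the threshold
p_none <- plnorm(threshold)^n

# The same chance estimated by repeated sampling
p_none_sim <- mean(replicate(10000, max(rlnorm(n)) < threshold))

c(exact = p_none, simulated = p_none_sim)
```

Under these assumptions roughly nine out of ten small samples contain none of the rare large values, which is why sample means from long-tailed data can be misleadingly low.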

19 What is a statistical test?
Example: We have a sample x of size 6. How probable is it that the mean of the sample x is between 2 and 2.5, although E(X)=0? To answer this, either:
– 1) we repeat taking samples of size 6 many times and count how often the mean falls in that range, or
– 2) we make an assumption about the probability density of X and then integrate a statistics distribution of mean(x) to measure Pr(2<mean(x)<2.5).
Option 1 may be too expensive.
LATER: Can I check what the pdf of X is?
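Both routes can be tried side by side in R. The sketch below assumes that X is normally distributed with mean 0 and a known standard deviation of 2; the slide does not state the spread of X, so `sigma` is a made-up value chosen so that the event is not vanishingly rare.

```r
set.seed(1)

n     <- 6     # sample size from the slide
sigma <- 2     # ASSUMPTION: sd of X, not given on the slide
n_rep <- 100000

# Route 1: repeat the sampling many times and count
means <- replicate(n_rep, mean(rnorm(n, mean = 0, sd = sigma)))
p_sim <- mean(means > 2 & means < 2.5)

# Route 2: use the assumed pdf of X; the mean of n normal values
# is again normal, with standard deviation sigma / sqrt(n)
se <- sigma / sqrt(n)
p_exact <- pnorm(2.5, mean = 0, sd = se) - pnorm(2, mean = 0, sd = se)

c(simulated = p_sim, integrated = p_exact)
```

On a computer the repetition is cheap; in the lab, repeating a real sampling campaign 100000 times is not, which is why route 2 is the one used in practice.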

20 … influence of sample size on the mean
– Repeat a sampling from X with sd(X)=1.0 at different sizes N
– Take the sample means
– How do the repeated means vary (standard deviation)?
Result: for high N, sd(mean) goes like sd(X)/sqrt(N) (central limit theorem). And for low N? It is given by the t-statistic t = mean(x)/(sd(x)/sqrt(N)), which depends on the sample size N.
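The experiment described on this slide can be run directly. The sketch assumes a standard normal X (so sd(X) = 1.0 as on the slide) and an arbitrary set of sample sizes.

```r
set.seed(42)

sizes <- c(2, 5, 10, 50, 200)   # assumed sample sizes N to compare
n_rep <- 20000                  # number of repeated samplings per N

# For each N: draw many samples, take their means, measure the spread
sd_of_means <- sapply(sizes, function(n) {
  sd(replicate(n_rep, mean(rnorm(n, mean = 0, sd = 1))))
})

# Compare with the expected 1 / sqrt(N) behaviour
cbind(N = sizes, simulated = sd_of_means, theory = 1 / sqrt(sizes))
```

The spread of the sample mean shrinks only with the square root of N, so four times as many samples are needed to halve it.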

21 A first test: test the influence of sample size
How do I know how many samples I need to make a correct statement about the mean, like E(X)≥0.89? „Correct“ is to be quantified as the „type-I error“: how probable is it that I see the same or a more extreme value by chance alone, i.e., although the population mean is 0?
Concept of the null hypothesis: how sure can I be in excluding that the population mean is zero, even when I find a sample mean of m=0.89? So we evaluate how probable such an outcome is when X follows a certain pdf(X), e.g., a normal distribution, with E(X)=0. To evaluate this Pr, we need a test statistic t for it and a distribution pdf(t) to integrate for Pr.

22 T-statistics
T has a complicated mathematical form; its graph is similar to a bell-shaped curve. For small sample sizes N it has longer tails (green). [Figure: density of T; blue area = Pr(T<3), remaining tail = Pr(T>=3)]
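The longer tails for small samples can be seen by comparing tail areas in R. The degrees of freedom chosen below are arbitrary examples; `pt`, `dt`, `pnorm` and `dnorm` are the standard R functions for the t and normal distributions.

```r
# Tail area Pr(T >= 3) for several degrees of freedom (df = N - 1)
dfs <- c(1, 2, 5, 30)
tail_t <- 1 - pt(3, df = dfs)

# The same tail area under the normal bell curve
tail_normal <- 1 - pnorm(3)

cbind(df = dfs, tail_t = tail_t, tail_normal = tail_normal)

# Visual comparison: normal density versus t density with df = 1
curve(dnorm(x), from = -5, to = 5, ylab = "density")
curve(dt(x, df = 1), from = -5, to = 5, add = TRUE, col = "green")
```

With df = 1 about 10% of the area lies beyond 3, compared with roughly 0.1% for the normal curve; as df grows, the t tail shrinks towards the normal one.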

23 T is known in R
Test for the sample x=c(1,2): Pr(t<3), for n=2. Why the upper boundary 3? t = mean(x)/sd(x)*sqrt(2) = 1.5/0.707*1.414 = 3.0. The degrees of freedom df are the sample size minus 1. So about 90 out of 100 repeated samples will give a mean below 1.5. 1-pt(3,df=1) = 0.1024164 is the chance of a mean(x) greater than 1.5 (remember, N=2), under the assumption that x is drawn from a population with mean 0!
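The numbers on this slide can be reproduced step by step. The variable names below are chosen for illustration; the functions `mean`, `sd` and `pt` are standard R.

```r
x <- c(1, 2)        # the sample from the slide
n <- length(x)      # sample size N = 2

# t-statistic under the assumption of a population mean of 0
t_value <- mean(x) / (sd(x) / sqrt(n))

# Pr(t < 3) and Pr(t >= 3) with df = N - 1
p_below <- pt(t_value, df = n - 1)
p_above <- 1 - p_below

c(t = t_value, p_below = p_below, p_above = p_above)
```

Here `sd(x)` is the sample standard deviation (about 0.707), so the statistic comes out as 3 and the upper tail as about 0.10.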

24 Now the test itself: we have a sample of size 2. The null hypothesis: our sample is from a population with mean 0. The test that checks this is in R … [Screenshot of R output, annotated „Ignore this 0“]
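The transcript does not show which R call the slide used. A likely candidate is the one-sample `t.test`, so the sketch below is an assumption about the intended command rather than a copy of the slide.

```r
x <- c(1, 2)   # sample of size 2, as on the previous slide

# ASSUMPTION: one-sample t-test against the null hypothesis mean = 0
res_two <- t.test(x, mu = 0)                           # two-sided
res_one <- t.test(x, mu = 0, alternative = "greater")  # one-sided

res_two$statistic   # the t value
res_two$parameter   # degrees of freedom, N - 1
res_two$p.value     # two-sided p-value
res_one$p.value     # one-sided p-value, equal to 1 - pt(3, df = 1)
```

By default `t.test` is two-sided, so its p-value is twice the one-sided tail area computed by hand with `pt`.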

