Introduction to Statistics Alastair Kerr, PhD

Think about these statements (discuss at end)
Paraphrased from real conversations:
– “We used a t-test to compare our samples”
– “These genes are the most highly expressed in my experiment: this must be significant”
– “No significant difference between these samples therefore the samples are the same”
– “Yes I have replicates, I ran the same sample 3 times”
– “We ignored those points, they are obviously wrong!”
– “X and Y are related as the p-value is 1e-168!”
– “I need you to show this data is significant”

Basic Probability
Which of these sequences of numbers is random? (outcomes 0 or 1, unsorted data)

Binomial Distribution
Thought experiment: everyone flip a coin 10 times and count the number of ‘heads’
– Most frequent observation? Least?
– Pattern of observations between these?
How would these factors affect the graph shape?
– Using a die instead of a coin and looking for the number 6?
– Increasing the number of times the coin was flipped?
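The thought experiment can be checked numerically. The following is a minimal sketch using only the Python standard library; the names `binomial_pmf` and `distribution` are illustrative and not part of the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def distribution(n, p):
    """Probabilities for 0, 1, ..., n successes."""
    return [binomial_pmf(k, n, p) for k in range(n + 1)]

coin = distribution(10, 0.5)    # number of heads in 10 coin flips
die = distribution(10, 1 / 6)   # number of sixes in 10 die rolls

# Most probable count in each case
most_likely_heads = max(range(11), key=lambda k: coin[k])
most_likely_sixes = max(range(11), key=lambda k: die[k])

# Crude text histogram of the coin-flip distribution
for k, prob in enumerate(coin):
    print(f"{k:2d} heads: {prob:.4f} {'#' * round(prob * 100)}")
```

For the fair coin the distribution is symmetric around 5 heads, with 0 and 10 heads the rarest outcomes. Swapping the coin for a die moves the peak to 1 and skews the shape towards low counts.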

Binomial Distributions

Types of Data
Discrete or Continuous
– Discrete: values come from a countable set of outcomes (e.g. counts)
– Continuous: values can fall anywhere within a range (infinitely many possible values)
Parametric or Non-parametric
– Parametric data fits a known distribution
– Parametric data fits specific properties
– Specific tests are available if and only if the data is parametric

Normal Distribution
– the curve has a single peak
– the mean (average) lies at the centre of the distribution
– the distribution is symmetrical around the mean
– the two tails of the distribution extend indefinitely and never touch the horizontal axis (continuous distribution)
– the shape of the distribution is determined by its mean (µ) and standard deviation (σ)

Variance and standard deviation
Variance is how dispersed your data is around the mean. Formalised:
– "The average of the square of the distance of each data point from the mean"
Standard deviation is the square root of the variance
– aka RMS [root mean square] deviation
– Really just the distance from the mean of an ‘average’ sample
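The definition above translates directly into a few lines of code. This is a small sketch with made-up values; it uses the population form (dividing by n), which matches the wording "the average of the square of the distance". The sample form, which divides by n - 1, is available as `statistics.variance`.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example values

mean = sum(data) / len(data)

# Average of the squared distance of each point from the mean
variance = sum((x - mean) ** 2 for x in data) / len(data)

# Root of the mean squared deviation
sd = math.sqrt(variance)

print(f"mean={mean}, variance={variance}, sd={sd}")

# The standard library gives the same population values
library_variance = statistics.pvariance(data)
library_sd = statistics.pstdev(data)
```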

Normal distribution
About 95% of the data are within 2σ [standard deviations] of the mean (more precisely, within 1.96σ)
– often loosely called the 95% confidence interval
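The "2σ" figure is a rounding, which can be seen from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within_1sd = fraction_within(1)
within_2sd = fraction_within(2)
within_196sd = fraction_within(1.96)

for k in (1, 1.96, 2, 3):
    print(f"within {k} sd: {fraction_within(k):.4f}")
```

Exactly 2σ covers a little more than 95% of the distribution, while 1.96σ is the value that gives 95%.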

Understanding 'average'
When talking about average or mean, we commonly refer to the arithmetic mean.
– sum of samples / number of samples
Other Pythagorean means: geometric and harmonic
– geometric mean: average of factors
– harmonic mean: average of rates
Other ways to describe
– Mode: most common value
– Median: central value in an ordered list of numbers

When geometric mean is useful
nth root of the product of n numbers
– Or the mean of the log values of a dataset, converted back from the log scale (e.g. 10^x for base 10)
Factors such as ratio microarray data
– e.g. for 'fold change' or other non-linear proportions
Less sensitive to extremely large values, so it can be applied to data with relatively large fluctuations.
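Both routes to the geometric mean give the same answer, as this short sketch shows. The fold-change values are made up for illustration.

```python
import math
import statistics

fold_changes = [2.0, 0.5, 4.0, 0.25]  # made-up ratios: up 2x, down 2x, up 4x, down 4x

arithmetic = statistics.mean(fold_changes)

# Route 1: nth root of the product of n numbers
nth_root = math.prod(fold_changes) ** (1 / len(fold_changes))

# Route 2: mean of the log10 values, converted back with 10**x
log_values = [math.log10(x) for x in fold_changes]
via_logs = 10 ** statistics.mean(log_values)

# Standard library equivalent (Python 3.8+)
library = statistics.geometric_mean(fold_changes)

print(f"arithmetic={arithmetic}, geometric={nth_root:.4f}")
```

The up and down changes cancel out, so the geometric mean is 1 (no net change), whereas the arithmetic mean of the same ratios is pulled above 1 by the large values.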

When harmonic mean is useful
Take the mean of the reciprocals of the values, then take the reciprocal again to convert back.
Looking at ‘rates of change’
– I’ve used it for the rate of change of nucleotide substitutions
Gives the lowest value of the three Pythagorean means (for positive values)
Good way of limiting the effect of outliers (if outliers are all large values…)
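A short sketch of the reciprocal recipe, using made-up numbers: two rates to show the calculation, and a small data set with one large outlier to show how the three means compare.

```python
import statistics

rates = [60.0, 40.0]  # made-up rates, e.g. speeds over two legs of equal distance

# Mean of the reciprocals, then the reciprocal again
harmonic = 1 / (sum(1 / r for r in rates) / len(rates))

geometric = statistics.geometric_mean(rates)
arithmetic = statistics.mean(rates)
print(f"harmonic={harmonic:.2f}, geometric={geometric:.2f}, arithmetic={arithmetic:.2f}")

# Effect of one large outlier on each mean
with_outlier = [1.0, 2.0, 3.0, 1000.0]
outlier_harmonic = statistics.harmonic_mean(with_outlier)
outlier_arithmetic = statistics.mean(with_outlier)
print(f"with outlier: harmonic={outlier_harmonic:.2f}, arithmetic={outlier_arithmetic:.2f}")
```

The large outlier drags the arithmetic mean far from the bulk of the data while barely moving the harmonic mean. The reverse holds for very small outliers, which dominate the harmonic mean.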

Why use median?
Remember: the median is the central value of a ranked list
What is the median of
Great to use for skewed distributions
Similar to the mean in a normal distribution
– Why?
Cannot really use SD or variance; instead use quartiles and interquartile range [IQR]

Quartiles and Quantiles
Quantiles are points taken at regular intervals on a ranked list of data
– The 100-quantiles are called percentiles.
– The 10-quantiles are called deciles.
– The 5-quantiles are called quintiles.
– The 4-quantiles are called quartiles.
Quartiles
– 'middle 50', or interquartile range [IQR] = 1st to 3rd quartile
– first quartile (lower quartile) cuts off lowest 25% of data = 25th percentile
– second quartile (median) cuts data set in half = 50th percentile
– third quartile (upper quartile) cuts off highest 25% of data, or lowest 75% = 75th percentile
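As a sketch of these definitions, the quartiles and IQR of a small skewed data set can be computed with `statistics.quantiles` (Python 3.8 or later). The data are made up, and other software may interpolate quantiles slightly differently from this function's default method.

```python
import statistics

# Made-up, right-skewed data: one value is far larger than the rest
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 40]

mean = statistics.mean(data)
median = statistics.median(data)

# n=4 asks for the three cut points that split the ranked data into quarters
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean:.2f}, median={median}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

The single large value pulls the mean well above the median, while the median and IQR still describe where most of the data sit.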

Visualisation: boxplot
aka candlestick
– box = 50% of data
– whisker = lines
– dots = outliers
Easy way to visualise the properties of multiple distributions beside each other
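The numbers behind a boxplot can be computed without plotting. The slides do not say how outliers are chosen; the sketch below assumes the common convention of flagging points more than 1.5 × IQR beyond the box, and uses made-up data.

```python
import statistics

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 40]  # made-up example values

# Box: first quartile to third quartile, with the median inside it
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [x for x in data if lower_fence <= x <= upper_fence]
outliers = [x for x in data if x < lower_fence or x > upper_fence]

# Whiskers run to the most extreme points that are not outliers
whisker_low, whisker_high = min(inside), max(inside)

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```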

Visualisation: Cumulative Distribution Function
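The plot for this slide is not in the transcript. As a rough sketch of what such a plot shows, an empirical CDF gives, for any value x, the fraction of the data that is less than or equal to x. The function name `ecdf` and the data are illustrative.

```python
from bisect import bisect_right

def ecdf(sample):
    """Return a function giving the fraction of the sample <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many ordered values are <= x
        return bisect_right(ordered, x) / n

    return cdf

cdf = ecdf([1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 40])  # made-up example values

for x in (0, 3, 6, 40):
    print(f"fraction of data <= {x}: {cdf(x):.3f}")
```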

How does this CDF differ?

Hypothesis testing
Define your question
– Bad: “Is this significant?”
You need to compare to a model; usually that model is random chance
– Good: “Does this data differ significantly from random chance compared to this other set?”

Hypothesis testing
Test a hypothesis, NOT a result
– Bad: Gene XYZ is the most expressed in our data set, is it significant?
OK to get a hypothesis to test from eye-balling data, but define it on a biological concept, not a cherry-picked data point
– OK to use to build a hypothesis: cold shock protein cspC is the most expressed gene, does this experiment enrich for cold shock proteins?
– OK if enough REPLICATES

Hypothesis testing
'Bayesian' analysis
– the model being tested against is not random
– Instead 'priors' exist: knowledge of the system
– Examples: the 3 envelope puzzle, odds at racing

Hypothesis testing
Test whether the data is parametric by testing it against the normal distribution
– e.g. Shapiro-Wilk or Anderson-Darling test
Question: are samples A and B different?
– Null hypothesis: A and B do not differ. The test asks how likely it is that the observed differences between A and B arise from random chance
You are testing ONE hypothesis. If it does not pass, the inverse question is not necessarily true
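One way to make "how likely are these differences under random chance" concrete is an exact permutation test. This technique is not named in the slides; it is used here only as an illustration because it needs no distributional assumptions and no external libraries. The function name and the sample values are made up, and the exhaustive enumeration is only practical for small samples.

```python
from itertools import combinations
from statistics import mean

def exact_permutation_p_value(a, b):
    """Two-sided p-value for the difference in means: the fraction of all
    ways of splitting the pooled data into groups of the same sizes that
    give a difference at least as large as the one observed."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    indices = range(len(pooled))
    extreme = 0
    total = 0
    for chosen in combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance guards against floating-point rounding
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total

sample_a = [1, 2, 3, 4, 5]    # made-up measurements
sample_b = [6, 7, 8, 9, 10]   # made-up measurements

p_value = exact_permutation_p_value(sample_a, sample_b)
print(f"p = {p_value:.4f}")
```

A small p-value says the observed difference would be rare if group labels were assigned at random. A large p-value does not show that the samples are the same, only that this test found no evidence of a difference.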

Testing 2 groups
If Normal Distribution
– Analysis of variance [ANOVA]
– e.g. t-test
Most powerful tests to use, but the data MUST resemble the parametric distribution
If Non-Parametric
– KS [Kolmogorov-Smirnov] test (Q-Q testing)
– Mann-Whitney (rank sum)
– Chi-squared
– Fisher's exact test if small numbers
Test whether the data is parametric by testing it against the normal distribution
– e.g. Shapiro-Wilk or Anderson-Darling test
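To show why rank-based tests do not depend on the shape of the distribution, here is a minimal sketch of the Mann-Whitney U statistic, which only asks how often a value from one group exceeds a value from the other. The function name and data are illustrative. Turning U into a p-value needs tables or an approximation, which a statistics library such as SciPy (`scipy.stats.mannwhitneyu`) handles.

```python
def mann_whitney_u(a, b):
    """U statistic for group a: the number of (x, y) pairs with x from a
    and y from b where x > y, counting ties as one half."""
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

group_a = [1.1, 2.3, 2.9, 3.5]        # made-up measurements
group_b = [2.0, 4.1, 5.2, 6.3, 7.0]   # made-up measurements

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)

# The two statistics always sum to the number of pairs
print(f"U_a={u_a}, U_b={u_b}, pairs={len(group_a) * len(group_b)}")
```

Because only the ordering of values matters, replacing 7.0 with 7000 in `group_b` leaves both U values unchanged.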

P-values: multiple testing
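The content of this slide is not in the transcript. As a generic illustration of the problem, the sketch below computes the chance of at least one false positive when many tests are run, and applies the Bonferroni correction, one simple and conservative adjustment. The calculation assumes the tests are independent, and the p-values are made up.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests,
    capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

fwer_20 = family_wise_error_rate(0.05, 20)
print(f"chance of a false positive in 20 tests: {fwer_20:.3f}")

raw = [0.001, 0.01, 0.04, 0.2]   # made-up p-values from four tests
adjusted = bonferroni(raw)
significant = [p <= 0.05 for p in adjusted]

for p, adj, sig in zip(raw, adjusted, significant):
    print(f"raw p={p}, adjusted p={adj:.3f}, significant={sig}")
```

With 20 tests at the 0.05 level, a false positive somewhere is more likely than not. After correction, the raw p-value of 0.04 is no longer below 0.05.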

P-values: Correlation & Causation

Replicates
Your statement about your data is limited by what you tested by replication.
– It may be significant, but for different reasons than you think
Replicates show the noise in the system: but what system?
– Technical: each experimental unit
  – Machine variance
  – Pipetting variance, temperature variance...
– Biological: changes in what you are examining
  – from person to person, cell to cell, growth condition to growth condition

Define the Number of Biological Repeats

Discuss the problems with each of these
– “We used a t-test to compare our samples”
– “These genes are the most highly expressed in my experiment: this must be significant”
– “No significant difference between these samples therefore the samples are the same”
– “Yes I have replicates, I ran the same sample 3 times”
– “We ignored those points, they are obviously wrong!”
– “X and Y are related as the p-value is 1e-168!”
– “I need you to show this data is significant”