Canadian Bioinformatics Workshops
Module 4: Backgrounder in Statistical Methods
Jeff Xia
Informatics and Statistics for Metabolomics
May 26-27, 2016
Yesterday (figure: a raw NMR spectrum, x-axis in ppm)
Today (figure: a PCA scores plot, PC1 vs. PC2, separating PAP, ANIT, and control samples)
Learning Objectives
- Learn about summary statistics and normal distributions
- Learn about univariate statistics (t-tests and ANOVA)
- Learn about p-value calculation and hypothesis testing
- Learn about multivariate statistics (clustering, PCA and PLS-DA)
What is Statistics?
Statistics is a way to get information from (usually big and complex) data:
Data → Statistics → Information
Main Components
Input: metabolomics data
- A matrix containing numerical values
- Meta-data (data about data): class labels, experimental factors
Output: useful information
- Significant features
- Clustering patterns
- Rules (for prediction)
- Models
- ...
Types of Data
The data matrix (X) and the meta-data (Y)
- Quantitative: discrete or continuous
- Categorical: binary, nominal, or ordinal
Quantitative Data
The data matrix
- Continuous: microarray intensities, metabolite concentrations
- Discrete: read counts
The two types need to be treated with different statistical models.
Categorical Data
- Binary data: 0/1, Y/N, Case/Control
- Nominal data (> two groups): Single = 1, Married = 2, Divorced = 3, Widowed = 4; order is not important
- Ordinal data: Low < Medium < High; order matters
Some Jargon (I)
- Data are the observed values of a variable.
- A variable is some characteristic of a population or sample, e.g. a gene or a compound.
- The values of a variable are its range of possible values, e.g. measurements of gene expression or metabolite concentration.
- The dimension of a dataset is the number of variables it contains; omics data are usually called high-dimensional data.
Some Jargon (II)
- Univariate: measuring one variable per subject
- Bivariate: measuring two variables per subject
- Multivariate: measuring many variables per subject
Key Concepts in Statistics
Issues when making inferences
From samples to population
So how do we know whether the effect observed in our sample was genuine? We don't. Instead we use p values to indicate our level of certainty that our results represent a genuine effect, i.e. one caused by a true effect present in the whole population.
P values
- A p value is the probability that the observed result was obtained by chance, i.e. when the null hypothesis is true (there is no difference between the two groups)
- Each test result (e.g. a t value) is associated with a particular p value
- The α level is set a priori (usually 0.05); it is essentially an acceptance threshold
- If p < α, we reject the null hypothesis and accept the experimental hypothesis, concluding that we are 95% certain our experimental effect is genuine
- If p > α, we fail to reject the null hypothesis: we cannot conclude there is a significant difference between the two groups
More on this topic later.
Summary/Descriptive Statistics
How do we describe the data?
- Central tendency (the center of the data): mean, median, mode
- Variability (the spread of the data): variance, standard deviation
- Relative standing (the distribution of data within the spread): quantiles, range, IQR (inter-quartile range)
Mean, Median, Mode
- Mean: the average value; affected by extreme values in the distribution
- Median: the "middlemost" value; usually halfway between the mode and the mean
- Mode: the most common value
In a normal distribution the mean, median, and mode are all equal; in skewed distributions they are unequal.
Mean, Median & Mode (figure: a skewed distribution with the mode, median, and mean marked)
Variance, SD and SEM
- Variance (σ²): the average squared distance to the center (mean). Because it is squared, its unit is not meaningful, and outliers contribute disproportionately.
- SD (σ), standard deviation: the square root of the variance. "Standardized", so its unit is meaningful.
- SEM, standard error of the mean: quantifies the precision of the mean, taking into account both the SD and the sample size.
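To make these three quantities concrete, here is a minimal Python sketch using NumPy with made-up concentration values. Note it uses the sample formulas (dividing by n − 1), whereas the slide equations later divide by N:

```python
import numpy as np

x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])  # toy metabolite concentrations

mean = x.mean()
var = x.var(ddof=1)          # sample variance (divides by n - 1)
sd = x.std(ddof=1)           # sample standard deviation
sem = sd / np.sqrt(len(x))   # standard error of the mean

print(f"mean={mean:.3f}, variance={var:.3f}, SD={sd:.3f}, SEM={sem:.3f}")
```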
Quantiles
- The 1st quartile (Q1) is the value below which 25% of the observations fall (75% are larger)
- Q2 is the same as the median (50% smaller, 50% larger)
- Q3 is the value above which only 25% of the observations fall
- The range is minimum to maximum
Mean vs. Variance
Most univariate tests compare the difference in means, assuming equal variances.
Univariate Statistics
Univariate means a single variable. If you measure a population using some single measure such as height, weight, test score, or IQ, you are measuring a single variable. If you plot that single variable over the whole population, recording the frequency at which each value occurs, you will get the following:
A Bell Curve (figure: frequency vs. height), also called a Gaussian or normal distribution
Features of a Normal Distribution
- Symmetric distribution
- Has an average or mean value (μ) at the centre
- Has a characteristic width called the standard deviation (σ)
- The most common type of distribution known
Normal Distribution
Almost any set of biological or physical measurements will display some variation, and these will almost always follow a normal distribution. The larger the set of measurements, the more "normal" the curve. The minimum number of measurements needed to get an approximately normal distribution is about 30-40.
Some Equations
Mean: μ = Σxᵢ / N
Variance: σ² = Σ(xᵢ − μ)² / N
Standard deviation: σ = √( Σ(xᵢ − μ)² / N )
Standard Deviation (σ) (figure: areas under the normal curve, with the regions containing 95% and 99% of values marked)
Different Distributions (figure: unimodal vs. bimodal)
Skewed Distribution
Resembles an exponential or Poisson-like distribution, with lots of extreme values (outliers) far from the mean or mode. It is hard to do useful statistical tests with this type of distribution.
Fixing a Skewed Distribution
A skewed or exponentially decaying distribution can be transformed into a "normal" or Gaussian distribution by applying a log transformation. This brings the outliers a little closer to the mean because it rescales the x-variable, and it makes the distribution much more Gaussian.
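A minimal Python sketch of this idea, using synthetic lognormal (right-skewed) data and SciPy's skewness measure; the exact numbers are illustrative only:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed data

print(f"skewness before: {skew(x):.2f}")         # strongly positive
print(f"skewness after:  {skew(np.log(x)):.2f}") # near zero: ~Gaussian
```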
Log Transformation (figure: a skewed distribution on a linear scale becomes a normal distribution after log transformation)
Log Transformation (Real Data)
Centering, scaling, and transformations
BMC Genomics. 2006; 7: 142
The Result (figure: two height distributions plotted together). Are they different?
t-tests
Compare the means between 2 samples/conditions:
- If 2 samples are taken from the same population, then they should have fairly similar means
- If 2 means are statistically different, then the samples are likely to be drawn from 2 different populations, i.e. they really are different
Types of t-tests

                   Independent samples           Related samples
Parametric         Independent-samples t-test    Paired-samples t-test
Non-parametric     Mann-Whitney U-test           Wilcoxon test
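A minimal sketch of all four tests using SciPy on synthetic data (group values and sizes are made up; for the paired tests the two arrays are assumed to be measurements on the same subjects):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(5.0, 1.0, 20)  # e.g. control group
b = rng.normal(6.0, 1.0, 20)  # e.g. treated group

print(stats.ttest_ind(a, b))     # independent-samples t-test (parametric)
print(stats.ttest_rel(a, b))     # paired-samples t-test (parametric)
print(stats.mannwhitneyu(a, b))  # Mann-Whitney U-test (non-parametric)
print(stats.wilcoxon(a, b))      # Wilcoxon signed-rank test (non-parametric, paired)
```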
Paired t-tests reduce variance (figure)
Another approach to group differences
Analysis Of VAriance (ANOVA)
- Uses variances, not means
- Handles multiple groups (> 2)
- H0: no differences between groups
- H1: differences between groups
- Based on the F distribution (F-tests)
Calculating F
F = the between-group variance divided by the within-group variance (the model variance / error variance). For F to be significant, the between-group variance should be considerably larger than the within-group variance: a large value of F indicates relatively more difference between groups than within groups, which is evidence against H0.
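A minimal sketch of a one-way ANOVA on three synthetic groups using SciPy; the group means and sizes are arbitrary:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(5.0, 1.0, 15)
g2 = rng.normal(5.5, 1.0, 15)
g3 = rng.normal(7.0, 1.0, 15)

f_stat, p_value = stats.f_oneway(g1, g2, g3)  # one-way ANOVA across 3 groups
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```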
What can be concluded from a significant ANOVA?
There is a significant difference between the groups, but NOT where this difference lies. Finding exactly where the differences lie requires further statistical analyses: post-hoc tests.
Different types of ANOVA
- One-way ANOVA: one factor with more than 2 levels
- Factorial ANOVAs: more than 1 factor (e.g. two-way, three-way). These allow for possible interactions between factors as well as main effects; for example, 2 factors with 2 levels each gives a 2 x 2 factorial design
- Mixed-design ANOVAs: some factors independent, others related
Conclusions
- t-tests assess whether two group means differ significantly; they can compare two samples, or one sample to a given value
- ANOVAs compare more than two groups, or more complicated scenarios; they use variances instead of means
Understanding P values
The p-value
The p-value is the probability of seeing a result as extreme or more extreme than the result from a given sample, if the null hypothesis is true. How do we calculate a p value?
How do we compute a p value
(figure: a normal null distribution with the 95% and 99% critical regions marked)
Non-normal distribution
What if we don't know the distribution? If the only thing we know is that the data do not follow a normal distribution, models based on the normal distribution will perform poorly. Common approaches:
- Normalization
- Non-parametric tests
- Permutation (i.e. empirical p values)
Normalization
A normal distribution is a basic assumption of many tests (t-tests, ANOVA, etc.). Common fixes:
- Log transformation
- Z-score (auto-scaling)
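A minimal sketch of z-score auto-scaling with NumPy; the matrix values are made up, with rows as samples and columns as metabolites:

```python
import numpy as np

X = np.array([[10.0, 200.0],
              [12.0, 180.0],
              [11.0, 260.0]])  # rows = samples, columns = metabolites

# Auto-scaling (z-score): centre each column, divide by its SD
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Z.mean(axis=0))         # ~0 for each column
print(Z.std(axis=0, ddof=1))  # 1 for each column
```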
Boxplots of “standardized” data
Non-parametric tests
Based on ranks, so some information is lost: Sample 1 (1000, 1, ...) and Sample 2 (1.1, 1.09, ...) are compared only by rank. Sometimes useful, but not as powerful as parametric tests when the data are normally distributed or can be normalized.
Empirical P values
The p values mentioned so far are based on well-defined models (e.g. normal distributions). What if we don't know the distribution, and the only thing we know is that the data do not follow a normal distribution? A model based on the normal distribution will perform poorly. Instead, we can derive the null distribution from the data itself (the dataset needs to be relatively big), then calculate the p value from it; this is known as an empirical p value.
Basic Principle
1. Under the null hypothesis, all data come from the same distribution
2. Calculate our statistic, such as the mean difference, from the original data
3. Shuffle the data with respect to group labels and recalculate the statistic (mean difference)
4. Repeat step 3 many times
5. Find out where our original statistic lies within the resulting null distribution
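A minimal Python sketch of this permutation procedure on synthetic case/control data; the group sizes, effect size, and number of permutations are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
case = rng.normal(5.5, 1.0, 20)
control = rng.normal(5.0, 1.0, 20)

observed = case.mean() - control.mean()
pooled = np.concatenate([case, control])

n_perm = 1000
count = 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)               # shuffle group labels
    diff = perm[:20].mean() - perm[20:].mean()   # recompute the statistic
    if abs(diff) >= abs(observed):               # two-sided comparison
        count += 1

p_emp = (count + 1) / (n_perm + 1)               # +1 avoids p = 0
print(f"observed diff = {observed:.3f}, empirical p = {p_emp:.4f}")
```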
A simple example
To find out whether there is a mean difference between case vs. control. (The slide shows a table of case and control values with their group means; the observed mean difference is 0.541.)
Permutation One
The case/control labels are randomly swapped and the statistic recomputed. (The slide shows the same values with shuffled labels; the permuted mean difference is 0.329.)
Permutations
Repeat many, many times (e.g. 1000 times), recording the permuted statistic each time.
Compute empirical p value
Out of 1000 permutations:
- If three of the permuted mean differences are at least as large as the original one, p = 3/1000 = 0.003
- If none of the permuted mean differences is bigger than the original one, report p < 1/1001 (the extra 1 in the denominator prevents a p value of exactly zero)
General Advantages
- Does not rely on distributional assumptions
- Corrects for hidden selection
- Corrects for hidden correlation
Disadvantage? Computationally intensive.
Question
If you were asked to compute empirical p values for a multiple-group comparison (i.e. ANOVA), what statistic would you calculate from the permuted samples?
Hypothesis Testing & multiple testing issues
Hypothesis Testing
We start by assuming the null hypothesis (H0) is true, and asking: "How likely is the result we got from our sample?" If that probability (the p-value) is small, it suggests the observed result cannot easily be explained by chance.
Hypothesis Testing (more details)
Goal: make statements regarding unknown population parameter values based on sample data.
Elements of a hypothesis test:
1. Assume that samples are randomly selected from the population
2. State the null hypothesis (H0) that the distributions of the values in the two populations are the same
3. Define a threshold (α) for declaring a p value significant
4. Select an appropriate statistical test and calculate the p value
5. If the p value is less than α, conclude that the difference is statistically significant and reject the null hypothesis; otherwise, conclude that the difference is not statistically significant and declare that the null hypothesis cannot be rejected
Hypothesis Testing & P Value
The p value is the probability of obtaining a test statistic at least as extreme as the one actually observed, if the null hypothesis is true. One "rejects the null hypothesis" when the p value is less than the significance level α, which is often 0.05 or 0.01. When the null hypothesis is rejected, the result is said to be statistically significant.
Multiple Testing Issues
Omics data are high-dimensional: 100s to ~10,000s of variables mean lots of hypothesis tests. Performing t-tests on a typical microarray dataset might mean performing 10,000 separate hypothesis tests. If we use a standard p value cut-off of 0.05, we would expect 500 (10,000 × 0.05) genes to be deemed "significant" by chance alone.
Multiple Testing Correction (I)
Family-Wise Error Rate (FWER), e.g. Bonferroni correction:
Corrected p-value = p-value × n (the number of genes tested), required to be < 0.05.
As a consequence, if testing 1,000 genes at a time, the highest accepted individual p value is 0.05/1000 = 0.00005, making the correction very stringent. With a family-wise error rate of 0.05 (i.e., the probability of at least one error in the family), the expected number of false positives will be 0.05.
Multiple Testing Correction (II)
False Discovery Rate (FDR), e.g. Benjamini-Hochberg:
An FDR of 0.05 means that 5% of the genes called significant are expected to be false positives; e.g., if 100 genes are declared differentially expressed, about 5 of them are expected to be false positives. By controlling the FDR, one can choose how many of the subsequent follow-up experiments one is willing to have be in vain.
High-dimensional data
So far, our analyses deal with a single variable at a time (univariate analysis):
- t-tests: one variable, two groups
- ANOVA: one variable, > 2 groups
- First analyze a single variable, then apply the procedure to all variables, and finally adjust for multiple testing
Visualization is limited to three dimensions. How can we analyze and visualize high-dimensional data?
Multivariate Statistics
Multivariate means multiple variables. If you measure a population using multiple measures at the same time, such as height, weight, hair colour, clothing colour, and eye colour, you are performing multivariate statistics. Multivariate statistics requires more complex multidimensional analyses or dimension-reduction methods.
Normal distribution – a single variable
Bivariate Normal
Trivariate Normal
The Reality
Most statistical methods were developed before the omics era and are designed for single or very few variables (typically fewer than 10): t-tests, ANOVA, linear/logistic regression. These methods assume there are more samples (n) than variables (p), i.e. n > p, for parameter estimation. In omics data, n << p.
The Practice
Omics data analysis has developed beyond the scope of classical statistics. It borrows a lot of "strength" from several other fields that deal with the large and complex data of the digital age:
- Machine learning
- Chemometrics
- Visualization
Machine Learning
- Unsupervised learning: explore the data to find intrinsic structures in them, regardless of whether those structures relate to the class labels.
- Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute. These patterns can then be used to predict the values of the target attribute in future data instances.
Unsupervised Learning methods for high-dimensional data
- Clustering: organize the 1000s of variables into blocks; variables in each block are more homogeneous, and each block can be treated as a unit
- Dimension reduction: reduce the high-dimensional data into low dimensions, i.e. from 1000s of variables to 2-3 variables, e.g. principal component analysis
Clustering
Definition: a process by which objects that are logically similar in characteristics are grouped together.
Clustering Requires...
- A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
- A threshold value with which to decide whether an object belongs with a cluster
- A way of measuring the "distance" between two clusters
- A cluster seed (an object to begin the clustering process)
Two common clustering algorithms
- K-means or partitioning methods: divide a set of N objects into M clusters, with or without overlap
- Hierarchical methods: produce a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
K-means clustering
1. Make the first object the centroid for the first cluster
2. For the next object, calculate its similarity to each existing centroid
3. If the similarity is greater than a threshold, add the object to the existing cluster and re-compute the centroid; otherwise use the object to start a new cluster
4. Return to step 2 and repeat until done
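In practice one rarely hand-codes this loop. Below is a minimal sketch using scikit-learn's KMeans on synthetic data; note that scikit-learn implements the standard iterative centroid-refinement version of k-means rather than the exact threshold-based procedure described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
# Two synthetic groups of samples in 5 dimensions
X = np.vstack([rng.normal(0, 1, (20, 5)),
               rng.normal(4, 1, (20, 5))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment per sample
print(km.cluster_centers_)  # final centroids
```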
K-means clustering – initial clusters (figure: choose the first objects, compute their centroids, then test each new object and join it to the nearest cluster)
Nearest Neighbor Algorithm
The nearest neighbour algorithm is an agglomerative (bottom-up) approach. It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.
Hierarchical Clustering
1. Find the two closest objects and merge them into a cluster
2. Find and merge the next two closest objects (or an object and a cluster, or two clusters), using some similarity measure and a predefined threshold
3. If more than one cluster remains, return to step 2
Key Parameters: similarities
- Similarity between samples: calculate the similarity between all possible pairs of profiles; the two most similar clusters are grouped together to form a new cluster
- Similarity between clusters: calculate the similarity between the new cluster and all remaining clusters
Similarity Measurements
- Euclidean distance: measures absolute differences between profiles
- Pearson correlation: measures trend similarity, ranging from −1 to +1
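A minimal sketch contrasting the two measures, using SciPy on two made-up profiles that share a trend but differ in magnitude:

```python
import numpy as np
from scipy.spatial.distance import euclidean
from scipy.stats import pearsonr

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # same trend as a, different magnitude

print(euclidean(a, b))    # large: absolute values differ
print(pearsonr(a, b)[0])  # 1.0: identical trend
```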
Clustering: Single Linkage
Dissimilarity between two clusters = the minimum dissimilarity between members of the two clusters. Tends to generate "long chains".
Clustering: Complete Linkage
Dissimilarity between two clusters = the maximum dissimilarity between members of the two clusters. Tends to generate "clumps".
Clustering: Average Group Linkage
Dissimilarity between two clusters = the distance between the two cluster means.
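A minimal sketch of hierarchical clustering under these three linkage rules, using SciPy on synthetic data; the cluster count and data values are arbitrary:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 4)),
               rng.normal(5, 1, (10, 4))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree into 2 clusters
    print(method, labels)
```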
Hierarchical clustering & heatmap
(figure: find the two most similar metabolite expression levels or curves, find the next closest pair, and iterate; the nested merges form the dendrogram beside the heatmap)
Principal Component Analysis (PCA)
Project high-dimensional data into lower dimensions that capture the most variance in the data.
Assumption: the main directions of variance ≈ the major data characteristics.
Visualizing PCA
PCA of a "bagel":
- One projection produces a wiener (hot dog)
- Another projection produces an "O"
- The "O" projection captures most of the variation and has the largest eigenvalue (PC1)
- The wiener projection is PC2 and gives depth information
PCA - The Details
PCA involves calculating the eigenvalue (singular value) decomposition of the data covariance matrix. PCA is an orthogonal linear transformation: it transforms the data to a new coordinate system so that the greatest variance of the data comes to lie on the first coordinate (the 1st PC), the second greatest variance on the 2nd PC, and so on. Given samples s1 ... sk measured on variables x1 ... xn, each score is a loading-weighted combination of the original variables (scores = loadings × data):
t1 = p1x1 + p2x2 + p3x3 + ... + pnxn
where the scores t are uncorrelated (orthogonal) eigenvectors and p are the loadings.
PCA - The Details
From k original variables x1, x2, ..., xk, produce k new variables t1, t2, ..., tk:
t1 = a11x1 + a12x2 + ... + a1kxk
t2 = a21x1 + a22x2 + ... + a2kxk
...
tk = ak1x1 + ak2x2 + ... + akkxk
PCA - The Details
Such that:
- the tk's are uncorrelated (orthogonal)
- t1 explains as much as possible of the original variance in the data set
- t2 explains as much as possible of the remaining variance, and so on
(figure: a data cloud with the 1st principal component t1 and the 2nd principal component t2 as orthogonal axes)
Principal Components Analysis on:
- Covariance matrix: variables must be in the same units; emphasizes variables with the most variance
- Correlation matrix: variables are standardized (mean 0.0, SD 1.0); variables can be in different units; all variables have the same impact on the analysis
Eigenfaces
A famous application of PCA is face recognition, where the principal components are called eigenfaces. Eigenfaces look somewhat like generic faces.
Widely used in metabolomics
(figure: hundreds of peaks reduced to 2 components; the scores plot of PC1 vs. PC2 separates the PAP, ANIT, and control groups)
PCA Loadings Plot
The loadings plot shows how much each variable (metabolite) contributed to the different principal components. Variables at the extreme corners contribute most to the separation seen in the scores plot.
Scores & Loadings
PCA Details/Advice
In some cases PCA will not identify any clear clusters or obvious groupings no matter how many components are used. If so, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished. As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, it is probably not worthwhile using other statistical techniques to try to separate them.
PCA Summary
Rotates a multivariate dataset into a new configuration that is easier to interpret. Purposes:
- Data overview
- Outlier detection
- Looking at relationships between variables
PLS-DA
When the experimental effects are subtle or moderate, PCA will not show good separation patterns. PLS-DA is a supervised method: it maximizes the covariance between the data matrix (X) and the class labels (Y). It therefore tends to always produce some separation pattern with respect to the conditions.
PCA vs. PLS-DA
Use PLS-DA with Caution
PLS-DA is based on regression: it first converts the class labels into numbers and then performs PLS regression between the data matrix and the numerical Y. PLS-DA is susceptible to over-fitting, producing patterns of separation even for data randomly drawn from the same population. You therefore need cross-validation and permutation tests.
Cross Validation
Goal: test whether your model can predict class labels for new samples, by splitting the dataset into a training set and a test set (figure).
Common Splitting Strategies
Leave-one-out (n-fold cross-validation) (figure)
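scikit-learn has no PLS-DA class as such; a common workaround, sketched below on synthetic data, is to code the class labels as numbers and use PLSRegression with a 0.5 decision threshold inside leave-one-out cross-validation. All data and parameter choices here are illustrative:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, (15, 50)),
               rng.normal(0.8, 1.0, (15, 50))])
y = np.array([0] * 15 + [1] * 15)  # class labels coded as numbers

correct = 0
for train, test in LeaveOneOut().split(X):
    pls = PLSRegression(n_components=2).fit(X[train], y[train])
    pred = (pls.predict(X[test]).ravel() > 0.5).astype(int)  # 0.5 threshold
    correct += int(pred[0] == y[test][0])

print(f"LOOCV accuracy: {correct / len(y):.2f}")
```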
Components and Features
Cross-validation is also used to determine the optimal number of components needed to build the PLS-DA model. Three common performance measures:
- Sum of squares captured by the model (R2)
- Cross-validated R2 (also known as Q2)
- Prediction accuracy
Permutation Tests
Goal: test whether your model is significantly different from null models.
1. Randomly shuffle the class labels (y) and build the (null) model between the new y and X
2. Test whether similar patterns of separation still appear
3. Repeat to compute empirical p values
If the result is similar to the permuted results (i.e. the null models), then we cannot say y and X are significantly correlated.
Compute empirical p values
Over 1000 permutations, if none of the permuted statistics is bigger than the original one, report p < 1/1001 (the extra 1 in the denominator prevents a p value of exactly zero).
PLS-DA VIP Score
Variable Importance in Projection (VIP) scores estimate the importance of each variable in the projection used in a PLS-DA model, and are often used for variable selection. A variable with a VIP score > 1 can be considered important in a given model; variables with VIP scores < 1 are less important and may be good candidates for exclusion from the model.
VIP Plots
116
Assessing Regression Model Performance
How we assess a model depends on how we want to use it: to understand the relationship between the predictors and the response, or to predict future observations' responses. One basic measure is the root mean squared error:
RMSE = √( Σ(yᵢ − ŷᵢ)² / n )
Assessing Classification Model Performance
For balanced data:
- Accuracy: 9/13 correct => 69% accuracy
- Error rate: 1 − accuracy => 31%
Accuracy is not suitable for imbalanced data. In a population where cancer incidence is low (~5 cases in 1000 people), a classifier that predicts all people to be healthy (majority vote) is 99.5% accurate while detecting no cases.
Evaluating Performance
Basic concepts: true positives (TP), true negatives (TN), false positives (FP), false negatives (FN).
- Sensitivity (Sn) = true positive rate: Sn = TP/(TP + FN)
- Specificity (Sp) = true negative rate: Sp = TN/(TN + FP)
An Example (figure: test-result distributions for control and disease groups; patients above the cutoff are called "positive" and those below are called "negative", giving the TP, TN, FP, and FN counts)
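A minimal sketch computing sensitivity and specificity from hypothetical classifier calls with NumPy:

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])  # 1 = disease, 0 = control
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 1])  # classifier calls

tp = np.sum((y_pred == 1) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

print(f"Sn = {tp / (tp + fn):.2f}")  # sensitivity, true positive rate
print(f"Sp = {tn / (tn + fp):.2f}")  # specificity, true negative rate
```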
ROC curves
ROC = Receiver Operating Characteristic, a historic name from radar studies. Very popular in biomedical applications, to assess the performance of classifiers and to compare different biomarker models. An ROC curve is a plot of the true positive rate (TPR) vs. the false positive rate (FPR) for a binary classifier (i.e. positive/negative) as its cutoff point is varied.
ROC Curve
(figure: true positive rate (sensitivity) vs. false positive rate (1 − specificity), each from 0% to 100%; the curve shows the tradeoff between sensitivity and specificity)
Area under the ROC curve (AUC)
An overall measure of test performance. Comparisons between two tests can be based on differences between their (estimated) AUCs.
(figure: example ROC curves)
- AUC = 100%: the best possible test
- AUC = 95%: a good test
- AUC = 70%: a poor test
- AUC = 50%: the worst case (no discrimination)
Other Supervised Classification Methods
- SIMCA: Soft Independent Modeling of Class Analogy
- OPLS: Orthogonal Projections to Latent Structures
- Support Vector Machines
- Random Forest
- Naïve Bayes classifiers
- Neural Networks
Breaching the Data Barrier
- Unsupervised methods: PCA, K-means clustering, Factor Analysis
- Supervised methods: PLS-DA, LDA, PLS-Regression
- Machine learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets
Data Analysis Progression
1. Unsupervised methods: PCA or clustering to see if natural clusters form or if the data separate well; the data are "unlabeled" (no prior knowledge)
2. Supervised methods / machine learning: the data are labeled (prior knowledge); used to see if the data can be classified, and to help separate less obvious clusters or features
3. Statistical significance: supervised methods always generate clusters, which can be very misleading; check whether the clusters are real by label permutation
Note of Caution
Supervised classification methods are powerful: they learn from experience, generalize from previous examples, and perform pattern recognition. But too many people skip the PCA or clustering steps and jump straight to supervised methods. Some get great separation and think the job is done; this is where the errors begin. Too many don't assess significance using permutation testing or n-fold cross-validation. If separation isn't at least partially obvious by eyeballing your data, you may be treading on thin ice.