Published by Beryl Henderson. Modified over 7 years ago.
1
Statistical Analyses of Life Science and Pathology from Genomic Perspective
Ryo Yamada, Unit of Statistical Genetics, Kyoto University
2
Contents of Today
1. Overview of Statistical Methods
2. Genotypes and Phenotypes ~Data Records for Statistical Analyses~
3
Goals of Today
Get a perspective covering many topics.
Grasp the idea of each topic. Do not attempt to understand the details. Make a list of terms, some of which you might study further after today’s lecture. Understand that some ideas appear in multiple analysis settings: several basic ideas are combined to handle specific tasks in various settings.
4
Overview of Statistical Methods
5
Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data
Tests, Estimation/Inference, Classification/Clustering
Multi-dimensional/High-dimensional Data
Random value-based approaches
Others: Experimental Designs
7
Quality Control of Noisy High-Throughput Data
Systematic errors/biases: samples, reagents, date/machine/personnel effects
How to correct or control the noise:
Outlier detection
Transformation of all records with a function
Normalization for “locational effects”
“Control samples”
8
Outlier detection
9
Transformation of all records with a function
Genomic control for GWAS
Preprocessing microarray data
Median-based correction
Log-transformation
10
Normalization for “locational effects”
Tendency should be considered.
Batch effects should be considered.
Non-data-driven
Data-driven
11
Quality Control of High-Throughput Data: Correction/control
Systematic errors/biases: samples, reagents, date/machine/personnel effects
Correction/control:
Outlier detection
Transformation of all records with a function
Normalization for “locational effects”
“Control samples”
12
Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data Tests, Estimation/Inference, Classification/Clustering Multi-dimensional/High-dimensional Data Random value-based approaches Others : Experimental Designs
13
Tests, Estimation/Inference, Classification/Clustering
Tests: Significance, Error Controlling, Multiple-testing issue
Estimation/Inference: Interval, Models, Bayes
Classification/Clustering: Unsupervised Learning vs. Supervised Learning
15
Multiple Comparison P-value vs. Q-value
16
Multiple Comparison Almost all hypotheses are NULL
17
Small p-values are likely when testing many hypotheses
1 test: p is uniform between 0 and 1
10 tests: the minimum p should be close to 0, around 0.1
100 tests: closer to 0, around 0.01
…
18
Uniform distribution
19
2 tests: how is the smaller p-value distributed? (Figure axes: p1, p2)
20
Minimum p-value distribution
Histogram of min-p for 2^10 tests, with its mean. Min-p may occasionally take a much larger value than the mean, but in many cases min-p is smaller than the mean. Such small values are not rare.
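The behaviour of the minimum p-value can be worked out directly. The derivation below is a sketch that assumes n independent tests whose p-values are all uniform on [0, 1] under the null; the symbol n for the number of tests is an assumption borrowed from the R code later in the lecture.

```latex
\begin{aligned}
P\left(\min_i p_i \le x\right) &= 1 - (1 - x)^n, \qquad 0 \le x \le 1, \\
E\left[\min_i p_i\right] &= \frac{1}{n + 1}, \\
P\left(\min_i p_i < \frac{1}{n + 1}\right) &= 1 - \left(\frac{n}{n + 1}\right)^n \approx 1 - e^{-1} \approx 0.63 \quad (n \text{ large}).
\end{aligned}
```

For n = 2^10 the mean is 1/1025, and roughly 63% of the simulated minimum p-values fall below it, which is why values smaller than the mean are not rare.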
21
Minimum p-value distribution
[Figure: minimum p-value for 1, 2, 4, 8, … tests]
22
NON-NULL, FDR (False Discovery Rate)
Many hypotheses are NON-NULL, or almost all hypotheses are NON-NULL
23
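P-value
# Minimum p-value among n independent null tests
n <- 2^(0:20)
n.iter <- 10^5
minps <- matrix(0, n.iter, length(n))
for (i in seq_along(n)) {
  # The minimum of n[i] uniform p-values follows Beta(1, n[i]); sampling it
  # directly avoids an n.iter x n[i] matrix (10^5 x 2^20 does not fit in memory)
  minps[, i] <- rbeta(n.iter, 1, n[i])
}
boxplot(minps)
boxplot(log(minps, 10))
hist(minps[, 11], main = "2^10")

# Mixture of null and non-null test statistics
N <- 9000  # number of null hypotheses
n <- 1000  # number of non-null hypotheses
K <- rchisq(N, 1)
k <- rchisq(n, 1, ncp = 6)
hist(c(K, k), density = 15)
hist(K, density = 17, col = 3, add = TRUE)
hist(k, add = TRUE, col = 2, density = 21)
P <- pchisq(K, 1, lower.tail = FALSE)
p <- pchisq(k, 1, lower.tail = FALSE)
hist(c(P, p), density = 15)
hist(p, density = 21, col = 2)
hist(P, density = 17, add = TRUE, col = 3)

# Sorted p-values, colored by origin (1 = null, 2 = non-null)
Pp <- c(P, p)
col <- c(rep(1, N), rep(2, n))
ord <- order(Pp)
plot(Pp[ord], col = col[ord], pch = 20, cex = 0.1, type = "h")
plot(Pp[ord], col = col[ord], pch = 20, cex = 0.1, type = "h", ylim = c(0, 0.1), xlim = c(0, 3 * 10^3))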
24
Combination of two distributions
Uniform p-values Small p-values
28
Pick the smaller p-values. The threshold should change with the rank of each p-value. In this way the fraction of “true positives” among the selected hypotheses is controlled (equivalently, the false discovery rate).
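A rank-dependent threshold of this kind can be sketched with the Benjamini-Hochberg procedure. The slide does not name a specific procedure, so Benjamini-Hochberg is an assumption here, as are the level alpha = 0.05 and the variable names; the simulated p-values reuse the 9000 null / 1000 non-null setting of the lecture's R code.

```r
set.seed(1)
N <- 9000  # null hypotheses
n <- 1000  # non-null hypotheses
P <- pchisq(rchisq(N, 1), 1, lower.tail = FALSE)
p <- pchisq(rchisq(n, 1, ncp = 6), 1, lower.tail = FALSE)
Pp <- c(P, p)
is.null <- c(rep(TRUE, N), rep(FALSE, n))

alpha <- 0.05          # assumed target false discovery rate
m <- length(Pp)
ord <- order(Pp)

# Rank-dependent threshold: the p-value of rank r is compared with alpha * r / m.
# Select every hypothesis up to the largest rank that passes.
below <- which(Pp[ord] <= alpha * seq_len(m) / m)
k <- if (length(below) > 0) max(below) else 0
selected <- rep(FALSE, m)
if (k > 0) selected[ord[seq_len(k)]] <- TRUE

# The same selection through the built-in adjusted p-values
selected.builtin <- p.adjust(Pp, method = "BH") <= alpha

# Realized fraction of false positives among the selected hypotheses
fdp <- sum(selected & is.null) / max(1, sum(selected))
cat("selected:", sum(selected), " false discovery proportion:", fdp, "\n")
```

Because the truth is known in the simulation, the realized false discovery proportion can be compared with alpha; with real data only the selection itself is observable.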
29
Large-scale inference
When you observe many items at once, their distribution is informative. Estimates for each observation that use this information differ from estimates that do not. The q-value of FDR is one such estimate. Use the information in the distribution when many items are observed together: Empirical Bayes
30
Tests, Estimation/Inference, Classification/Clustering
Significance, Error Controlling, Multiple-testing issue Estimation/Inference Interval, Models, Bayes Classification/Clustering Unsupervised Learning vs. Supervised Learning
31
Estimation/Inference
Models, Parameters, Interval, Bayes
Uniform p-values + small p-values
Assuming a mixture of two distributions: this is a model.
34
Estimation/Inference
Samples → Point estimates, Interval estimates
Sample distribution, Theoretical estimates, unbiased estimates, …
Frequentist: the statement “The star’s weight is between a and b” will be right 9 times out of 10.
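The "right 9 times out of 10" reading of an interval estimate can be checked by simulation. The sketch below assumes normally distributed measurements and a t-based 90% interval; the true value, sample size and number of repetitions are hypothetical choices, not taken from the lecture.

```r
set.seed(1)
true.weight <- 10   # hypothetical true value (the "star's weight")
n.obs <- 20         # measurements per experiment
n.rep <- 2000       # number of repeated experiments

covered <- logical(n.rep)
for (i in seq_len(n.rep)) {
  x <- rnorm(n.obs, mean = true.weight, sd = 2)
  ci <- t.test(x, conf.level = 0.9)$conf.int   # interval "between a and b"
  covered[i] <- ci[1] <= true.weight && true.weight <= ci[2]
}

# Fraction of experiments in which the statement was right
coverage <- mean(covered)
cat("coverage:", coverage, "\n")
```

The interval changes from experiment to experiment while the true value stays fixed; the 90% refers to how often the procedure captures it.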
35
Estimation/Inference
Frequentists vs. Bayesians
Frequentist approaches are difficult for students who are not good at mathematics, and their thinking processes are not easy to follow. Bayesian thinking processes, in contrast, tend to be easy for many to follow.
36
Estimation/Inference
Bayesian
A model has parameter(s)
Data + Model → Estimation of parameter values
Likelihood-based: maximum-likelihood estimates; interval estimates based on likelihood
37
Estimation/Inference
Frequentists vs. Bayesians
Use both rather than selecting one of them; that is the way of the 21st century.
Bayesian approaches seem to be used more and more, because:
Models have become more complicated.
Computer assistance: complicated distributions can be handled by simulation.
Large-scale data: Empirical Bayes approaches can be applied.
42
Estimation/Inference
Frequentists vs. Bayesians
A “prior” distribution is necessary. What is the “appropriate prior”?
44
Success rate: no information at all
Somebody you don’t know at all will take an exam on which you have no information at all. How likely do you think (s)he will pass it?
Jeffreys prior: one of the non-subjective priors
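For a success rate, the Jeffreys prior is the Beta(1/2, 1/2) distribution. The sketch below shows the prior and how it is updated by data; the observed counts (7 passes in 10 exams) are hypothetical and only serve as an illustration.

```r
# Jeffreys prior for a success probability: Beta(1/2, 1/2)
a0 <- 0.5
b0 <- 0.5
prior.mean <- a0 / (a0 + b0)   # the answer with no information at all

# Hypothetical data: 7 passes observed in 10 exams
s <- 7
m <- 10

# Conjugate update: posterior is Beta(a0 + successes, b0 + failures)
a1 <- a0 + s
b1 <- b0 + (m - s)
post.mean <- a1 / (a1 + b1)
post.interval <- qbeta(c(0.05, 0.95), a1, b1)   # central 90% credible interval

cat("prior mean:", prior.mean, " posterior mean:", post.mean, "\n")
curve(dbeta(x, a0, b0), from = 0.01, to = 0.99, ylab = "density", main = "Jeffreys prior")
```

The prior puts most of its weight near 0 and 1 but is symmetric, so its mean is one half; a handful of observations quickly dominates it.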
45
Estimation/Inference
Frequentists vs. Bayesians
Use both rather than selecting one of them; that is the way of the 21st century.
Large-scale inference: the prior can be set based on the data set ~ empirical Bayes
46
Tests, Estimation/Inference, Classification/Clustering
Significance, Error Controlling, Multiple-testing issue Estimation/Inference Interval, Models, Bayes Classification/Clustering Unsupervised Learning vs. Supervised Learning
47
Classification/Clustering
Multi-dimension/High-dimension, first.
48
Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data Tests, Estimation/Inference, Classification/Clustering Multi-dimensional/High-dimensional Data Random value-based approaches Others : Experimental Designs
49
Multi-dimensional/High-dimensional Data
There is no way to visualize high-dimensional data directly. It is almost impossible for us to understand high-dimensional data as they are.
50
Multi-dimensional/High-dimensional Data
How many dimensions can we handle? 2D space or 3D space
Extra dimensions: gray/color scale, arrows, time
52
Multi-dimensional/High-dimensional Data
Dimension reduction
Pick 2 or 3 dimensions that seem important; then visualization is easy and we feel we can understand them.
PCA (Principal Component Analysis)
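A minimal PCA sketch follows. The data are simulated under the assumption that 10 observed features are driven by 2 hidden factors plus small noise; all names and sizes are hypothetical.

```r
set.seed(1)
n.samples <- 200
n.features <- 10

# Hypothetical data: 2 hidden factors mixed into 10 observed features, plus noise
latent <- matrix(rnorm(n.samples * 2), ncol = 2)
loading <- matrix(rnorm(2 * n.features), nrow = 2)
noise <- matrix(rnorm(n.samples * n.features, sd = 0.1), ncol = n.features)
x <- latent %*% loading + noise

# PCA: rotate to uncorrelated axes ordered by variance
pca <- prcomp(x, center = TRUE, scale. = FALSE)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
top2 <- sum(var.explained[1:2])
cat("variance explained by the first two components:", top2, "\n")

# Two-dimensional view of the ten-dimensional data
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", pch = 20)
```

Here two components carry almost all the variance because the data were built that way; with real data the proportion explained indicates how much is lost in the 2D picture.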
54
Multi-dimensional/High-dimensional Data
Dimension reduction
Pick 2 or 3 dimensions that seem important; then visualization is easy and we feel we can understand them.
Only a few dimensions are truly meaningful and all the others are noise. Pick the true dimensions.
LASSO, compressed sensing
55
Multi-dimensional/High-dimensional Data
The space is high-dimensional but the data are low-dimensional.
Manifold learning
Put data into a higher-dimensional space and pull them back to a low-dimensional space.
56
High-dimensionality Many genes, many biomarkers, many features
57
Multi-dimensional/High-dimensional Data
Life-science data are high-dimensional.
The number of observed items is huge, but the items are strongly correlated with each other, and their dimension is much smaller in reality.
FACS
Ethnic diversity
59
Multi-dimensional/High-dimensional Data
Objects with low dimensions in higher-dimensional space
Topology
Graph, network and topology
61
Multi-dimensional/High-dimensional Data
Graph: itemize, and connect items that have a relation.
Only pairwise relations are considered; trio-wise or higher-order relations are not.
63
Multi-dimensional/High-dimensional Data
Graph and its matrix representation and linear algebra
Graphs tend to be sparse … Sparse analysis
64
Multi-dimensional/High-dimensional Data
Two important features:
No “common” individuals
Sparse
65
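High-dimensionality: no “common” individuals
Central area: a sphere inscribed in a cube. In two dimensions the circle covers 3.14 / 4 = 0.785 of the square. This central fraction shrinks rapidly as the dimension grows, so hardly any individual is “common” (close to the center) in high dimensions.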
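The central fraction can be computed for any dimension from the standard volume formula of a d-dimensional ball. The function name and the chosen dimensions below are illustrative assumptions.

```r
# Fraction of a unit cube occupied by the inscribed ball (radius 1/2) in d dimensions:
# volume of a d-ball of radius r is pi^(d/2) * r^d / gamma(d/2 + 1)
ball.fraction <- function(d) pi^(d / 2) / (2^d * gamma(d / 2 + 1))

d <- c(1, 2, 3, 5, 10, 20)
frac <- ball.fraction(d)
print(data.frame(dimension = d, central.fraction = frac))

# Log scale makes the collapse of the central region visible
plot(d, frac, log = "y", type = "b", xlab = "dimension", ylab = "fraction inside the ball")
```

For d = 2 the function returns pi / 4, the value on the slide; for larger d almost all of the cube's volume sits in the corners, away from the center.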
66
High-dimensionality Sparse
To estimate density, you need a reasonable number of samples per small cubic volume, but the fraction of the space covered by a small cube of side 0.1 shrinks with the dimension:
Dim = 1 : 0.1
Dim = 2 : 0.01
Dim = 3 : 0.001
…
Dim = 6 :
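The consequence for sample size can be sketched as follows, assuming the data fill a unit cube uniformly and that about 10 samples per small cube are wanted; both the uniformity and the target of 10 are hypothetical choices for illustration.

```r
side <- 0.1        # side length of the small cube, as on the slide
d <- 1:6           # dimensions considered

# Fraction of the unit cube covered by one small cube
cell.fraction <- side^d

# Samples needed so that one small cube holds per.cell samples on average
per.cell <- 10     # hypothetical target
n.needed <- per.cell / cell.fraction

print(data.frame(dimension = d, cell.fraction = cell.fraction, samples.needed = n.needed))
```

Each added dimension multiplies the required sample size by ten, which is why density estimation on a regular grid breaks down quickly.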
67
High-dimensionality Quite spacious, but reasonably dense distribution.
Distribution should be low dimensional.
69
Low dimensional distribution in higher dimensional space and its local density
Regular density estimation methods do not work: small cubes are still spacious in high-dimensional space.
How to estimate local density: the k-nearest neighbor method
In graph theory, a similar idea is applicable: the minimum spanning tree
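A k-nearest-neighbor density estimate can be sketched in base R. The example assumes points scattered along a one-dimensional curve embedded in 10 dimensions and, as a further assumption, treats the intrinsic dimension as 1 when converting neighbor distances into a density; k = 10 and all other settings are hypothetical.

```r
set.seed(1)
n <- 300
d <- 10
u <- runif(n)   # position along the curve

# Hypothetical data: a one-dimensional curve in 10 dimensions, plus small noise
x <- sapply(1:d, function(j) sin(j * u)) + matrix(rnorm(n * d, sd = 0.01), ncol = d)

k <- 10
dist.mat <- as.matrix(dist(x))

# Distance to the k-th nearest neighbor; position 1 after sorting is the point itself
r.k <- apply(dist.mat, 1, function(row) sort(row)[k + 1])

# Local density, assuming intrinsic dimension 1: k points within a segment of length 2 * r.k
dens <- k / (n * 2 * r.k)

plot(u, dens, pch = 20, xlab = "position along the curve", ylab = "k-NN density estimate")
```

The neighborhood size adapts to the data: where points are crowded the k-th neighbor is close and the estimate is high, with no grid of small cubes required.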
70
Sparse in highly dimensional space
73
Sparse in highly dimensional space
How sparse? One-dimensional manifolds, but with significant variance
Clustering
74
Tests, Estimation/Inference, Classification/Clustering
Significance, Error Controlling, Multiple-testing issue Estimation/Inference Interval, Models, Bayes Classification/Clustering Unsupervised Learning vs. Supervised Learning
75
Two types of clustering methods
Hierarchical Non-hierarchical
76
Hierarchical Tree structure --- Graph, again
Its structure has information.
Its structure is related to dimension.
Distance is defined on the tree.
Some phenomena have reasons to be analyzed hierarchically.
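A hierarchical clustering sketch follows, including the distance defined on the tree. The two simulated groups, the Euclidean distance and the average linkage are assumptions chosen for illustration.

```r
set.seed(1)
# Hypothetical data: two groups of 20 points each in 5 dimensions
x <- rbind(matrix(rnorm(20 * 5, mean = 0), ncol = 5),
           matrix(rnorm(20 * 5, mean = 5), ncol = 5))

# Build the tree from pairwise distances
tree <- hclust(dist(x), method = "average")

# Cutting the tree at a chosen level gives a flat (non-hierarchical) grouping
groups <- cutree(tree, k = 2)

# Distance on the tree: the height at which two points are first joined
tree.dist <- cophenetic(tree)

plot(tree, labels = FALSE, main = "Hierarchical clustering")
table(groups)
```

The same tree yields groupings at every resolution, so the number of clusters can be chosen after looking at the structure.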
77
Classification Separate something difficult to segregate.
J. Med. Imag. 1(3), (Oct 09, 2014). doi: /1.JMI
79
Classification/Clustering
Unsupervised Learning
Supervised Learning
No teacher, but we want to check whether the classification criterion is reliable or not.
Cross-validation: one of the resampling methods
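Cross-validation can be sketched as below. The nearest-centroid classifier, the 5 folds and the simulated two-class data are all assumptions made for the illustration; the lecture does not specify a classifier.

```r
set.seed(1)
n <- 100
label <- rep(c(0, 1), each = n / 2)

# Hypothetical data: two features whose mean shifts by 2 between the classes
x <- cbind(rnorm(n, mean = 2 * label), rnorm(n, mean = 2 * label))

n.fold <- 5
fold <- sample(rep(seq_len(n.fold), length.out = n))   # random fold assignment
correct <- logical(n)

for (f in seq_len(n.fold)) {
  train <- fold != f
  # Fit on the training folds only: one centroid per class
  c0 <- colMeans(x[train & label == 0, , drop = FALSE])
  c1 <- colMeans(x[train & label == 1, , drop = FALSE])
  # Predict the held-out fold by the nearer centroid
  test.x <- x[!train, , drop = FALSE]
  d0 <- rowSums(sweep(test.x, 2, c0)^2)
  d1 <- rowSums(sweep(test.x, 2, c1)^2)
  correct[!train] <- (d1 < d0) == (label[!train] == 1)
}

cv.accuracy <- mean(correct)
cat("cross-validated accuracy:", cv.accuracy, "\n")
```

Every sample is predicted by a rule that never saw it, so the accuracy is an estimate of performance on new data rather than of fit to the training set.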
81
Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data
Tests, Estimation/Inference, Classification/Clustering
Multi-dimensional/High-dimensional Data
Random value-based approaches
Others: Experimental Designs
A bit more on high-dimensional issues
82
Small n Large p
Sample size 100
Test association between a trait and expression of A gene: n = 100, p = 1 (large n, small p)
Test association between a trait and expression of MANY genes: n = 100, p = 25000 (small n, large p)
83
n = p: then you can find the perfect regression answer
q = a x; q = 3, x = 2 → Solvable
q1 = a x1 + b y1
q2 = a x2 + b y2 → Solvable
q1 = a x1 + b y1 + c z1
q2 = a x2 + b y2 + c z2
q3 = a x3 + b y3 + c z3 → Solvable
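The three-equation case can be checked numerically. In the sketch below the predictors and the response are random numbers with no real relationship (an assumption made to stress the point); the names X, q and abc are illustrative.

```r
set.seed(1)
n <- 3

# Columns play the roles of x, y, z; rows are the three samples
X <- matrix(rnorm(n * n), nrow = n)

# Response drawn independently of the predictors
q <- rnorm(n)

# Solve q = a x + b y + c z exactly: three equations, three unknowns
abc <- solve(X, q)

# The fit is perfect even though q is unrelated to X
resid <- q - X %*% abc
cat("coefficients:", abc, "\n")
cat("largest residual:", max(abs(resid)), "\n")
```

A perfect fit with n = p therefore says nothing about whether the variables explain the trait.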
84
n << p
One set of variables gives the perfect answer.
Another set of variables gives a perfect but different answer.
Which answer is the truth?
Closer fitting is not always the best.
AIC ~ the simpler model is better
LASSO, sparse
The assumption that k << n variables should be the answer is a “prior” belief in the Bayesian sense.
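The ambiguity can be made concrete with a small sketch: 3 samples and 6 candidate variables, all simulated at random (an assumption for illustration), where two disjoint subsets of variables each reproduce the response exactly.

```r
set.seed(1)
n <- 3   # samples
p <- 6   # candidate variables

X <- matrix(rnorm(n * p), nrow = n)
q <- rnorm(n)

# First answer: variables 1 to 3
coef.a <- solve(X[, 1:3], q)
fit.a <- X[, 1:3] %*% coef.a

# Second answer: variables 4 to 6
coef.b <- solve(X[, 4:6], q)
fit.b <- X[, 4:6] %*% coef.b

# Both reproduce q exactly, with entirely different variables and coefficients
cat("max error, variables 1-3:", max(abs(fit.a - q)), "\n")
cat("max error, variables 4-6:", max(abs(fit.b - q)), "\n")
```

The data alone cannot choose between the two; an extra assumption such as sparsity or a penalty on model size is what breaks the tie.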
86
Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data Tests, Estimation/Inference, Classification/Clustering Multi-dimensional/High-dimensional Data Random value-based approaches ~ Monte-Carlo method Others : Experimental Designs
88
Resampling
Estimation based on samples: Jack-knife (subsets), Bootstrap (replacement)
Statistical significance: Permutation ~ exact probability
Cross-validation
Pseudo-random generators from computers
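Bootstrap and permutation can be sketched on a two-group comparison. The simulated groups, the difference in means as the statistic and the 2000 repetitions are hypothetical choices; the permutation p-value here is a Monte-Carlo approximation of the exact probability, since not all permutations are enumerated.

```r
set.seed(1)
# Hypothetical data: two groups of 30 observations
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1)
obs.diff <- mean(y) - mean(x)
n.rep <- 2000

# Bootstrap: resample each group with replacement to estimate an interval
boot.diff <- replicate(n.rep,
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE)))
boot.ci <- quantile(boot.diff, c(0.025, 0.975))

# Permutation: shuffle group membership to see what "no difference" looks like
pooled <- c(x, y)
perm.diff <- replicate(n.rep, {
  idx <- sample(length(pooled), length(x))
  mean(pooled[-idx]) - mean(pooled[idx])
})
perm.p <- (1 + sum(abs(perm.diff) >= abs(obs.diff))) / (1 + n.rep)

cat("observed difference:", obs.diff, "\n")
cat("bootstrap 95% interval:", boot.ci, "\n")
cat("permutation p-value:", perm.p, "\n")
```

Both methods replace a theoretical sampling distribution with one generated by the computer from the data themselves.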
91
Pseudo-random number sequences
From uniform distribution
From other known distributions
From arbitrary distributions … Gibbs sampling
Using a Gibbs sampler, based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions …
BUGS (Bayesian inference Using Gibbs Sampling)
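The mechanics of Gibbs sampling can be sketched with a textbook target, a standard bivariate normal with correlation rho. This target is an assumption chosen because its conditional distributions are known in closed form; it is not the model used in the lecture.

```r
set.seed(1)
rho <- 0.8          # correlation of the target distribution (hypothetical)
n.iter <- 5000
burn.in <- 500

x <- numeric(n.iter)
y <- numeric(n.iter)
cond.sd <- sqrt(1 - rho^2)

# Alternate between the two conditional distributions:
# x | y ~ N(rho * y, 1 - rho^2) and y | x ~ N(rho * x, 1 - rho^2)
for (i in 2:n.iter) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = cond.sd)
  y[i] <- rnorm(1, mean = rho * x[i], sd = cond.sd)
}

# Discard the early iterations, then summarize the draws
keep <- (burn.in + 1):n.iter
est.rho <- cor(x[keep], y[keep])
cat("estimated correlation:", est.rho, "\n")
plot(x[keep], y[keep], pch = 20, cex = 0.3)
```

Only one-dimensional draws are ever needed, which is what makes the approach practical for models with many parameters.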
92
Example: the fraction of red vs. green is repeatedly estimated
Based on the assumption that the red follows a non-central chi-square distribution, its non-centrality parameter value is also repeatedly estimated. Eventually both unknown parameter values are estimated.
93
Pseudo-random number sequences
From uniform distribution
From other known distributions
From arbitrary distributions … Gibbs sampling
Using a Gibbs sampler, based on a stochastic model, estimate the parameters of distributions and generate random values from the estimated distributions …
BUGS (Bayesian inference Using Gibbs Sampling)
MCMC (Markov-Chain Monte-Carlo)
With Stan (a Bayesian estimation application)
94
Pseudo-random numbers, Monte-Carlo
Computer-driven methods
95
Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data Tests, Estimation/Inference, Classification/Clustering Multi-dimensional/High-dimensional Data Random value-based approaches Others : Experimental Designs
96
Experimental designs Various data sets
Using all kinds of them, what can we state?
98
Individual analysis/interpretation is tough enough Integration of them is tougher
Construct a model/assumption to integrate multiple sets.
There are variations in how to combine: order of combination, structure of combination, …
Use raw data sets from multiple resources.
Integrate primary outputs from each data set: so-called meta-analyses.
Narrow-sense meta-analyses only combine outputs from analyses of similar data sets.
Difficulties are rooted in the heterogeneity among various data resources.
Each data resource has its own way of analysis. The variations among analyses make the integration difficult.
Then, should each analysis method be unified???
99
Some resources http://statgenet-kyotouniv.wikidot.com/MasterCourse2017
The sites linked from it would help broaden and deepen your understanding of today’s lecture.
100
Genotypes and Phenotypes ~Data Records for Statistical Analyses~
101
[Diagram: Molecular Genetics at a glance. Labels: Fertilized Egg, Birth, Childhood, Adolescent, Adult, Aging, Death, Development; Individual, Organ, Tissue, Cell, Gametes, Molecules; DNA, Genes, RNA, Proteins, Metabolites; Genome, Transcriptome, Proteome, Metabolome, Phenome, Omics; Phenotypes, Heredity, Genetic diversity, Phenotypic diversity, Diversity; Stochastics, Statistics, Big Data]
102
Genotypes and Phenotypes
Only one Spatio-Temporally Heterogeneous Chronologically and Spatially
103
DNA/Chromosome modification
DNA,Genome, Consistent Genotype Phenotype DNA/Chromosome modification Transcriptome Intermediate phenotype Protein/Proteome Phenotype/Phenome Phenotypes of Individuals Terminal phenotype
104
Time x Space of Individual
Cancer Functional Somatic Mutations Somatic Mosaicism Parents Germline Fertilized Egg Mutations
105
Diversity in Phenotypes
Easy to measure vs. Not easy to measure
Representatives vs. Distribution itself
Many but Mutually Independent vs. Mutually Correlated
106
Representatives vs. Distribution itself
Temperature: a representative of a molecular population
Independent and identically distributed variable
Observation
Good-shaped distribution → Representatives → Parametric approach
Bad-shaped distribution → Distribution itself → Non-parametric approach
One sample is a set of observations
One sample gives a distribution → Are representatives enough?
(Image: ThermoFisher Scientific)
107
Many but Mutually Independent vs. Mutually Correlated
Multiple items are mutually correlated.
Chronological data (timeline)
Shape data (space continuity)
Movement data (time x space)
Pattern data (informational axes)
(Image sources: Yokogawa Electric (横河電機); Nature 465, 918–921 (17 June 2010))
108
Estimation of parameter values
Data + Model
↓
Estimation of parameter values
110
Summary for Genotypes and Phenotypes
To start your analysis: record “Values”
“Values” take various shapes
“Simple value”: a Number
“Number”: Natural Number, Integer, Rationals, Reals, Complex, Vector, Matrix …
“Values” for analysis that are not “Numbers”:
Mathematical models
Biological phenomena have random errors: stochastic models, statistical models
Models have parameters; the “Values” of parameters are “Numbers” again
“Simple values” are values of parameters in simple models.
Complex models and their parameters can also be values for your analysis.
111
Contents of Today Genotypes and Phenotypes ~Data Records for Statistical Analyses~ Overview of Statistical Methods