Unit of Statistical Genetics, Kyoto University

Statistical Analyses of Life Science and Pathology from Genomic Perspective Unit of Statistical Genetics, Kyoto University Ryo Yamada

Contents of Today
1 Overview of Statistical Methods
2 Genotypes and Phenotypes ～Data Records for Statistical Analyses～

Goals of Today Get a perspective covering many topics.
Grasp the idea of each topic. Do not attempt to understand the details. Make a list of terms, some of which you might study further after today’s lecture. Understand that some ideas appear in multiple analysis settings. Multiple basic ideas are combined to tackle specific tasks in various settings.

Overview of Statistical Methods

Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data
Tests, Estimation/Inference, Classification/Clustering
Multi-dimensional/High-dimensional Data
Random value-based approaches
Others: Experimental Designs

Quality Control of Noisy High-Throughput Data
Systematic errors/biases: sample, reagent, and date/machine/personnel effects. How to correct or control the noise: outlier detection; transformation of all records with a function; normalization for “locational effects”; “control samples”.

Outlier detection

Transformation of all records with a function
Genomic control for GWAS; preprocessing of microarray data; median-based correction; log-transformation

Normalization for “locational effects”
Trends should be considered. Batch effects should be considered. Non-data-driven vs. data-driven approaches

Quality Control of High-Throughput Data Correction/control
Systematic errors/biases: sample, reagent, and date/machine/personnel effects. Correction/control: outlier detection; transformation of all records with a function; normalization for “locational effects”; “control samples”.

Tests, Estimation/Inference, Classification/Clustering
Tests: Significance, Error Controlling, Multiple-testing issue
Estimation/Inference: Interval, Models, Bayes
Classification/Clustering: Unsupervised Learning vs. Supervised Learning

Multiple Comparison P-value vs. Q-value

Multiple Comparison Almost all hypotheses are NULL

Small p-values are likely when testing many hypotheses
1 test: the p-value is uniform between 0 and 1. 10 tests: the minimum p should be close to 0, around 0.1. 100 tests: closer still to 0, around 0.01. …
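The rule of thumb above can be checked without simulation. A minimal sketch, assuming independent null tests: the minimum of n uniform p-values follows a Beta(1, n) distribution, whose mean is 1/(n + 1). The variable names are illustrative and not from the lecture.

```r
# Number of independent null tests
n <- c(1, 10, 100, 1000)

# Expected minimum p-value: the minimum of n uniforms is Beta(1, n), mean 1 / (n + 1)
expected.minp <- 1 / (n + 1)

# Probability that at least one p-value falls below 0.05 by chance alone
prob.any.below <- 1 - (1 - 0.05)^n

print(data.frame(n, expected.minp, prob.any.below))
```

With 100 tests, at least one p-value below 0.05 is nearly certain even though every hypothesis is null.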

Uniform distribution

2 tests How are the smaller p-values distributed? (Figure: axes p1 and p2)

Minimum p-value distribution
(Figure: histogram of min-p for 2^10 tests, with the mean marked.) Min-p can take a value considerably larger than the mean, but in many cases it is smaller than the mean. Such small values are not rare.

Minimum p-value distribution
(Figure: two panels of minimum p-value distributions for 1, 2, 4, 8, … tests)

NON-NULL, FDR (False Discovery Rate)
Many hypotheses are NON-NULL, or almost all hypotheses are NON-NULL

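P-value
# Minimum p-value among n independent null tests, repeated n.iter times.
# Sizes are reduced from the slide's n <- 2^(0:20), n.iter <- 10^5,
# which would need a matrix of about 10^11 values.
n <- 2^(0:10)
n.iter <- 10^4
minps <- matrix(0, n.iter, length(n))
for (i in 1:length(n)) {
  ps <- matrix(runif(n[i] * n.iter), ncol = n[i])
  minps[, i] <- apply(ps, 1, min)
}
boxplot(minps)
boxplot(log(minps, 10))
hist(minps[, 11], main = "2^10")   # column 11 corresponds to 2^10 tests

# Mixture of N null and n non-null hypotheses
N <- 9000
n <- 1000
K <- rchisq(N, 1)                  # null test statistics
k <- rchisq(n, 1, ncp = 6)         # non-null test statistics
hist(c(K, k), density = 15)
hist(K, density = 17, col = 3, add = TRUE)
hist(k, add = TRUE, col = 2, density = 21)

# Corresponding p-values: P is uniform, p is concentrated near 0
P <- pchisq(K, 1, lower.tail = FALSE)
p <- pchisq(k, 1, lower.tail = FALSE)
hist(c(P, p), density = 15)
hist(p, density = 21, col = 2)
hist(P, density = 17, add = TRUE, col = 3)

# Sorted p-values, colored by origin (1 = null, 2 = non-null)
Pp <- c(P, p)
col <- c(rep(1, N), rep(2, n))
ord <- order(Pp)
plot(Pp[ord], col = col[ord], pch = 20, cex = 0.1, type = "h")
plot(Pp[ord], col = col[ord], pch = 20, cex = 0.1, type = "h",
     ylim = c(0, 0.1), xlim = c(0, 3 * 10^3))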

Combination of two distributions
Uniform p-values Small p-values

Pick the smaller p-values. The threshold should change with the rank of each p-value. In this way the fraction of “true positives” among the picked hypotheses is controlled (equivalently, the false discovery rate).
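One standard way to use a rank-dependent threshold is the Benjamini-Hochberg step-up procedure, sketched here on a simulated mixture of 9000 null and 1000 non-null hypotheses. The lecture does not name the procedure, so treat this choice, the level alpha = 0.05, and the variable names as assumptions.

```r
set.seed(1)
N <- 9000                                                   # null hypotheses
n <- 1000                                                   # non-null hypotheses
P <- pchisq(rchisq(N, 1), 1, lower.tail = FALSE)            # uniform p-values
p <- pchisq(rchisq(n, 1, ncp = 6), 1, lower.tail = FALSE)   # small p-values
Pp <- c(P, p)
truly.null <- c(rep(TRUE, N), rep(FALSE, n))
m <- length(Pp)
alpha <- 0.05                                               # assumed FDR level

# Benjamini-Hochberg: compare the i-th smallest p-value with alpha * i / m
ord <- order(Pp)
below <- which(Pp[ord] <= alpha * seq_len(m) / m)
n.reject <- if (length(below) > 0) max(below) else 0
rejected <- rep(FALSE, m)
if (n.reject > 0) rejected[ord[seq_len(n.reject)]] <- TRUE

# The same decisions from the built-in BH-adjusted p-values
q <- p.adjust(Pp, method = "BH")

# Realized fraction of false discoveries among the rejected hypotheses
fdp <- sum(rejected & truly.null) / max(1, sum(rejected))
print(c(rejected = sum(rejected), fdp = fdp))
```

The manual step-up rule and `p.adjust(..., method = "BH")` should give the same rejections; the manual version is shown only to make the rank-dependent threshold explicit.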

Large-scale inference
When you observe many things at once, their distribution is informative. The estimate for each observation that uses this information differs from the estimate that does not. The q-value of FDR is one such estimate. Use the information in the distribution when many are observed together: empirical Bayes.

Estimation/Inference
Models, Parameters, Interval, Bayes. Uniform p-values ＋ small p-values: assuming a mixture of two distributions is itself a model.
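Once the mixture is treated as a model, its parameters can be estimated from the p-values themselves. A minimal sketch, assuming that p-values above a cutoff `lambda` come almost entirely from the uniform (null) component; this is a Storey-type estimate of the null fraction, which the lecture does not prescribe, and the cutoff 0.5 is an arbitrary choice.

```r
set.seed(2)
N <- 9000                                   # null hypotheses (true null fraction 0.9)
n <- 1000                                   # non-null hypotheses
Pp <- c(pchisq(rchisq(N, 1), 1, lower.tail = FALSE),
        pchisq(rchisq(n, 1, ncp = 6), 1, lower.tail = FALSE))

# Above lambda the density is roughly flat at height pi0,
# so the fraction of p-values above lambda is about pi0 * (1 - lambda)
lambda <- 0.5
pi0.hat <- mean(Pp > lambda) / (1 - lambda)
print(pi0.hat)
```

The estimate is slightly conservative because a few non-null p-values also exceed the cutoff.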

Samples　→　Point estimates, Interval estimates Sample distribution, Theoretical estimates, unbiased estimates,…

Samples → Point estimates, Interval estimates. Sample distribution, Theoretical estimates, unbiased estimates,… The statement “The star’s weight is between a and b” will be right 9 times out of 10.

Samples → Point estimates, Interval estimates. Sample distribution, Theoretical estimates, unbiased estimates,… Frequentist: the statement “The star’s weight is between a and b” will be right 9 times out of 10.
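The “right 9 times out of 10” reading of an interval estimate can be checked by simulation. A minimal sketch, assuming normally distributed measurements and a 90% t-based confidence interval; the true value, sample size, and number of repetitions are made up for illustration.

```r
set.seed(3)
true.mean <- 10      # hypothetical true value (the "star's weight")
n.obs <- 20          # measurements per experiment
n.rep <- 2000        # number of repeated experiments

# For each repeated experiment, check whether the 90% interval contains the truth
covered <- replicate(n.rep, {
  x <- rnorm(n.obs, mean = true.mean, sd = 2)
  ci <- t.test(x, conf.level = 0.9)$conf.int
  ci[1] <= true.mean && true.mean <= ci[2]
})
coverage <- mean(covered)
print(coverage)
```

The interval changes from experiment to experiment while the true value stays fixed; the 90% describes how often the procedure succeeds.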

Frequentists vs. Bayesians Frequentist approaches are difficult for students who are not strong in mathematics, and the reasoning behind them is not easy to follow. Bayesian reasoning, by contrast, tends to be easier for many people to follow.

Bayesian A model has parameter(s). Data ＋ Model → Estimation of parameter values. Likelihood-based: maximum-likelihood estimates; interval estimates based on likelihood

Frequentists vs. Bayesians Use both rather than choosing one of them; that is the way of the 21st century. Bayesian approaches seem to be used more and more, because: Models have become more complicated. Computer assistance ・・・ complicated distributions can be handled by simulation. Large-scale data ・・・ empirical Bayes approaches can be applied.

Quality Control of Noisy High-Throughput Data; Tests, Estimation/Inference, Classification/Clustering; Multi-dimensional/High-dimensional Data; Random value-based approaches; Others: Experimental Designs. Estimation/Inference Frequentists vs. Bayesians Use both rather than choosing one of them; that is the way of the 21st century. Bayesian approaches seem to be used more and more, because: Models have become more complicated. Computer assistance ・・・ complicated distributions can be handled by simulation. Large-scale data ・・・ empirical Bayes approaches can be applied.

Frequentists　vs. 　Bayesians “Prior” distribution is necessary What is the “appropriate prior”?

Success rate：No information at all
Somebody you don’t know at all will take an exam on which you have no information at all. How likely do you think (s)he will pass it?

Success rate：No information at all
Somebody you don’t know at all will take an exam on which you have no information at all. How likely do you think (s)he will pass it? Jeffreys prior: one of the non-subjective priors
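For a success rate, the Jeffreys prior is the Beta(1/2, 1/2) distribution, and observing successes and failures simply adds to its two parameters. A minimal sketch of the prior and its update; the observed counts (7 passes, 3 failures) are hypothetical.

```r
# Jeffreys prior for a success probability: Beta(1/2, 1/2)
a0 <- 0.5
b0 <- 0.5
prior.mean <- a0 / (a0 + b0)          # 0.5 before seeing any data

# Hypothetical observations: 7 passes and 3 failures
n.pass <- 7
n.fail <- 3

# Conjugate update: the posterior is Beta(a0 + passes, b0 + failures)
a1 <- a0 + n.pass
b1 <- b0 + n.fail
post.mean <- a1 / (a1 + b1)

# 90% credible interval for the success rate
cred.int <- qbeta(c(0.05, 0.95), a1, b1)
print(c(prior.mean = prior.mean, post.mean = post.mean))
print(cred.int)
```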

Frequentists vs. Bayesians Use both rather than choosing one of them; that is the way of the 21st century. Large-scale inference: the prior can be set based on the data set ~ empirical Bayes

Classification/Clustering
Multi-dimension/High-dimension, first.

Multi-dimensional/High-dimensional Data
There is no way to visualize high-dimensional data directly. It is almost impossible for us to understand high-dimensional data as they are.

How many dimensions can we handle? 2D space or 3D space Extra dimensions Gray/Color scale Arrows Time

Dimension reduction Pick 2 or 3 seemingly important dimensions; visualization then becomes easy and we feel we can understand the data.

Dimension reduction Pick 2 or 3 seemingly important dimensions; visualization then becomes easy and we feel we can understand the data. PCA (Principal Component Analysis)
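A minimal PCA sketch using base R's `prcomp`, on simulated data in which ten observed features are driven by only two hidden factors. The data-generating setup is an assumption made for illustration.

```r
set.seed(4)
n.samples <- 200

# Two hidden factors drive ten observed, mutually correlated features
latent <- matrix(rnorm(n.samples * 2), ncol = 2)
loadings <- matrix(rnorm(2 * 10), nrow = 2)
X <- latent %*% loadings + matrix(rnorm(n.samples * 10, sd = 0.1), ncol = 10)

# Principal component analysis on the centered data
pca <- prcomp(X, center = TRUE, scale. = FALSE)

# Fraction of the total variance carried by each component
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained, 3))

# Coordinates of the samples on the first two components
scores <- pca$x[, 1:2]
```

Passing `scores` to `plot()` gives the two-dimensional picture; here the first two components should carry almost all of the variance because the data were built from two factors.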

Dimension reduction Pick 2 or 3 seemingly important dimensions; visualization then becomes easy and we feel we can understand the data. Only a few dims are truly meaningful and all the others are noise. Pick the true dims.

Dimension reduction Pick 2 or 3 seemingly important dimensions; visualization then becomes easy and we feel we can understand the data. Only a few dims are truly meaningful and all the others are noise. Pick the true dims. LASSO, compressed sensing

The space is high-dimensional but the data are low-dimensional. Manifold learning: put the data into a higher-dimensional space and pull them back to a low-dimensional space.

High-dimensionality Many genes, many biomarkers, many features

Life-science data are high-dimensional The number of observed items is huge, but the items are strongly correlated with one another, so their effective dimension is much smaller. Examples: FACS, ethnic diversity

Objects with low dimensions in higher dimensional space Topology

Objects with low dimensions in higher dimensional space Topology Graph, network and topology

Graph: itemize and connect items that are related. Only pairwise relations are considered.

Graph: itemize and connect items that are related. Only pairwise relations are considered; three-way or higher-order relations are ignored.

Graph and its matrix representation and linear algebra

Graph and its matrix representation and linear algebra Graph tends to be sparse … Sparse analysis

Two important features: no “common” (typical) individuals; sparsity

High-dimensionality No commons
Central area: a sphere inscribed in a cube. In two dimensions (a circle in a square) the central fraction is 3.14 / 4 = 0.785, and it shrinks as the dimension grows.
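The 0.785 figure is the two-dimensional case of a general formula: a ball of radius 1 in d dimensions has volume π^(d/2) / Γ(d/2 + 1), and the enclosing cube has volume 2^d. A short sketch of how the central fraction changes with dimension (the range 1 to 10 is an arbitrary choice):

```r
# Dimensions to examine
d <- 1:10

# Volume of the unit-radius ball divided by the volume of the enclosing cube (side 2)
ball.fraction <- pi^(d / 2) / gamma(d / 2 + 1) / 2^d

print(data.frame(d, ball.fraction = round(ball.fraction, 4)))
```

In 10 dimensions the central ball holds well under 1% of the cube, which is one way to read “no commons”: almost every point lies far from the center.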

High-dimensionality Sparse
To estimate density, you need a reasonable number of samples per small cube, but the fraction of the space covered by a cube of side 0.1 is… Dim = 1 : 0.1 Dim = 2 : 0.01 Dim = 3 : 0.001 …. Dim = 6 :
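The arithmetic behind this slide can be sketched as follows, assuming cells of side 0.1 in a unit cube and, hypothetically, 1000 samples and a target of about 10 samples per cell.

```r
# Dimensions to examine
d <- 1:6

# Fraction of the unit cube covered by one small cube of side 0.1
cell.fraction <- 0.1^d

# Expected number of samples per small cube with a hypothetical 1000 samples
n.samples <- 1000
expected.per.cell <- n.samples * cell.fraction

# Samples needed to expect about 10 per small cube
samples.needed <- 10 / cell.fraction

print(data.frame(d, cell.fraction, expected.per.cell, samples.needed))
```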

High-dimensionality The space is vast, yet the data are reasonably dense.
The distribution must therefore be low-dimensional.

Life-science data are high-dimensional The number of observed items is huge, but the items are strongly correlated with one another, so their effective dimension is much smaller. Examples: FACS, ethnic diversity

Low dimensional distribution in higher dimensional space and its local density
Regular density estimation methods do not work: small cubes are still spacious in high-dimensional space. How to estimate local density: the k-nearest neighbor method. In graph theory, a similar idea is applicable: the minimum spanning tree.
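A minimal sketch of the k-nearest-neighbor idea: the distance from each point to its k-th nearest neighbor is small where the data are locally dense and large where they are sparse. The simulated curve, the added isolated point, and k = 10 are all assumptions for illustration.

```r
set.seed(5)
n.pts <- 500

# A one-dimensional curve embedded in five-dimensional space, with small noise
u <- runif(n.pts)
X <- cbind(u, u^2, sin(2 * pi * u), cos(2 * pi * u), 0) +
  matrix(rnorm(n.pts * 5, sd = 0.01), ncol = 5)

# Add one isolated point far from the curve
X.all <- rbind(X, rep(5, 5))

# Pairwise Euclidean distances
D <- as.matrix(dist(X.all))

# Distance to the k-th nearest neighbor; position k + 1 after sorting,
# because the smallest entry in each row is the zero distance to itself
k <- 10
r.k <- apply(D, 1, function(row) sort(row)[k + 1])

# A large r.k means low local density
most.isolated <- which.max(r.k)
print(most.isolated)
```

No grid of small cubes is needed, so the approach adapts to wherever the data actually lie.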

Sparse in highly dimensional space

How sparse? One-dimensional manifolds But significant variance

How sparse? One-dimensional manifolds But significant variance Clustering

Two types of clustering methods
Hierarchical Non-hierarchical

Hierarchical Tree structure --- Graph, again
Its structure carries information. Its structure is related to dimension. Distance is defined on the tree. Some phenomena have reasons to be analyzed hierarchically.
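A minimal sketch of hierarchical clustering with base R, showing the tree, the groups obtained by cutting it, and the distance defined on the tree (the cophenetic distance). The two simulated groups and the average-linkage method are assumptions.

```r
set.seed(6)

# Two well-separated groups of 20 points each in two dimensions
X <- rbind(matrix(rnorm(20 * 2, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20 * 2, mean = 3, sd = 0.3), ncol = 2))

# Build the tree from pairwise distances
d <- dist(X)
tree <- hclust(d, method = "average")

# Cut the tree into two groups
groups <- cutree(tree, k = 2)

# Distance defined on the tree: height at which two items are first merged
tree.dist <- cophenetic(tree)

# How well the tree distances reproduce the original distances
agreement <- cor(d, tree.dist)
print(table(groups))
print(agreement)
```

Calling `plot(tree)` draws the dendrogram.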

Classification Separate things that are difficult to segregate.
J. Med. Imag. 1(3), (Oct 09, 2014). doi: /1.JMI

Unsupervised Learning Supervised Learning

Unsupervised Learning vs. Supervised Learning No teacher, but we want to check whether the classification criteria are reliable. Cross-validation: one of the resampling methods
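Cross-validation can be sketched as follows: split the records into k folds, fit on k − 1 folds, and measure accuracy on the held-out fold. This is a minimal illustration with simulated data and a logistic-regression classifier; neither is specified in the lecture.

```r
set.seed(7)
n.obs <- 200

# Simulated data: one feature x and a binary label y that depends on x
x <- rnorm(n.obs)
y <- rbinom(n.obs, 1, plogis(2 * x))
dat <- data.frame(x, y)

# Randomly assign each record to one of k folds
k <- 5
fold <- sample(rep(1:k, length.out = n.obs))

# Fit on k - 1 folds, evaluate on the held-out fold
acc <- numeric(k)
for (i in 1:k) {
  train.set <- dat[fold != i, ]
  test.set <- dat[fold == i, ]
  fit <- glm(y ~ x, family = binomial, data = train.set)
  pred <- as.integer(predict(fit, newdata = test.set, type = "response") > 0.5)
  acc[i] <- mean(pred == test.set$y)
}
cv.accuracy <- mean(acc)
print(cv.accuracy)
```

Because every record is used for evaluation exactly once and never by the model that was trained on it, the averaged accuracy is a fairer estimate than accuracy on the training data.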

Quality Control of Noisy High-Throughput Data; Tests, Estimation/Inference, Classification/Clustering; Multi-dimensional/High-dimensional Data; Random value-based approaches; Others: Experimental Designs. A bit more on the high-dimension issue

Small n Large p Sample size 100
Test association between a trait and expression of ONE gene: n = 100, p = 1 (large n, small p).
Test association between a trait and expression of MANY genes: n = 100, p = 25000 (small n, large p).

When n = p, you can find a perfect regression answer
1 equation, 1 unknown: q = a x; q = 3, x = 2 → Solvable
2 equations, 2 unknowns: q1 = a x1 + b y1, q2 = a x2 + b y2 → Solvable
3 equations, 3 unknowns: q1 = a x1 + b y1 + c z1, q2 = a x2 + b y2 + c z2, q3 = a x3 + b y3 + c z3 → Solvable
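The three-equation case can be sketched directly as a linear system, with n = 3 observations and p = 3 coefficients. The numbers in `X` and `coef.true` are made up for illustration.

```r
# Three observations (rows) of three explanatory variables x, y, z (columns)
X <- matrix(c(1, 2, 0,
              0, 1, 3,
              2, 0, 1), nrow = 3, byrow = TRUE)

# Hypothetical coefficients a, b, c and the responses q they generate
coef.true <- c(a = 2, b = -1, c = 0.5)
q <- as.vector(X %*% coef.true)

# With as many equations as unknowns, the system is solved exactly
coef.hat <- solve(X, q)
print(coef.hat)
```

The solution reproduces `q` with zero error regardless of whether the variables have any real relation to the trait, which is why a perfect fit at n = p carries no evidence by itself.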

n << p One set of variables gives a perfect answer.
Another set of variables gives a perfect but different answer. Which answer is the truth? A closer fit is not always the best. AIC ～ a simpler model is better. LASSO, sparsity: the assumption that k << n variables should be the answer is a “prior” belief in the Bayesian sense.
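A small simulation of this point, under the assumption that both the trait and all variables are pure noise: with 5 samples and 50 variables, any 5 variables reproduce the trait exactly, and different subsets give different “answers”.

```r
set.seed(8)
n.obs <- 5     # samples
p <- 50        # variables

# All variables and the trait are independent noise: there is no true signal
X <- matrix(rnorm(n.obs * p), nrow = n.obs)
q <- rnorm(n.obs)

# Two disjoint sets of five variables
set.A <- 1:5
set.B <- 6:10

# Each set alone fits the trait exactly
coef.A <- solve(X[, set.A], q)
coef.B <- solve(X[, set.B], q)
resid.A <- max(abs(X[, set.A] %*% coef.A - q))
resid.B <- max(abs(X[, set.B] %*% coef.B - q))
print(c(resid.A = resid.A, resid.B = resid.B))
```

Both residuals are zero up to rounding error, so goodness of fit alone cannot choose between the two sets; some additional preference, such as a penalty for model size, is needed.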

Quality Control of Noisy High-Throughput Data Tests, Estimation/Inference, Classification/Clustering Multi-dimensional/High-dimensional Data Random value-based approaches　～　Monte-Carlo method Others : Experimental Designs

Resampling Estimation based on samples Statistical significance
Estimation based on samples: Jackknife (subsets), Bootstrap (resampling with replacement). Statistical significance: Permutation ～ exact probability. Cross-validation.

Resampling Estimation based on samples Statistical significance
Estimation based on samples: Jackknife (subsets), Bootstrap (resampling with replacement). Statistical significance: Permutation ～ exact probability. Cross-validation. Pseudo-random generators from computers.
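A minimal sketch of two of these resampling methods on simulated data: a bootstrap percentile interval for a difference in means, and a Monte Carlo permutation test that approximates the exact permutation probability. The sample sizes, the effect size, and the number of resamples are assumptions.

```r
set.seed(9)

# Two simulated groups with a true difference in means of 1
x <- rnorm(30, mean = 0)
y <- rnorm(30, mean = 1)
obs.diff <- mean(y) - mean(x)
n.iter <- 2000

# Bootstrap: resample each group with replacement, recompute the difference
boot.diff <- replicate(n.iter,
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE)))
boot.ci <- unname(quantile(boot.diff, c(0.025, 0.975)))

# Permutation: shuffle group labels, recompute the difference under the null
pooled <- c(x, y)
perm.diff <- replicate(n.iter, {
  idx <- sample(length(pooled), length(x))
  mean(pooled[-idx]) - mean(pooled[idx])
})

# Two-sided p-value; the +1 terms count the observed labeling itself
perm.p <- (sum(abs(perm.diff) >= abs(obs.diff)) + 1) / (n.iter + 1)

print(boot.ci)
print(perm.p)
```

Enumerating every possible relabeling would give the exact probability; random shuffles approximate it when the number of relabelings is too large.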

Pseudo-random number sequences
From uniform distribution From other known distributions

Pseudo-random number sequences
From uniform distribution From other known distributions From arbitrary distributions … Gibbs sampling

Pseudo-random number sequences
From uniform distribution From other known distributions From arbitrary distributions … Gibbs sampling Using a Gibbs sampler, based on a stochastic model, estimate parameters of distributions and generate random values from the estimated distributions… BUGS (Bayesian inference Using Gibbs Sampling)
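To show the mechanics of Gibbs sampling in the simplest setting, here is a sketch that draws from a bivariate normal distribution with correlation `rho` by alternately sampling each coordinate from its conditional distribution given the other. This toy target is an assumption chosen for clarity; it is not the lecture's example.

```r
set.seed(10)
rho <- 0.8          # correlation of the target bivariate normal
n.iter <- 5000      # number of Gibbs iterations
burn.in <- 500      # initial iterations to discard

x <- numeric(n.iter)
y <- numeric(n.iter)

# For a standard bivariate normal, x given y is Normal(rho * y, 1 - rho^2),
# and y given x is Normal(rho * x, 1 - rho^2)
for (i in 2:n.iter) {
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i], sd = sqrt(1 - rho^2))
}

# Use the draws after burn-in as a sample from the joint distribution
keep <- (burn.in + 1):n.iter
sample.cor <- cor(x[keep], y[keep])
print(sample.cor)
```

Only the conditional distributions are needed, which is what makes the method usable when the joint distribution is hard to sample directly.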

Example Fraction of red vs. green is repeatedly estimated
This assumes that the red component follows a non-central chi-square distribution, whose non-centrality parameter is also repeatedly estimated. Eventually both unknown parameter values are estimated.

Pseudo-random number sequences
From uniform distribution From other known distributions From arbitrary distributions … Gibbs sampling Using a Gibbs sampler, based on a stochastic model, estimate parameters of distributions and generate random values from the estimated distributions… BUGS (Bayesian inference Using Gibbs Sampling) MCMC (Markov-Chain Monte-Carlo) with Stan (a Bayesian estimation application)

Pseudo-random numbers, Monte-Carlo
Computer-driven methods

Experimental designs Various data sets
Using all kinds of them, what can we state?

Individual analysis/interpretation is tough enough; integration of them is tougher
Construct a model/assumption to integrate multiple data sets. There are variations in how to combine them: order of combination, structure of combination, …. One option is to use raw data sets from multiple resources. Another is to integrate the primary outputs from each data set: so-called meta-analysis. Narrow-sense meta-analyses only combine outputs from analyses of similar data sets. The difficulties are rooted in the heterogeneity among data resources. Each data resource has its own way of analysis, and the variation among analyses makes integration difficult. Should each analysis method then be unified???
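As one concrete form of narrow-sense meta-analysis, here is a sketch of fixed-effect inverse-variance pooling with Cochran's Q as a simple heterogeneity check. The three effect estimates and standard errors are hypothetical, and the lecture does not commit to this particular method.

```r
# Hypothetical effect estimates and standard errors from three similar studies
est <- c(0.30, 0.10, 0.25)
se <- c(0.10, 0.15, 0.08)

# Inverse-variance weights: more precise studies count more
w <- 1 / se^2
pooled <- sum(w * est) / sum(w)
pooled.se <- sqrt(1 / sum(w))

# Cochran's Q: do the studies differ more than their standard errors allow?
Q <- sum(w * (est - pooled)^2)
het.p <- pchisq(Q, df = length(est) - 1, lower.tail = FALSE)

print(c(pooled = pooled, pooled.se = pooled.se, Q = Q, het.p = het.p))
```

A small heterogeneity p-value would suggest that a single shared effect is a poor model for the studies.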

Some resources http://statgenet-kyotouniv.wikidot.com/MasterCourse2017
Its linked sites should help broaden and deepen your understanding of today’s lecture.

Genotypes and Phenotypes ～Data Records for Statistical Analyses～

Molecular Genetics at a glance (Figure: diagram relating the life course (fertilized egg, birth, childhood, adolescence, adulthood, aging, death), biological levels (molecules, cells, tissues, organs, individual), omics layers (genome, transcriptome, proteome, metabolome, phenome), and genetic and phenotypic diversity, heredity, stochastics, statistics, and big data)

Genotypes and Phenotypes
Genotype: only one. Phenotypes: spatio-temporally heterogeneous (chronologically and spatially).

DNA/Chromosome modification
DNA,Genome, Consistent Genotype Phenotype DNA/Chromosome modification Transcriptome Intermediate phenotype Protein/Proteome Phenotype/Phenome Phenotypes of Individuals Terminal phenotype

Time x Space of Individual
Cancer Functional Somatic Mutations Somatic Mosaicism Parents Germline Fertilized Egg Mutations

Diversity in Phenotypes
Easy to measure vs. Not easy to measure
Representatives vs. Distribution itself
Many but Mutually Independent vs. Mutually Correlated

Representatives vs. Distribution itself
Temperature: a representative of a molecular population. Independent and identically distributed variable; observation. Well-shaped distribution → representatives → parametric approach. Badly shaped distribution → distribution itself → non-parametric approach. One sample is a set of observations. One sample gives a distribution → are representatives enough? ThermoFisher Scientific

Many but Mutually Independent vs. Mutually Correlated
Multiple items are mutually correlated. Chronological data (time-line), Shape data (spatial continuity), Movement data (time x space), Pattern data (informational axes). Yokogawa Electric; Nature 465, 918–921 (17 June 2010)

Estimation of parameter values
Data ＋ Model → Estimation of parameter values

Summary for Genotypes and Phenotypes
Data ＋ Model → Estimation of parameter values. To start your analysis, record “values”. “Values” take various shapes. “Simple value”: a number. “Number”: natural number, integer, rational, real, complex, vector, matrix, … “Values” for analysis that are not “numbers”: mathematical models. Biological phenomena have random errors: stochastic models, statistical models. Models have parameters, so the “values” of parameters are “numbers” again. “Simple values” are values of parameters in simple models. Complex models and their parameters can also be values for your analysis.

Summary for Genotypes and Phenotypes
To start your analysis, record “values”. “Values” take various shapes. “Simple value”: a number. “Number”: natural number, integer, rational, real, complex, vector, matrix, … “Values” for analysis that are not “numbers”: mathematical models. Biological phenomena have random errors: stochastic models, statistical models. Models have parameters, so the “values” of parameters are “numbers” again. “Simple values” are values of parameters in simple models. Complex models and their parameters can also be values for your analysis.

Contents of Today Genotypes and Phenotypes ～Data Records for Statistical Analyses～ Overview of Statistical Methods
