Unit of Statistical Genetics, Kyoto University


1 Unit of Statistical Genetics, Kyoto University
Statistical Analyses of Life Science and Pathology from Genomic Perspective
Unit of Statistical Genetics, Kyoto University
Ryo Yamada

2 Contents of Today
1. Overview of Statistical Methods
2. Genotypes and Phenotypes ~Data Records for Statistical Analyses~

3 Goals of Today
Get a perspective covering many topics.
Grasp the idea of each topic. Do not attempt to understand the details.
Make a list of terms, some of which you might study further after today’s lecture.
Understand that some ideas appear in multiple analysis settings.
Multiple basic ideas are combined to handle specific tasks in various settings.

4 Overview of Statistical Methods

5 Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data
Tests, Estimation/Inference, Classification/Clustering
Multi-dimensional/High-dimensional Data
Random value-based approaches
Others: Experimental Designs

7 Quality Control of Noisy High-Throughput Data
Systematic errors/biases: samples, reagents, date/machine/personnel effects
How to correct or control the noise:
Outlier detection
Transformation of all records with a function
Normalization for “locational effects”
“Control samples”

8 Outlier detection

9 Transformation of all records with a function
Genomic control for GWAS
Preprocessing Microarray Data
Median-based correction
Log-transformation

10 Normalization for “locational effects”
Tendency should be considered. Batch effects should be considered. Non-data-driven Data-driven

11 Quality Control of High-Throughput Data
Systematic errors/biases: samples, reagents, date/machine/personnel effects
Correction/control:
Outlier detection
Transformation of all records with a function
Normalization for “locational effects”
“Control samples”

13 Tests, Estimation/Inference, Classification/Clustering
Tests: Significance, Error Controlling, Multiple-testing issue
Estimation/Inference: Interval, Models, Bayes
Classification/Clustering: Unsupervised Learning vs. Supervised Learning

15 Multiple Comparison P-value vs. Q-value

16 Multiple Comparison Almost all hypotheses are NULL

17 Small p-values are likely when testing many hypotheses
1 test: Uniform between 0 and 1
10 tests: Minimum p should be close to 0, or around 0.1
100 tests: Closer to 0, or around 0.01

18 Uniform distribution

19 2 tests: How are the smaller p-values distributed?
[Figure: axes p1 and p2]

20 Minimum p-value distribution
2^10 tests (the mean of the minimum p-value is marked). The minimum p-value can be considerably larger than its mean, but in many cases it is smaller than the mean. Such small values are not rare.

21 Minimum p-value distribution
[Figure: two panels of minimum p-value distributions; number of tests on the axis: 1,2,4,8,… ^6]

22 NON-NULL, FDR (False Discovery Rate)
Many hypotheses are NON-NULL, or almost all hypotheses are NON-NULL

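23 P-value
# Distribution of the minimum p-value among n independent null tests.
# The slide used n <- 2^(0:20) and n.iter <- 10^5, which needs a matrix of
# about 10^11 values for the largest n; smaller settings are used here so
# that the code runs on an ordinary machine.
n <- 2^(0:10)
n.iter <- 10^4
minps <- matrix(0, n.iter, length(n))
for (i in 1:length(n)) {
  ps <- matrix(runif(n[i] * n.iter), ncol = n[i])  # one row per iteration
  minps[, i] <- apply(ps, 1, min)                  # minimum p-value per row
}
boxplot(minps)
boxplot(log(minps, 10))
hist(minps[, 11], main = "2^10")

# Mixture of N null and n non-null chi-square statistics (1 degree of freedom)
N <- 9000
n <- 1000
K <- rchisq(N, 1)              # null
k <- rchisq(n, 1, ncp = 6)     # non-null (non-central)
hist(c(K, k), density = 15)
hist(K, density = 17, col = 3, add = TRUE)
hist(k, add = TRUE, col = 2, density = 21)

# The corresponding p-values
P <- pchisq(K, 1, lower.tail = FALSE)
p <- pchisq(k, 1, lower.tail = FALSE)
hist(c(P, p), density = 15)
hist(p, density = 21, col = 2)
hist(P, density = 17, add = TRUE, col = 3)

# Sorted p-values, colored by null (1) or non-null (2)
Pp <- c(P, p)
col <- c(rep(1, N), rep(2, n))
ord <- order(Pp)
plot(Pp[ord], col = col[ord], pch = 20, cex = 0.1, type = "h")
plot(Pp[ord], col = col[ord], pch = 20, cex = 0.1, type = "h",
     ylim = c(0, 0.1), xlim = c(0, 3 * 10^3))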

24 Combination of two distributions
Uniform p-values Small p-values

28 Pick smaller p-values. The threshold should change with the rank of each p-value, so that the fraction of “true positives” among the picked hypotheses is controlled.
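
A rank-dependent threshold of this kind can be sketched with the Benjamini-Hochberg procedure. The sketch below reuses the null/non-null mixture from slide 23; the 5% level, the seed, and the variable names are illustrative assumptions, not taken from the slides.

```r
set.seed(1)
N <- 9000                                   # null hypotheses (as on slide 23)
n <- 1000                                   # non-null hypotheses (as on slide 23)
P <- pchisq(rchisq(N, 1), 1, lower.tail = FALSE)
p <- pchisq(rchisq(n, 1, ncp = 6), 1, lower.tail = FALSE)
Pp <- c(P, p)
null.flag <- c(rep(TRUE, N), rep(FALSE, n))

alpha <- 0.05                               # assumed FDR level
m <- length(Pp)
ord <- order(Pp)
# Benjamini-Hochberg: compare the i-th smallest p-value with alpha * i / m
below <- Pp[ord] <= alpha * seq_len(m) / m
k <- if (any(below)) max(which(below)) else 0
selected <- rep(FALSE, m)
if (k > 0) selected[ord[1:k]] <- TRUE       # pick the k smallest p-values

# The same selection through R's built-in adjusted p-values
selected.builtin <- p.adjust(Pp, method = "BH") <= alpha

# Fraction of picked hypotheses that are actually null (known only in simulation)
fdp <- sum(selected & null.flag) / max(1, sum(selected))
```

In real data `null.flag` is unknown, so `fdp` cannot be computed; the procedure only guarantees that its expectation stays at or below `alpha` under its assumptions.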

29 Large-scale inference
When you observe many items at once, their distribution is informative.
Estimates for each observation that use this information differ from estimates that do not use it.
The q-value of FDR is one such estimate.
Use the information in the distribution when many items are observed together: Empirical Bayes

31 Estimation/Inference
Models, Parameters, Interval, Bayes
Uniform p-values + Small p-values
Assuming a mixture of two distributions: this is a model.

32 Estimation/Inference
Samples → Point estimates, Interval estimates Sample distribution, Theoretical estimates, unbiased estimates,…

33 Estimation/Inference
Samples → Point estimates, Interval estimates
Sample distribution, Theoretical estimates, unbiased estimates, …
The statement “The star’s weight is between a and b” will be right 9 times out of 10.

34 Estimation/Inference
Samples → Point estimates, Interval estimates
Sample distribution, Theoretical estimates, unbiased estimates, …
Frequentist: The statement “The star’s weight is between a and b” will be right 9 times out of 10.
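
The “right 9 times out of 10” reading of an interval estimate can be checked by simulation. This is a minimal sketch with assumed values (true weight 5, noise SD 2, 20 measurements per experiment); none of these numbers come from the slides.

```r
set.seed(1)
true.weight <- 5        # the unknown quantity (assumed for the simulation)
n.obs <- 20             # measurements per experiment (assumed)
n.rep <- 2000           # number of repeated experiments
covered <- logical(n.rep)
for (i in seq_len(n.rep)) {
  x <- rnorm(n.obs, mean = true.weight, sd = 2)
  ci <- t.test(x, conf.level = 0.9)$conf.int      # the interval (a, b)
  covered[i] <- ci[1] <= true.weight && true.weight <= ci[2]
}
coverage <- mean(covered)   # fraction of experiments whose interval was right
```

The interval changes from experiment to experiment while the true value stays fixed; `coverage` should come out close to 0.9.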

35 Estimation/Inference
Frequentists vs. Bayesians
Frequentist approaches are difficult for students who are not good at mathematics, and their reasoning is not easy to follow.
Bayesian reasoning, in contrast, tends to be easy for many to follow.

36 Estimation/Inference
Bayesian
A model has parameter(s)
Data + Model → Estimation of parameter values
Likelihood-based: maximum-likelihood estimates; interval estimates based on likelihood
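
As a small illustration of “Data + Model → Estimation of parameter values”, the sketch below fits a binomial model to hypothetical data (7 successes in 20 trials, an assumption made only for this example) and derives a likelihood-based interval.

```r
k <- 7; n <- 20                        # hypothetical data: 7 successes in 20 trials
loglik <- function(theta) dbinom(k, n, theta, log = TRUE)

# Maximum-likelihood estimate: the parameter value that maximizes the likelihood
fit <- optimize(loglik, interval = c(0, 1), maximum = TRUE)
mle <- fit$maximum                     # numerically close to k / n

# Likelihood-based 95% interval: parameter values whose log-likelihood lies
# within qchisq(0.95, 1) / 2 of the maximum
cutoff <- fit$objective - qchisq(0.95, 1) / 2
g <- function(theta) loglik(theta) - cutoff
lower <- uniroot(g, c(1e-6, mle))$root
upper <- uniroot(g, c(mle, 1 - 1e-6))$root
```

The model here has a single parameter `theta`; richer models follow the same pattern with a multi-parameter optimizer such as `optim`.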

37 Estimation/Inference
Frequentists vs. Bayesians
Use both rather than selecting one of them; this is the way of the 21st century.
Bayesian approaches seem to be used more and more, because:
Models have become more complicated.
Computer assistance: complicated distributions can be handled by simulation.
Large-scale data: Empirical Bayes approaches can be applied.

42 Estimation/Inference
Frequentists vs. Bayesians
A “prior” distribution is necessary.
What is the “appropriate prior”?

43 Success rate: No information at all
Somebody you don’t know at all will take an exam on which you have no information at all. How likely do you think (s)he will pass it?

44 Success rate: No information at all
Somebody you don’t know at all will take an exam on which you have no information at all. How likely do you think (s)he will pass it?
Jeffreys prior: one of the non-subjective priors
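
For a success probability, the Jeffreys prior is the Beta(1/2, 1/2) distribution. The sketch below shows the prior guess and how it would be updated; the observed counts (3 passes, 1 failure) are hypothetical and only illustrate the update.

```r
# Jeffreys prior for a success probability: Beta(1/2, 1/2)
a0 <- 0.5; b0 <- 0.5
prior.mean <- a0 / (a0 + b0)               # guess before seeing any data

# Hypothetical observations: 3 passes and 1 failure
s <- 3; f <- 1
a1 <- a0 + s; b1 <- b0 + f                 # conjugate update: Beta(a0 + s, b0 + f)
post.mean <- a1 / (a1 + b1)
post.int <- qbeta(c(0.05, 0.95), a1, b1)   # 90% credible interval
```

With no data the answer is 0.5; every observation then pulls the estimate toward the observed pass rate.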

45 Estimation/Inference
Frequentists vs. Bayesians
Use both rather than selecting one of them; this is the way of the 21st century.
Large-scale inference: the prior can be set based on the data set ~ empirical Bayes

47 Classification/Clustering
Multi-dimensional/high-dimensional data come first.

49 Multi-dimensional/High-dimensional Data
There is no way to visualize high-dimensional data directly.
It is almost impossible for us to understand high-dimensional data as they are.

50 Multi-dimensional/High-dimensional Data
How many dimensions can we handle?
2D space or 3D space
Extra dimensions: Gray/Color scale, Arrows, Time

51 Multi-dimensional/High-dimensional Data
Dimension reduction
Pick 2 or 3 dimensions that seem important; visualization is then easy and we feel we can understand the data.

52 Multi-dimensional/High-dimensional Data
Dimension reduction
Pick 2 or 3 dimensions that seem important; visualization is then easy and we feel we can understand the data.
PCA (Principal Component Analysis)
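
A minimal PCA sketch on simulated data: 10 observed variables are generated from 2 hidden dimensions plus small noise (all sizes and the noise level are assumptions for illustration), and the first two principal components recover almost all of the variance.

```r
set.seed(1)
n <- 200
z <- matrix(rnorm(n * 2), ncol = 2)            # 2 hidden dimensions
W <- matrix(rnorm(2 * 10), nrow = 2)           # mixing into 10 observed variables
X <- z %*% W + matrix(rnorm(n * 10, sd = 0.1), ncol = 10)

pca <- prcomp(X, center = TRUE, scale. = FALSE)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)  # variance fraction per component
top2 <- sum(var.explained[1:2])                # share carried by the first two
plot(pca$x[, 1:2], pch = 20, main = "First two principal components")
```

Whether to set `scale. = TRUE` depends on whether the variables share a unit; here they do, so the data are only centered.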

53 Multi-dimensional/High-dimensional Data
Dimension reduction
Pick 2 or 3 dimensions that seem important; visualization is then easy and we feel we can understand the data.
Only a few dimensions are truly meaningful and all the others are noise. Pick the true dimensions.

54 Multi-dimensional/High-dimensional Data
Dimension reduction
Pick 2 or 3 dimensions that seem important; visualization is then easy and we feel we can understand the data.
Only a few dimensions are truly meaningful and all the others are noise. Pick the true dimensions.
LASSO, compressed sensing

55 Multi-dimensional/High-dimensional Data
The space is high-dimensional but the data are low-dimensional.
Manifold learning
Put data into a higher-dimensional space and pull them back to a low-dimensional space.

56 High-dimensionality Many genes, many biomarkers, many features

57 Multi-dimensional/High-dimensional Data
Life-science data are high-dimensional.
The number of observed items is huge, but the items are strongly correlated with one another, so their dimension is much smaller in reality.
FACS
Ethnic diversity

58 Multi-dimensional/High-dimensional Data
Objects with low dimensions in higher dimensional space Topology

59 Multi-dimensional/High-dimensional Data
Objects with low dimensions in higher dimensional space Topology Graph, network and topology

60 Multi-dimensional/High-dimensional Data
Graph: itemize, and connect related items.
Only pairwise relations are considered.

61 Multi-dimensional/High-dimensional Data
Graph: itemize, and connect related items.
Only pairwise relations are considered.
Three-way or higher-order relations are not represented.

62 Multi-dimensional/High-dimensional Data
Graph and its matrix representation and linear algebra

63 Multi-dimensional/High-dimensional Data
Graph and its matrix representation and linear algebra
Graphs tend to be sparse … Sparse analysis
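
A small sketch of a graph stored as an adjacency matrix, so that linear algebra applies to it. The 6-item chain graph is a hypothetical example.

```r
# Hypothetical undirected graph on 6 items, given as an edge list
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
n.node <- 6
A <- matrix(0, n.node, n.node)
A[edges] <- 1                        # a 2-column matrix indexes (row, column) pairs
A[edges[, 2:1]] <- 1                 # symmetric, because relations are undirected

degree <- rowSums(A)                 # number of neighbours of each item
density <- sum(A) / (n.node * (n.node - 1))   # fraction of possible edges present
paths2 <- A %*% A                    # entry [i, j]: number of 2-step walks from i to j
```

Most entries of `A` are zero, which is the sparsity the slide refers to; for large graphs a sparse matrix format would be the natural choice.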

64 Multi-dimensional/High-dimensional Data
Two important features:
No “common” individuals
Sparse

65 High-dimensionality: No commons
Central area: a sphere in a cube. In 2 dimensions (a circle in a square): 3.14 / 4 = 0.785
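
The same ratio can be computed for higher dimensions with the standard volume formula of a d-dimensional ball. The sketch below assumes a unit cube with an inscribed ball of radius 1/2; the range of dimensions is arbitrary.

```r
d <- 1:10
# Volume of a d-dimensional ball of radius r: pi^(d/2) / gamma(d/2 + 1) * r^d
ball <- pi^(d / 2) / gamma(d / 2 + 1) * 0.5^d   # inscribed ball, radius 1/2
cube <- 1                                        # unit cube
ratio <- ball / cube                             # share of the "central area"
round(ratio, 4)
```

The ratio shrinks quickly as the dimension grows, which is the sense in which almost no individual sits in the “common” central region.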

66 High-dimensionality Sparse
To estimate density, you need a reasonable number of samples per small cubic volume, but…
Dim = 1 : 0.1
Dim = 2 : 0.01
Dim = 3 : 0.001
….
Dim = 6 :
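
The shrinking cell volume can be tabulated directly. The sketch assumes cells of side 0.1 in a unit cube and, arbitrarily, about 10 samples per cell.

```r
d <- 1:6
cell.volume <- 0.1^d                 # fraction of the unit cube covered by one cell
n.cells <- 10^d                      # number of such cells
samples.needed <- 10 * n.cells       # assuming about 10 samples per cell
data.frame(d, cell.volume, n.cells, samples.needed)
```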

67 High-dimensionality Quite spacious, but reasonably dense distribution.
Distribution should be low dimensional.

69 Low dimensional distribution in higher dimensional space and its local density
Regular density estimation methods do not work: even small cubes are still spacious in high-dimensional space.
How to estimate local density: the k-nearest neighbor method
In graph theory, a similar idea is applicable: minimum spanning tree
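
A minimal sketch of the k-nearest-neighbor idea: the distance to the k-th nearest neighbor serves as a measure of local density. The data (a circle embedded in 10 dimensions with small noise) and `k = 10` are assumptions for illustration.

```r
set.seed(1)
# Hypothetical data: a 1-dimensional curve embedded in 10 dimensions
n <- 300
u <- runif(n)
X <- cbind(cos(2 * pi * u), sin(2 * pi * u),
           matrix(rnorm(n * 8, sd = 0.01), ncol = 8))

k <- 10
D <- as.matrix(dist(X))                         # all pairwise distances
# Distance to the k-th nearest neighbour; after sorting, position 1 is the
# point itself (distance 0), so the k-th neighbour is at position k + 1
knn.dist <- apply(D, 1, function(row) sort(row)[k + 1])
local.density <- 1 / knn.dist                   # smaller distance, higher density
```

No grid of small cubes is needed, so the approach keeps working when the ambient dimension is high but the data occupy a low-dimensional object.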

70 Sparse in highly dimensional space

71 Sparse in highly dimensional space
How sparse? One-dimensional manifolds But significant variance

73 Sparse in highly dimensional space
How sparse? One-dimensional manifolds But significant variance Clustering

75 Two types of clustering methods
Hierarchical Non-hierarchical

76 Hierarchical Tree structure --- Graph, again
Its structure has information.
Its structure is related to dimension.
On the tree, distance is defined.
Some phenomena have reasons to be analyzed hierarchically.
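
A minimal hierarchical-clustering sketch, including the distance that the tree itself defines (the cophenetic distance). The two simulated groups and the average-linkage method are assumptions for illustration.

```r
set.seed(1)
# Hypothetical data: two groups of 10 points in 5 dimensions
X <- rbind(matrix(rnorm(50, mean = 0), ncol = 5),
           matrix(rnorm(50, mean = 5), ncol = 5))
d <- dist(X)                             # original pairwise distances
tree <- hclust(d, method = "average")    # build the tree
groups <- cutree(tree, k = 2)            # cut the tree into 2 clusters
tree.dist <- cophenetic(tree)            # distance between items along the tree
agreement <- cor(d, tree.dist)           # how well the tree preserves distances
plot(tree)
```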

77 Classification Separate something difficult to segregate.
J. Med. Imag. 1(3), (Oct 09, 2014). doi: /1.JMI

78 Classification/Clustering
Unsupervised Learning Supervised Learning

79 Classification/Clustering
Unsupervised Learning
Supervised Learning
No teacher, but we want to check whether the classification criteria are reliable or not.
Cross-validation: one of the resampling methods
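
A minimal K-fold cross-validation sketch: each fold is held out once, a classifier is fitted on the rest, and accuracy is measured only on the held-out fold. The simulated data, the logistic-regression classifier, and `K = 5` are assumptions for illustration.

```r
set.seed(1)
n <- 200
label <- rep(c(0, 1), each = n / 2)
x <- rnorm(n, mean = 2 * label)          # hypothetical feature, shifted in class 1
dat <- data.frame(x, label)

K <- 5
fold <- sample(rep(1:K, length.out = n)) # random assignment of samples to folds
acc <- numeric(K)
for (j in 1:K) {
  train <- dat[fold != j, ]
  held.out <- dat[fold == j, ]
  fit <- glm(label ~ x, family = binomial, data = train)
  prob <- predict(fit, newdata = held.out, type = "response")
  acc[j] <- mean((prob > 0.5) == (held.out$label == 1))   # accuracy on unseen data
}
cv.accuracy <- mean(acc)
```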

81 Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data
Tests, Estimation/Inference, Classification/Clustering
Multi-dimensional/High-dimensional Data
Random value-based approaches
Others: Experimental Designs
A bit more on the high-dimension issue

82 Small n Large p
Sample size 100
Test association between a trait and expression of A gene: n = 100, p = 1 → Large n, Small p
Test association between a trait and expression of MANY genes: n = 100, p = 25000 → Small n, Large p

83 n = p: then you can find the perfect regression answer
q = a x; q = 3, x = 2 → Solvable
q1 = a x1 + b y1, q2 = a x2 + b y2 → Solvable
q1 = a x1 + b y1 + c z1, q2 = a x2 + b y2 + c z2, q3 = a x3 + b y3 + c z3 → Solvable

84 n << p
One set of variables gives a perfect answer.
Another set of variables gives a perfect but different answer.
Which answer is the truth?
Closer fitting is not always the best.
AIC ~ a simpler model is better
LASSO, Sparse
The assumption that k << n variables should be the answer is a “prior” belief in the Bayesian sense.
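
The “perfect but different answers” situation is easy to reproduce. In this sketch the trait `q` is pure noise and unrelated to any variable, yet any 5 of the 20 variables fit the 5 samples exactly; the sizes are assumptions for illustration.

```r
set.seed(1)
n <- 5; p <- 20
X <- matrix(rnorm(n * p), n, p)          # n samples, p variables, n << p
q <- rnorm(n)                            # trait values unrelated to X

coef.A <- solve(X[, 1:5], q)             # variables 1-5: n equations, n unknowns
coef.B <- solve(X[, 6:10], q)            # variables 6-10: another exact solution
resid.A <- max(abs(X[, 1:5] %*% coef.A - q))    # essentially zero
resid.B <- max(abs(X[, 6:10] %*% coef.B - q))   # essentially zero
```

Both fits are perfect on these samples and neither means anything, which is why a penalty or a prior favoring few variables is needed.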

86 Roles of statistics/data science for genome/omics
Quality Control of Noisy High-Throughput Data
Tests, Estimation/Inference, Classification/Clustering
Multi-dimensional/High-dimensional Data
Random value-based approaches ~ Monte-Carlo method
Others: Experimental Designs

87 Resampling
Estimation based on samples: Jack-knife (subsets), Bootstrap (replacement)
Statistical significance: Permutation ~ Exact probability
Cross-validation

88 Resampling
Estimation based on samples: Jack-knife (subsets), Bootstrap (replacement)
Statistical significance: Permutation ~ Exact probability
Cross-validation
Pseudo-random generators from computers

89 Pseudo-random number sequences
From uniform distribution From other known distributions

90 Pseudo-random number sequences
From uniform distribution From other known distributions From arbitrary distributions … Gibbs sampling

91 Pseudo-random number sequences
From uniform distribution
From other known distributions
From arbitrary distributions … Gibbs sampling
Using a Gibbs sampler, based on a stochastic model, estimate parameters of distributions and generate random values from the estimated distributions…
BUGS (Bayesian inference using Gibbs Sampling)
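
A minimal Gibbs-sampling sketch. The target here is a standard bivariate normal with correlation 0.8, chosen only because its conditional distributions are known in closed form; it is not the example used in the lecture.

```r
set.seed(1)
rho <- 0.8                               # assumed correlation of the target
n.iter <- 5000
x <- y <- numeric(n.iter)                # chain starts at (0, 0)
for (i in 2:n.iter) {
  # Draw each coordinate from its conditional distribution given the other
  x[i] <- rnorm(1, mean = rho * y[i - 1], sd = sqrt(1 - rho^2))
  y[i] <- rnorm(1, mean = rho * x[i], sd = sqrt(1 - rho^2))
}
keep <- 501:n.iter                       # discard the first 500 draws as burn-in
est.rho <- cor(x[keep], y[keep])         # should be close to rho
```

Only draws from simple one-dimensional conditionals are used, yet the pairs `(x, y)` end up distributed according to the joint target.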

92 Example
The fraction of red vs. green is repeatedly estimated.
Assuming that the red follows a non-central chi-square distribution, its non-centrality parameter is also repeatedly estimated.
Eventually, both unknown parameter values are estimated.

93 Pseudo-random number sequences
From uniform distribution
From other known distributions
From arbitrary distributions … Gibbs sampling
Using a Gibbs sampler, based on a stochastic model, estimate parameters of distributions and generate random values from the estimated distributions…
BUGS (Bayesian inference using Gibbs Sampling)
MCMC (Markov-Chain Monte-Carlo)
With Stan (a Bayesian estimation application)

94 Pseudo-random numbers, Monte-Carlo
Computer-driven methods

96 Experimental designs Various data sets
Using all kinds of them, what can we state?

98 Individual analysis/interpretation is tough enough; integration of them is tougher
Construct a model/assumption to integrate multiple sets.
There are variations in how to combine: order of combination, structure of combination, ….
Use raw data sets from multiple resources.
Integrate primary outputs from each data set: so-called meta-analyses.
Narrow-sense meta-analyses only combine outputs from analyses of similar data sets.
Difficulties are rooted in the heterogeneity among data resources.
Each data resource has its own way of analysis. The variations among analyses make the integration difficult.
Then, should each analysis method be unified???
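
As one concrete instance of a narrow-sense meta-analysis, the sketch below pools effect estimates from similar studies with inverse-variance weights (a fixed-effect model) and checks heterogeneity with Cochran's Q. The four estimates and standard errors are made-up numbers for illustration.

```r
# Hypothetical effect estimates and standard errors from 4 similar studies
est <- c(0.30, 0.10, 0.25, 0.40)
se <- c(0.10, 0.15, 0.08, 0.20)

w <- 1 / se^2                            # inverse-variance weights
pooled <- sum(w * est) / sum(w)          # fixed-effect pooled estimate
pooled.se <- sqrt(1 / sum(w))

# Cochran's Q: a rough check of heterogeneity among the studies
Q <- sum(w * (est - pooled)^2)
Q.p <- pchisq(Q, df = length(est) - 1, lower.tail = FALSE)
```

This only works when the studies estimate the same quantity in comparable ways, which is exactly the heterogeneity problem raised above.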

99 Some resources
http://statgenet-kyotouniv.wikidot.com/MasterCourse2017
The sites linked from it would be helpful to broaden and deepen your understanding of today’s lecture.

100 Genotypes and Phenotypes ~Data Records for Statistical Analyses~

101 Molecular Genetics at a glance
[Figure labels: Fertilized Egg, Birth, Childhood, Adolescent, Adult, Aging, Death, Development; Individual, Organ, Tissue, Cell, Gametes, Molecules; DNA, Genes, RNA, Proteins, Metabolites; Genome, Transcriptome, Proteome, Metabolome, Phenome, Omics; Phenotypes, Heredity, Genetic diversity, Phenotypic diversity, Diversity; Big Data, Stochastics, Statistics]

102 Genotypes and Phenotypes
Only one Spatio-Temporally Heterogeneous Chronologically and Spatially

103 DNA/Chromosome modification
DNA,Genome, Consistent Genotype Phenotype DNA/Chromosome modification Transcriptome Intermediate phenotype Protein/Proteome Phenotype/Phenome Phenotypes of Individuals Terminal phenotype

104 Time x Space of Individual
Cancer Functional Somatic Mutations Somatic Mosaicism Parents Germline Fertilized Egg Mutations

105 Diversity in Phenotypes
Easy to measure vs. Not-easy to measure
Representatives vs. Distribution itself
Many but Mutually Independent vs. Mutually Correlated

106 Representatives vs. Distribution itself
Temperature: a representative of a molecular population
Independent and Identically Distributed
Variable, Observation
Good-shaped distribution → Representatives → Parametric approach
Bad-shaped distribution → Distribution itself → Non-parametric approach
One sample is a set of observations
One sample gives a distribution → Are representatives enough?
(Image: ThermoFisher Scientific)

107 Many but Mutually Independent vs. Mutually Correlated
Multiple items are mutually correlated.
Chronological data (time-line)
Shape data (space continuity)
Movement data (Time x Space)
Pattern data (informational axes)
(Image sources: Yokogawa Electric; Nature 465, 918–921 (17 June 2010))

108 Estimation of parameter values
Data + Model → Estimation of parameter values

109 Summary for Genotypes and Phenotypes
Data + Model → Estimation of parameter values
To start your analysis, record “Values”.
“Values” take various shapes.
“Simple value”: a Number
“Number”: Natural Number, Integer, Rationals, Reals, Complex, Vector, Matrix, …
“Values” for analysis that are not “Numbers”: Mathematical models
Biological phenomena have random errors: Stochastic models, Statistical models
Models have parameters, so the “Values” of parameters are “Numbers” again.
“Simple values” are values of parameters in simple models.
Complex models and their parameters can also be values for your analysis.

111 Contents of Today Genotypes and Phenotypes ~Data Records for Statistical Analyses~ Overview of Statistical Methods

