Modelling heterogeneity in multi-gene data sets


1 Modelling heterogeneity in multi-gene data sets
Klaus Schliep, Barbara Holland, Mike Hendy, David Penny
Allan Wilson Centre, Palmerston North, NZ

2 Motivation
Phylogenomic data sets may involve hundreds of genes for many species. These data sets create challenges for current phylogenetic methods because different genes have different functions and hence evolve under different processes. One question is how best to model this heterogeneity to give reliable phylogenetic estimates of the species tree.

3 Example
Rokas et al. (2003) produced 106 gene trees for 8 yeast taxa:
C. albicans, S. kluyveri, S. kudriavzevii, S. bayanus, S. cerevisiae, S. paradoxus, S. mikatae, S. castellii

4 Two extremes
How many parameters do we need to adequately represent the branches of all (unrooted) gene trees?
Between 13 (a single consensus tree) and 13 × 106 = 1378 (separate branch lengths for every gene tree).
Too few parameters introduce bias.
Too many parameters increase the variance.
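
The two bounds follow from the fact that an unrooted binary tree on n taxa has 2n − 3 branches. A minimal sketch of the arithmetic, assuming fully resolved trees and one branch-length parameter per branch (the function name is illustrative only):

```python
def unrooted_branches(n_taxa):
    """Number of branches in a fully resolved unrooted tree on n_taxa leaves."""
    return 2 * n_taxa - 3

n_taxa, n_genes = 8, 106

# One shared set of branch lengths for all genes (consensus tree).
shared = unrooted_branches(n_taxa)

# A separate set of branch lengths for every gene tree.
separate = shared * n_genes

print(shared, separate)
```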

5 Stochastic partitioning
Attempts to cluster genes into classes that have evolved in a similar fashion. Each class is allowed its own set of parameters (e.g. branch lengths or model of nucleotide substitution)

6 Algorithm overview
1. Randomly assign the n genes to k classes.
2. Optimise the parameters for each class.
3. Compute the posterior probability of each gene under the parameters of each class.
4. Move each gene into the class for which it has the highest posterior probability.
5. If any gene changed class, go to step 2; otherwise stop.
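
The loop above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the authors' implementation: each gene is reduced to a vector of branch-length estimates, the "optimise parameters" step is replaced by a per-branch mean, and the phylogenetic likelihood is replaced by a toy score (negative squared distance). With equal class priors, picking the highest score stands in for picking the highest posterior probability. The initial assignment is a shuffled balanced labelling so that no class starts empty. All function names are hypothetical.

```python
import random

def log_score(gene, params):
    """Toy stand-in for a per-gene log-likelihood (assumption, not a real
    phylogenetic likelihood): negative squared distance to the class parameters."""
    return -sum((g - p) ** 2 for g, p in zip(gene, params))

def fit_class(members):
    """Toy 'optimise parameters' step: per-branch mean over the class members."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]

def stochastic_partition(genes, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: random assignment of the n genes to k classes
    # (balanced, then shuffled, so every class is non-empty when n >= k).
    labels = [i % k for i in range(len(genes))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        # Step 2: optimise parameters for each non-empty class.
        params = {}
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            if members:
                params[c] = fit_class(members)
        # Steps 3-4: score every gene under every class, move it to the best one.
        new_labels = [
            max(params, key=lambda c: log_score(g, params[c])) for g in genes
        ]
        # Step 5: stop when no gene changes class.
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Six hypothetical genes forming two clearly separated groups.
genes = [
    [0.10, 0.20], [0.12, 0.19], [0.11, 0.21],
    [0.90, 1.00], [0.88, 1.02], [0.91, 0.99],
]
labels = stochastic_partition(genes, k=2)
print(labels)
```

Like any hard-assignment scheme of this kind, the result can depend on the random start, so several restarts with different seeds are advisable in practice.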

16 How many classes?

17 Gene ontology

18 A different approach
Allow each gene to have its own set of parameters, but penalise models in which the parameters differ too much from one another.

19 Penalized (log-)likelihood
where θ_i are the parameters for the i-th gene tree, K is a symmetric matrix, and λ is the penalty term.

20 Number of parameters
Hastie and Tibshirani (1990) give an approximation for the number of degrees of freedom of a penalized likelihood estimator. This allows us to choose the best λ value using AIC or BIC.
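
One widely used approximation of this kind is the trace form below, given here as a hedged sketch rather than as the exact expression from the slide. It assumes the penalty enters the log-likelihood as (λ/2) θᵀKθ, that H is the negative Hessian of the unpenalized log-likelihood ℓ evaluated at the penalized estimate, and that N is the number of observations (for example alignment sites); H and N are names introduced here.

```latex
\mathrm{df}(\lambda) \;\approx\; \operatorname{tr}\!\left[ (H + \lambda K)^{-1} H \right],
\qquad
H = -\left.\frac{\partial^{2} \ell}{\partial \theta\, \partial \theta^{\top}}\right|_{\hat\theta_{\lambda}}

\mathrm{AIC}(\lambda) = -2\,\ell(\hat\theta_{\lambda}) + 2\,\mathrm{df}(\lambda),
\qquad
\mathrm{BIC}(\lambda) = -2\,\ell(\hat\theta_{\lambda}) + \mathrm{df}(\lambda)\,\log N
```

At λ = 0 the trace equals the full number of parameters, and it shrinks as λ grows, so df(λ) interpolates between the two extremes of a separate parameter set per gene and a heavily shared one. The λ that minimises AIC or BIC is then selected.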

32 Summary
Tame statisticians are useful too!

