Reconstructing Genealogies: a Bayesian approach
Dario Gasbarra, Matti Pirinen, Mikko Sillanpää, Elja Arjas
Department of Mathematics and Statistics, 4.1.2006

We all are related … but to different degrees …
- Consider a population evolving in time
- Inverse problem: suppose the current state of the process is known
  - individuals alive at the moment
- What was the path leading to this state?
  - family structures (pedigree)
  - inheritance patterns

Pedigrees
- Specify relationship categories: parent-offspring, full siblings / half siblings, first cousins, etc.
- In graphs
  - circles for females, squares for males
  - black nodes represent nuclear families
  - time runs downwards

Gene flow
- Alleles (i.e. different variants of the same gene) flow through the pedigree
- Gene flow gives us a means to quantify the degree of relatedness between individuals
  - How much of their genome do two individuals share?
  - At what loci do they have identical alleles?
[Figure labels: DNA, chromosome, allele]

Gene flow
Two alleles may be identical
- by state (IBS): they have the same DNA sequence
- by descent (IBD): they descend from the same ancestral allele within a given reference frame
Here the children share allele 1 IBS, but not IBD (w.r.t. their parents’ generation).

Meiosis
- When gametes are formed, the paternal and the maternal chromosomes (haplotypes) may cross over and recombine

Haldane’s model of recombination
- The recombination fraction θ between two loci on the same chromosome is the proportion of meioses in which a recombination event (i.e. an odd number of crossovers) takes place between the loci
- Haldane’s model assumes that crossovers occur independently along each chromosome, so a Poisson process model follows
[Figure: chromosome with recombination fractions 17%, 9% and 9.5% marked]
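As a rough illustration of the Poisson-process model, Haldane's assumption gives a closed-form map function: for a map distance d in Morgans, θ = (1 − exp(−2d)) / 2. The sketch below is not taken from the slides, and the function names are illustrative.

```python
import math


def haldane_theta(distance_cm: float) -> float:
    """Recombination fraction for a map distance in centimorgans (Haldane)."""
    if distance_cm < 0:
        raise ValueError("map distance must be non-negative")
    d = distance_cm / 100.0  # centimorgans -> Morgans
    # Crossover count ~ Poisson(d); P(odd count) = (1 - exp(-2d)) / 2.
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def haldane_distance_cm(theta: float) -> float:
    """Inverse map function: map distance in centimorgans for a given theta."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * theta)


def combine_adjacent(theta_1: float, theta_2: float) -> float:
    """Recombination fraction across two adjacent intervals (no interference)."""
    # An overall recombinant needs recombination in exactly one interval.
    return theta_1 * (1.0 - theta_2) + theta_2 * (1.0 - theta_1)
```

For instance, `haldane_theta(5.3)` is about 0.05, and `combine_adjacent(0.09, 0.095)` is about 0.168 rather than 0.185, so recombination fractions are not additive. This would be consistent with the 9%, 9.5% and 17% labels on the slide if they mark two adjacent intervals and their union, which is an assumption about the lost figure.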

The frame for study
From now on we assume that we have fixed
- a population whose size we know for T-1 (non-overlapping) generations backwards in time
- N sampled individuals from the current generation
- a marker map with M markers and known recombination fractions
- allele frequencies at the population level for each of the markers

A (prior) model for a possible history
- A configuration C consists of
  - a pedigree
  - allelic paths
- Specify probabilities for
  - the pedigree graph, P_g(C)
  - recombination events, P_r(C)
  - founder alleles, P_a(C)
- The total probability for C is P(C) = P_g(C) × P_r(C) × P_a(C)

A probability model for pedigrees
For a fixed
- number of generations, T-1, backwards in time
- population size in each generation (number of ♂ and ♀)
- sample of size N from the current generation
- mating parameters α and β
To simulate a pedigree from the distribution:
- Proceed generation by generation from 0, …, T-1.
- Let children choose parents according to a Pólya urn scheme, where α affects the correlation of the choices of fathers and β affects the correlation of the choices of mothers given the choices of fathers.
Gasbarra D, Sillanpää M, Arjas E (2005) Backward Simulation of Ancestors of Sampled Individuals. Theor Pop Biol 67:75-83.
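The following is a minimal sketch of a generic Pólya-urn-type choice of fathers, meant only to show how a single parameter can control how strongly children cluster on the same parent. The exact scheme is defined in Gasbarra et al. (2005) and is not reproduced here: the weight `alpha + count` and the function name are assumptions, and the choice of mothers given fathers (parameter β) is omitted.

```python
import random


def choose_fathers(n_children, n_males, alpha, rng):
    """Sequentially assign a father index to each child (Polya-urn-type rule).

    Hypothetical illustration: the way alpha enters is an assumption,
    not the parametrisation of the cited paper.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    children_per_male = [0] * n_males
    fathers = []
    for _ in range(n_children):
        # Weight of a male = alpha + number of children who already chose him.
        weights = [alpha + c for c in children_per_male]
        father = rng.choices(range(n_males), weights=weights, k=1)[0]
        children_per_male[father] += 1
        fathers.append(father)
    return fathers


fathers = choose_fathers(n_children=39, n_males=20, alpha=0.5, rng=random.Random(1))
```

In this sketch a small `alpha` makes earlier choices dominate, so a few males father most children, while a large `alpha` approaches independent uniform choices.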

Examples with different parameters
- Left: a few dominant males + monogamy
- Middle: a few dominant males
- Right: random mating

Probability for allelic paths
- For each non-founder haplotype in the pedigree form the expression [formula not preserved in this transcript]
- Take the product of these over all haplotypes to obtain P_r(C)
- Consider all founder alleles and take the product of the corresponding population allele frequencies to get P_a(C)
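The per-haplotype expression is not preserved in this transcript, so the sketch below uses one standard form under no interference and is not necessarily the slide's formula. The first marker is copied from either parental haplotype with probability 1/2, and each adjacent marker pair contributes θ if the origin switches and 1 − θ otherwise. The function names and data layout are hypothetical.

```python
import math


def log_prob_recombination(origins, thetas):
    """Log-probability of the parental origins along one transmitted haplotype.

    origins: 0/1 per marker (which of the parent's two haplotypes was copied).
    thetas: recombination fractions between adjacent markers, each in (0, 1).
    """
    if len(thetas) != len(origins) - 1:
        raise ValueError("need one theta per adjacent marker pair")
    log_p = math.log(0.5)  # first marker: either parental haplotype, equally likely
    for prev, cur, theta in zip(origins, origins[1:], thetas):
        log_p += math.log(theta if prev != cur else 1.0 - theta)
    return log_p


def log_prob_founder_alleles(founder_alleles, frequencies):
    """Sum of log population frequencies over all founder alleles.

    founder_alleles: iterable of (marker_index, allele) pairs.
    frequencies: list of dicts, frequencies[marker_index][allele] -> frequency.
    """
    return sum(math.log(frequencies[m][a]) for m, a in founder_alleles)
```

Summing `log_prob_recombination` over all non-founder haplotypes would give log P_r(C) under these assumptions, and `log_prob_founder_alleles` gives log P_a(C).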

Data
- Assume that we also have genotype data of the sampled individuals on M markers
- The (posterior) probability in our model is π(C) ∝ P_g(C) × P_r(C) × P_a(C) × 1(C consistent with the data)
- We are able to sample efficiently from the prior but not from the posterior

Markov chain Monte Carlo sampling
- We generate a Markov chain whose state space consists of all configurations consistent with the data and whose stationary distribution is our posterior (Metropolis-Hastings algorithm)
- If this chain is irreducible, then the expected values of functions defined on the space of configurations can be approximated with sample averages, for example
  - haplotype configurations
  - IBD sharing between individuals

Metropolis-Hastings algorithm
- The M-H algorithm produces a chain of configurations: at each step a new configuration is proposed (from some proposal distribution) and is either accepted or rejected according to the M-H acceptance rule
- Good proposals that are accepted fairly often are needed so that the chain moves around the state space within a reasonable amount of time
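A minimal sketch of a generic Metropolis-Hastings loop may help fix ideas. It is not the authors' sampler: the toy target on {0, …, 4}, the ±1 proposal, and all names are assumptions made for illustration. A state with log-target −∞ plays the role of a configuration inconsistent with the data and is never accepted.

```python
import math
import random


def metropolis_hastings(initial, log_target, propose, n_steps, rng):
    """Generic M-H sampler; returns the list of visited states."""
    state = initial
    chain = []
    for _ in range(n_steps):
        candidate, log_q_forward, log_q_reverse = propose(state, rng)
        # Log of the M-H ratio: target ratio times reverse/forward proposal ratio.
        log_ratio = (log_target(candidate) - log_target(state)
                     + log_q_reverse - log_q_forward)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            state = candidate  # accept; otherwise the chain stays where it is
        chain.append(state)
    return chain


WEIGHTS = [1.0, 2.0, 4.0, 2.0, 1.0]  # unnormalised toy target on {0, ..., 4}


def toy_log_target(x):
    return math.log(WEIGHTS[x]) if 0 <= x < len(WEIGHTS) else -math.inf


def toy_propose(x, rng):
    # Symmetric +/-1 proposal, so forward and reverse log-densities cancel.
    return x + rng.choice([-1, 1]), 0.0, 0.0


chain = metropolis_hastings(2, toy_log_target, toy_propose, 2000, random.Random(0))
```

Sample averages over `chain` then approximate expectations under the normalised target, which is the role the posterior π(C) plays in the talk.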

Proposals
- Highly dependent variables (close relatives and linked markers) require large block updates
- Different versions of proposals
  - A (randomly chosen) group of children chooses (possibly new) parents, transmitting their alleles to these parents
  - All children of a particular father/mother choose a (possibly new) mother/father, transmitting their alleles to her/him
  - One child at a time chooses new parent(s), transmitting alleles to them
  - All children within the group jointly choose new parents and transmit alleles
  - The pedigree is not changed but new allele paths are proposed

An example
- Simulated pedigree
  - 10 generations
  - youngest generation: 39 individuals divided into 13 nuclear families
  - population: 200 founders, growing exponentially by 1.2

Example continues…
- Simulated gene flow on the pedigree
  - 20 markers
  - 10 equally frequent alleles at each locus in the founder generation
  - Haldane’s model of recombination (no interference)
  - spacing between adjacent markers 5.3 cM (i.e. recombination fraction 0.05)

Reconstruction
- We gave the algorithm
  - the genotype data on the youngest generation
  - the (correct) marker map
  - the (correct) allele frequencies
  - the population structure
- The algorithm was run for 500,000 iterations

Reconstructing the pedigree

Reconstructing the haplotypes
- Each individual (in diploid species) carries two copies of each chromosome
- One is inherited from the father (mother) and is called the paternal (maternal) haplotype
- Genotyping does not (usually) determine which multilocus allelic combination is inherited from the same parent
  - from the lab: {1,2} x {4,3}
  - true haplotypes may be either (13,24) or (14,23)
- There exist two kinds of haplotyping methods
  - pedigree based (SimWalk2, Merlin, Genehunter)
  - population based (PHASE, HAPLOFREQ)
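To make the phase ambiguity concrete, the sketch below enumerates every unordered haplotype pair compatible with a set of unphased genotypes. The function name and the tuple-based representation are illustrative choices, not part of the talk.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """All unordered haplotype pairs consistent with unphased genotypes.

    genotypes: list of (allele, allele) tuples, one per locus.
    """
    pairs = set()
    for flips in product([False, True], repeat=len(genotypes)):
        # A flip decides which allele of the locus goes on the first haplotype.
        hap_1 = tuple(b if flip else a for (a, b), flip in zip(genotypes, flips))
        hap_2 = tuple(a if flip else b for (a, b), flip in zip(genotypes, flips))
        # Sorting makes (h1, h2) and (h2, h1) the same unordered pair.
        pairs.add(tuple(sorted((hap_1, hap_2))))
    return pairs


pairs = compatible_haplotype_pairs([(1, 2), (4, 3)])
```

For the slide's genotypes {1,2} x {4,3}, `pairs` contains exactly the two options (13,24) and (14,23). With h ≥ 1 heterozygous loci there are 2^(h−1) such pairs, which is why phasing long multilocus genotypes needs extra information from the pedigree or the population.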

Reconstructing the haplotypes
- The accuracy of the haplotype reconstruction can be measured with the concept of switch distance (SD)
- The SD between two pairs of haplotypes is the number of phase relations between neighboring loci that need to be changed in order to turn the first pair of haplotypes into the other
- If the correct haplotypes were (111111,222222), then
  - (111222,222111) has SD=1
  - (112211,221122) has SD=2
  - (121212,212121) has SD=5
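The switch distance can be computed with a short function, sketched here under two assumptions: haplotypes are given as equal-length strings with one character per locus, and homozygous loci (which carry no phase) are skipped, so "neighboring" means neighboring heterozygous loci. The function name is illustrative.

```python
def switch_distance(true_pair, estimated_pair):
    """Number of phase switches needed to turn one haplotype pair into the other."""
    true_1, true_2 = true_pair
    est_1, est_2 = estimated_pair
    if not (len(true_1) == len(true_2) == len(est_1) == len(est_2)):
        raise ValueError("all haplotypes must have the same length")
    phases = []
    for t1, t2, e1, e2 in zip(true_1, true_2, est_1, est_2):
        if sorted((t1, t2)) != sorted((e1, e2)):
            raise ValueError("the two pairs imply different genotypes")
        if t1 != t2:
            # True if the estimated first haplotype carries the true first allele.
            phases.append(e1 == t1)
    # Each change of orientation between consecutive loci is one switch.
    return sum(p != q for p, q in zip(phases, phases[1:]))


truth = ("111111", "222222")
distances = [
    switch_distance(truth, ("111222", "222111")),
    switch_distance(truth, ("112211", "221122")),
    switch_distance(truth, ("121212", "212121")),
]
```

Here `distances` evaluates to `[1, 2, 5]`, matching the three cases listed on the slide.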

Reconstructing the haplotypes The SDs between the reconstructed and the true haplotype pairs of the youngest generation (sum over all 39 individuals)

Reconstructing the IBD sharing
- We consider as IBD (identical by descent) those alleles that trace back to a common ancestral allele at the founder level (9 generations backwards in time)
- It is possible to calculate a single quantity that measures the proportion of the genome that two individuals share (the coefficient of relatedness r)
- It is also possible to compare the IBD sharing more accurately along the chromosome

Reconstructing IBD
- The reconstructed relatedness coefficients for each of the 741 pairs of individuals in the youngest generation were compared with the true values (sum of squared errors shown)

Comparison with IBS-based estimators
[Figure: distribution of L_2 errors (741 values). Sums: 1.93, 3.25, 3.27, 3.51]

Reconstructing IBD

Another example of pedigree reconstruction
- Population with 200 individuals, 50 markers / 9 alleles

Future work
- Possibility of fixing some parts of the pedigree
- Extending partially known genotype data to the known pedigree in accordance with the Mendelian rules of inheritance is in general an NP-complete problem
[Figure: example pedigree annotated with two-allele genotypes (a/b, a/f, f/c, …)]

Future work with the reconstruction algorithm
- Adding a QTL (quantitative trait locus) model to the algorithm
  - Does phenotype correlate with IBD sharing at some chromosomal region(s)?
- Running many chains in parallel "in different temperatures"
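Running chains "in different temperatures" presumably refers to parallel tempering (Metropolis-coupled MCMC), although the slides give no details. As a hedged sketch, assume chain k targets π(x)^β_k for an inverse temperature β_k; the standard rule for exchanging the current states of two chains is then as follows. The function name and arguments are hypothetical.

```python
import math
import random


def swap_accepted(log_target_i, log_target_j, beta_i, beta_j, rng):
    """Decide whether chains i and j exchange their current states.

    log_target_k: log pi evaluated at the current state of chain k.
    beta_k: inverse temperature of chain k (chain k targets pi ** beta_k).
    """
    # Log of [pi(x_j)^b_i * pi(x_i)^b_j] / [pi(x_i)^b_i * pi(x_j)^b_j].
    log_ratio = (beta_i - beta_j) * (log_target_j - log_target_i)
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)
```

Hotter chains (small β) move more freely between configurations, and accepted swaps let the cold chain (β = 1) inherit those moves while keeping the posterior as its stationary distribution.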

Thanks Dario Gasbarra, Mikko Sillanpää and Matti Pirinen
