Coalescent Module - Faro, July 26th-28th 04 (www.coalescent.dk)

Presentation transcript:

Monday
  H: The Basic Coalescent
  W: Forest Fire
  W: The Coalescent + History, Geography & Selection
  H: The Coalescent with Recombination
Tuesday
  H: Recombination cont.
  W: The Coalescent & Combinatorics
  HW: Computer Session
  H: The Coalescent & Human Evolution
Wednesday
  H: The Coalescent & Statistics
  HW: Linkage Disequilibrium Mapping

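Zooming in! (from Harding + Sanger)
Figure: successive magnifications from the whole genome ($3 \times 10^9$ bp) via chromosome 11 to the β-globin region ($6 \times 10^4$ bp; 5' flanking, Exon 1, Exon 2, Exon 3, 3' flanking), then $3 \times 10^3$ bp, then a 30 bp stretch: ATTGCCATGTCGATAATTGGACTATTTTTTTTTT. Magnification labels in the figure: *5.000, *20, *10^3.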

Human Migrations (from Cavalli-Sforza, 2001)

Data: β-globin from sampled humans (from Griffiths, 2001).
Assume:
1. At most 1 substitution per position.
2. No recombination.
Reducing nucleotide columns to bipartitions gives a bijection between data & unrooted gene trees.

Simplified model of human sequence evolution.
Figure: two populations, Africa and Non-Africa, drawn on a time axis from Past to Present; one figure label reads 0.2.
Mutation rate: 2.5
Rate of common ancestry: 1
Wait to common ancestry: $2N_e$

From Griffiths, 2001

Models and their benefits.
Models + Data give:
1. probability of data (statistics...)
2. probability of individual histories
3. hypothesis testing
4. parameter estimation

Coalescent Theory in Biology (www.coalescent.dk)
Diagram elements:
Fixed Parameters: Population Structure, Mutation, Selection, Recombination,...
Reproductive Structure
Genealogies of non-sequenced data
Genealogies of sequenced data (example sequences: TGTTGT, CATAGT, CGTTAT)
Parameter Estimation
Model Testing

Wright-Fisher Model of Population Reproduction
Haploid Model:
i. Individuals are made by sampling with replacement in the previous generation.
ii. The probability that 2 alleles have the same ancestor in the previous generation is 1/(2N).
Diploid Model:
Individuals are made by sampling, with replacement, one chromosome from the females and one from the males of the previous generation.
Assumptions:
1. Constant population size
2. No geography
3. No selection
4. No recombination
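The haploid sampling rule can be checked with a small simulation. This is only a sketch: the function names, the number of alleles and the number of trials are illustrative assumptions, not part of the slides.

```python
import random

def wright_fisher_parents(num_alleles, rng):
    """Each allele picks its parent uniformly, with replacement,
    among the num_alleles (= 2N) alleles of the previous generation."""
    return [rng.randrange(num_alleles) for _ in range(num_alleles)]

def same_parent_frequency(num_alleles, trials, seed=1):
    """Estimate how often alleles 0 and 1 share a parent one generation back."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        parents = wright_fisher_parents(num_alleles, rng)
        if parents[0] == parents[1]:
            hits += 1
    return hits / trials
```

With a hypothetical 2N = 10, `same_parent_frequency(10, 20000)` should land close to 1/(2N) = 0.1.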

10 Alleles’ Ancestry for 15 generations

Waiting for most recent common ancestor - MRCA
Distribution of the time until 2 alleles had a common ancestor, $X_2$:
$P(X_2 > 1) = (2N-1)/2N = 1 - 1/(2N)$
$P(X_2 > j) = (1 - 1/(2N))^j$
$P(X_2 = j) = (1 - 1/(2N))^{j-1} \cdot 1/(2N)$
Mean: $E(X_2) = 2N$.
Ex.: 2N = [value missing], generation time 30 years, $E(X_2)$ = [value missing] years.
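The geometric waiting time above can be evaluated directly. A minimal sketch, with hypothetical function names and an arbitrary example value for 2N:

```python
def prob_no_coalescence(two_n, j):
    """P(X_2 > j): the two lineages pick different parents j generations in a row."""
    return (1 - 1 / two_n) ** j

def prob_coalescence_at(two_n, j):
    """P(X_2 = j): j-1 generations with different parents, then a shared parent."""
    return (1 - 1 / two_n) ** (j - 1) * (1 / two_n)

def mean_waiting_time(two_n, max_generations):
    """Truncated sum of j * P(X_2 = j); approaches 2N as max_generations grows."""
    return sum(j * prob_coalescence_at(two_n, j)
               for j in range(1, max_generations + 1))
```

For an illustrative 2N = 20, `mean_waiting_time(20, 2000)` agrees with E(X_2) = 2N = 20 up to a negligible truncation error.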

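P(k) := P{k alleles had k distinct parents}
$P(k) = \frac{2N(2N-1)\cdots(2N-(k-1))}{(2N)^k} =: \frac{(2N)_{[k]}}{(2N)^k}$
Ancestor choices:
k -> any: $(2N)^k$
k -> k: $(2N)_{[k]}$
k -> k-1: $\binom{k}{2}(2N)_{[k-1]}$
k -> j: $S_{k,j}(2N)_{[j]}$
For k << 2N: $P(k) \approx 1 - \binom{k}{2}/(2N) \approx e^{-\binom{k}{2}/(2N)}$
$S_{k,j}$: the number of ways to group k labelled objects into j groups (Stirling numbers of the second kind).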
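The counting argument can be verified numerically. In this sketch the function names are hypothetical, and the approximation used for k << 2N is the standard first-order one, 1 - C(k,2)/(2N). Summing S(k,j) * (2N)_[j] over j must give back all (2N)^k ancestor choices.

```python
import math

def falling_factorial(n, j):
    """(n)_[j] = n * (n-1) * ... * (n-(j-1))."""
    result = 1
    for i in range(j):
        result *= n - i
    return result

def stirling_second_kind(k, j):
    """Number of ways to group k labelled objects into j non-empty groups."""
    if k == j:
        return 1
    if j == 0 or j > k:
        return 0
    return j * stirling_second_kind(k - 1, j) + stirling_second_kind(k - 1, j - 1)

def prob_distinct_parents(k, two_n):
    """Exact P(k) = (2N)_[k] / (2N)^k."""
    return falling_factorial(two_n, k) / two_n ** k

def prob_distinct_parents_approx(k, two_n):
    """First-order approximation 1 - C(k,2)/(2N), intended for k << 2N."""
    return 1 - math.comb(k, 2) / two_n
```

For example, with k = 5 and 2N = 10000 the exact value and the approximation differ by less than one part in a million.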

Geometric/Exponential Distributions
The Geometric Distribution on {1,..}, Geo(p): $P(Z = j) = p^{j-1}(1-p)$, $P(Z > j) = p^j$, $E(Z) = 1/(1-p)$.
The Exponential Distribution on R+, Exp(a): density $f(t) = a e^{-at}$, $P(X > t) = e^{-at}$.
Properties, for X ~ Exp(a) and Y ~ Exp(b) independent:
i. $P(X > t_2 | X > t_1) = P(X > t_2 - t_1)$ for $t_2 > t_1$.
ii. $E(X) = 1/a$.
iii. $P(Z > t) = (\approx) P(X > t)$ for small a, with $p = e^{-a}$.
iv. $P(X < Y) = a/(a + b)$.
v. min(X,Y) ~ Exp(a + b).
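Properties iv and v can be checked by simulation. A rough sketch follows; the function name, rates, trial count and seed are illustrative assumptions.

```python
import random

def check_exponential_properties(a, b, trials, seed=1):
    """Return (frequency of X < Y, sample mean of min(X, Y))
    for independent X ~ Exp(a) and Y ~ Exp(b)."""
    rng = random.Random(seed)
    x_first = 0
    min_total = 0.0
    for _ in range(trials):
        x = rng.expovariate(a)  # rate parameter a, mean 1/a
        y = rng.expovariate(b)  # rate parameter b, mean 1/b
        if x < y:
            x_first += 1
        min_total += min(x, y)
    return x_first / trials, min_total / trials
```

With a = 1 and b = 3, the first value should be near a/(a+b) = 0.25 and the second near 1/(a+b) = 0.25, the mean of Exp(a+b).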

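Discrete → Continuous Time
$t_c := t_d / 2N_e$; one unit of continuous time corresponds to 2N generations.
Figure: a time axis on which 6 generations of discrete time map to $6/2N_e$ in continuous time.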

Adding Mutations
m: mutation pr. nucleotide pr. generation.
L: seq. length.
µ = m*L: mutation pr. allele pr. generation.
$2N_e$: allele number.
θ := 4N*µ: mutation intensity in scaled process.
Figure: discrete time and discrete sequence (steps $1/(2N_e)$ in time and 1/L along the sequence) pass to continuous time and continuous sequence. Rates in the scaled process: mutation θ/2, coalescence 1.
Probability for two genes being identical: P(Coalescence < Mutation) = 1/(1+θ).
Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.
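The identity probability follows from a race between one coalescence clock (rate 1) and two mutation clocks (rate θ/2 on each lineage). The sketch below compares the formula with a simulation; all names and example values are hypothetical.

```python
import random

def scaled_mutation_rate(n, m, seq_length):
    """theta = 4 * N * mu with mu = m * L."""
    return 4 * n * m * seq_length

def prob_identical(theta):
    """P(Coalescence < Mutation) = 1 / (1 + theta)."""
    return 1 / (1 + theta)

def simulate_identical(theta, trials, seed=1):
    """Fraction of trials in which two lineages coalesce before either mutates.
    Requires theta > 0."""
    rng = random.Random(seed)
    identical = 0
    for _ in range(trials):
        coalescence = rng.expovariate(1.0)
        mutation_a = rng.expovariate(theta / 2)
        mutation_b = rng.expovariate(theta / 2)
        if coalescence < min(mutation_a, mutation_b):
            identical += 1
    return identical / trials
```

For θ = 1 both the formula and the simulated frequency should be close to 0.5.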

The Standard Coalescent
Two independent processes:
Continuous: exponential waiting times.
Discrete: choosing pairs to coalesce.
Example with 5 alleles, alternating waiting and coalescing:
{1}{2}{3}{4}{5}
{1}{2}{3}{4,5} (coalescing 4--5)
{1}{2}{3,4,5} (coalescing 3--(4,5))
{1,2}{3,4,5} (coalescing 1--2)
{1,2,3,4,5} (coalescing (1,2)--(3,(4,5)))
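The two processes can be combined into a short simulation. This is a sketch under the usual assumption that, with n lineages, the waiting time is exponential with rate C(n,2) in scaled time and the coalescing pair is chosen uniformly; the function name and return format are illustrative.

```python
import random

def simulate_coalescent(k, seed=None):
    """Return a list of (time, partition) states, from k singletons to one block."""
    rng = random.Random(seed)
    blocks = [frozenset([i]) for i in range(1, k + 1)]
    time = 0.0
    history = [(time, list(blocks))]
    while len(blocks) > 1:
        n = len(blocks)
        # Continuous part: exponential waiting time with rate C(n, 2).
        time += rng.expovariate(n * (n - 1) / 2)
        # Discrete part: choose a uniformly random pair of blocks to merge.
        i, j = rng.sample(range(n), 2)
        merged = blocks[i] | blocks[j]
        blocks = [b for idx, b in enumerate(blocks) if idx not in (i, j)]
        blocks.append(merged)
        history.append((time, list(blocks)))
    return history
```

Calling `simulate_coalescent(5, seed=3)` yields five states with 5, 4, 3, 2 and 1 blocks, analogous to the partition sequence on the slide.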

Expected Height and Total Branch Length
Expected total height of tree: $H_k = 2(1 - 1/k)$.
i. Infinitely many alleles find 1 ancestor in finite time.
ii. It takes less than twice as long for k alleles to find 1 ancestor as it does for 2 alleles.
Expected total branch length in tree: $L_k = 2(1 + 1/2 + 1/3 + \dots + 1/(k-1)) \approx 2\ln(k-1)$.
Figure: table of time epochs with their durations and branch lengths.
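Both expectations follow from summing over epochs: while j lineages remain, the expected duration is 1/C(j,2), and the j branches together contribute j/C(j,2) = 2/(j-1). A minimal sketch with hypothetical function names:

```python
import math

def expected_height(k):
    """Sum of expected epoch durations 1/C(j,2) for j = 2..k."""
    return sum(1 / math.comb(j, 2) for j in range(2, k + 1))

def expected_total_branch_length(k):
    """Each epoch with j lineages contributes j * 1/C(j,2) = 2/(j-1)."""
    return sum(j / math.comb(j, 2) for j in range(2, k + 1))
```

The first sum telescopes to 2(1 - 1/k), so it stays below 2 for every k, which is twice the expected height for 2 alleles.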

Kingman (Stoch. Proc. & Appl. + other articles, 1982)
A. Stochastic processes on equivalence relations. $\Delta = \{(i,i);\ i = 1,..,n\}$, $\Theta = \{(i,j);\ i,j = 1,..,n\}$.
$q_{s,t} = 1$ if $s < t$, 0 otherwise.
This defines a process, $R_t$, going from $\Delta$ to $\Theta$ through equivalence relations on {1,..,n}.
B. The Paint Box & exchangeable distributions on partitions.
C. All coalescents are restrictions of "The Coalescent", a process with entrance boundary infinity.
D. Robustness of "The Coalescent": if the offspring distribution is exchangeable and $Var(\nu_1) \to \sigma^2$ & $E(\nu_1^m) < M_m$ for all m, then genealogies follow "The Coalescent" in distribution.
E. A series of combinatorial results.

Effective Population Size, $N_e$.
In an idealised Wright-Fisher model:
i. Variation is reduced by a factor 1-1/(2N) per generation.
ii. Expected waiting time for two random alleles to find a common ancestor is 2N.
Factors that influence $N_e$:
i. Variance in offspring number. WF: 1. If variance is higher, then effective population size is smaller.
ii. Population size variation, example k cycle: $N_1, N_2, .., N_k$. $k/N_e = 1/N_1 + \dots + 1/N_k$. $N_1 = 10$, $N_2 = 1000$ => $N_e \approx 19.8$.
iii. Two sexes: $N_e = 4N_f N_m/(N_f + N_m)$, i.e. $N_f = 10$, $N_m$ = [value missing] => $N_e \approx 40$.
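The two formulas for $N_e$ are easy to evaluate. In this sketch the function names are hypothetical, and the male population size in the usage note is an assumed value chosen only for illustration.

```python
def effective_size_cycle(sizes):
    """Harmonic mean: k / N_e = 1/N_1 + ... + 1/N_k."""
    return len(sizes) / sum(1 / n for n in sizes)

def effective_size_two_sexes(n_f, n_m):
    """N_e = 4 * N_f * N_m / (N_f + N_m)."""
    return 4 * n_f * n_m / (n_f + n_m)
```

For a 2-cycle with sizes 10 and 1000 the harmonic-mean formula gives about 19.8, dominated by the small generation. With N_f = 10 and an assumed N_m = 1000 the two-sex formula gives about 39.6; for N_f = 10 it can never exceed 40.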

6 realisations with 25 leaves.
Observations: variation is great close to the root; trees are unbalanced.

Sampling more sequences
The probability that the most recent common ancestor of the sample of size n is also the most recent common ancestor of a sub-sample of size k is $\frac{(k-1)(n+1)}{(k+1)(n-1)}$. Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.
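The limit (k-1)/(k+1) can be checked numerically. This sketch assumes the standard coalescent result (k-1)(n+1)/((k+1)(n-1)) for the finite-n probability; the function name is hypothetical.

```python
from fractions import Fraction

def prob_subsample_shares_mrca(k, n):
    """Probability that a sub-sample of size k has the same MRCA as the
    full sample of size n (assumed formula, valid for 2 <= k <= n)."""
    return Fraction((k - 1) * (n + 1), (k + 1) * (n - 1))
```

For k = 10 the value approaches 9/11 as n grows, so a sub-sample of only 10 sequences already reaches the root of a very large sample with probability above 0.8.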

Three Models of Alleles and Mutations.
Infinite Allele:
i. Only identity, non-identity is determinable.
ii. A mutation creates a new type.
Infinite Site:
i. Allele is represented by a line.
ii. A mutation always hits a new position.
Finite Site:
i. Allele is represented by a sequence.
ii. A mutation changes nucleotide at chosen position.
Figure: example sequences acgtgctt, acgtgcgt, acctgcat, tcctggct, tcctgcat.

Infinite Allele Model

Final Aligned Data Set: Infinite Site Model

Number of paths:

Labelling and unlabelling: positions and sequences.
Figure: the same data shown while ignoring mutation position and while ignoring sequence label.
The forward-backward argument:
9 coalescence events incompatible with data.
4 classes of mutation events incompatible with data.

Infinite Site Model: An example Theta=

Impossible Ancestral States

Finite Site Model
Final Aligned Data Set: acgtgctt, acgtgcgt, acctgcat, tcctgcat

Simplifying assumptions
1) Only substitutions. (Figure: s1 TCGGTA, s2 TGGT-T versus s1 TCGGA, s2 TGGTT.)
2) Processes in different positions of the molecule are independent.
3) A nucleotide follows a continuous time Markov Chain.
4) Time reversibility: i.e. $\pi_i P_{i,j}(t) = \pi_j P_{j,i}(t)$, where $\pi_i$ is the stationary probability of i. This implies that two nucleotides $N_1$ and $N_2$ descending from a common ancestor a along branches of length $l_1$ and $l_2$ have the same probability as $N_1$ evolving into $N_2$ along a single branch of length $l_1 + l_2$.
5) The rate matrix, Q, for the continuous time Markov Chain is the same at all times.

Evolutionary Substitution Process
$P_{i,j}(t)$ = probability of going from i to j in time t.
Figure: a lineage observed at times $t_1$ and $t_2$, with nucleotide labels C, C, A.

Jukes-Cantor 69: Total Symmetry.
Rate matrix (FROM in rows, TO in columns):
       A      C      G      T
A    -3α      α      α      α
C      α    -3α      α      α
G      α      α    -3α      α
T      α      α      α    -3α
A. Stationary distribution: (.25, .25, .25, .25)
B. Expected number of substitutions: 3αt
Figure: sequences ATTGTGTATATAT….CAG and ATTGCGTATCTAT….CCG at times 0 and t; species labels Higher Cells, Chimp, Mouse, Fish, E.coli.
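For this rate matrix the transition probabilities have the well-known closed form $P_{i,i}(t) = 1/4 + (3/4)e^{-4\alpha t}$ and $P_{i,j}(t) = 1/4 - (1/4)e^{-4\alpha t}$ for i ≠ j. The sketch below builds the matrix so that reversibility and the property P(t1)P(t2) = P(t1+t2) can be checked; the function names and parameter values are illustrative assumptions.

```python
import math

NUCLEOTIDES = "ACGT"

def jc69_transition_matrix(alpha, t):
    """4x4 matrix of P_ij(t) under Jukes-Cantor with off-diagonal rate alpha."""
    decay = math.exp(-4 * alpha * t)
    same = 0.25 + 0.75 * decay
    different = 0.25 - 0.25 * decay
    return [[same if i == j else different for j in range(4)] for i in range(4)]

def matrix_product(a, b):
    """Plain 4x4 matrix product, used to check P(t1) P(t2) = P(t1 + t2)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]
```

Because the stationary distribution is uniform and the matrix is symmetric, time reversibility holds automatically in this model.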

History of Coalescent Approach to Data Analysis
[years missing]s: Genealogical arguments well known to Wright & Fisher.
1964: Crow & Kimura: Infinite Allele Model.
1968: Motoo Kimura proposes neutral explanation of molecular evolution & population variation. So do King & Jukes.
1971: Kimura & Ohta propose infinite sites model.
1972: Ewens' Formula: probability of data under infinite allele model.
1975: Watterson makes explicit use of "The Coalescent".
1982: Kingman introduces "The Coalescent".
1983: Hudson introduces "The Coalescent with Recombination".
1983: Kreitman publishes first major population sequences.

History of Coalescent Approach to Data Analysis (continued)
[year missing]: Griffiths, Ethier & Tavaré calculate site data probability under infinite site model.
[year missing]: Griffiths-Tavaré + Kuhner-Yamato-Felsenstein introduce highly computer-intensive simulation techniques to estimate parameters in population models.
[year missing]: Krone-Neuhauser introduce selection in the Coalescent.
[year missing]: Donnelly, Stephens, Fearnhead et al.: major accelerations in coalescent-based data analysis.
[year missing]: Several groups combine Coalescent Theory & Gene Mapping.
2002: HapMap project is started.

Basic Coalescent Summary
i. Genealogical approach to population genetics.
ii. "The Coalescent": generic probability distribution on allele trees.
iii. Combining "The Coalescent" with Allele/Mutation Models allows calculation of the probability of data.