PART II: Recombination and selection


1 PART II: Recombination and selection

2 Summary of assumptions so far
We have covered the role of chance (parent choice), demography (population size) and mutation in shaping genetic diversity. Neutral Wright-Fisher models: discrete generations; population size N; parents chosen at random; mutation probability m; sample history can be constructed. Parents are always chosen at random. This means no new mutations affect the reproductive success of parents, i.e. there is no “selection” for positive or negative changes. A gene, i.e. a segment of DNA, is always inherited as a complete unit from the chosen parent. This assumes no “recombination”. Recombination is a process allowing different positions along a segment of DNA to be inherited from distinct parents.

3 Summary of assumptions so far
Until now, the course has concentrated on models which are heavily simplified. Neutral Wright-Fisher models: parents are always chosen at random. This means no new mutations affect the reproductive success of parents, i.e. there is no “selection” for positive or negative changes. This is also called neutrality. A gene, i.e. a segment of DNA, is always inherited as a complete unit from the chosen parent. This assumes no “recombination”. Recombination is a process allowing different positions along a segment of DNA to be inherited from distinct parents.

4 Why relax these assumptions?
In fact, we are (obviously!) evolved to adapt to our environment. This process occurs through natural selection. Some new mutations are favoured, because those carrying them have more children on average. To quantitatively study evolution, we need models incorporating this idea.

5 Why relax these assumptions?
Our genome has essential functions. Many new mutations would disrupt this function (far more than confer useful new advantages), so must be prevented from becoming common in the population. This process also occurs through natural selection. Some new mutations are “deleterious”, because those carrying them have fewer children on average. Selection can act in both directions.
Disease               Population frequency
Sickle cell anemia    1 in 625 (African Americans)
Cystic fibrosis       1 in 2,000 (Europeans)
Tay-Sachs disease     1 in 3,000 (US Jewish population)
Haemophilia           1 in 10,000
Galactosemia          1 in 57,000

6 Example: Human data
Data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005). [Figure: D’ association measure between pairs of sites.] According to the assumptions so far, a region has a history given by a tree. We should not see any obvious decay of association between sites with distance. What’s going on?

7 1.1 Why relax these assumptions?
Recombination. In humans, and many other species, a process of recombination occurs. We have chromosome pairs, one inherited from each parent (father and mother). Only one of the two maternal (or paternal) copies is passed down to the child. Almost always, rather than choosing one or the other, a mosaic is constructed. This can mean different positions on a chromosome are inherited from different chromosomes in the parental generation, so they have different histories. Our models for genetic data need to allow for this. We will begin by thinking about recombination (without selection initially).

8 PART II: Recombination and selection
We will extend our theory to cover the other two main biological forces driving genetic variation, evolution, and e.g. disease risk.
Recombination: the effect of recombination on ancestry; detecting historical recombination; incorporating recombination into the coalescent framework; properties of the “ARG”; real inference of recombination rate.
Natural selection: the fate of individual mutations; modelling selection; properties of selected alleles.

9 1.2 Recombination model
Suppose we are thinking about a segment D of DNA, with S sites, in a single chromosome. If S is large, it is reasonable to think of this as a continuous segment D=[0,1]. In a single generation, at most one recombination can occur in D: with probability 1-r there is a single parent chromosome; with probability r two parents are chosen. When recombination occurs, we pick the (left) breakpoint B from a density function f on D. We will normally assume (wlog): In humans, the per site per generation recombination rate averages ~1x10^-8, versus a mutation rate of 1.3x10^-8.

10 We begin by considering a general population, including recombination
Generations are shown as discrete only for simplicity. How do we represent histories with recombination? Each chromosome chooses its parent in the previous generation: a single parent with probability 1-r, or two parents with probability r (denoted by a double arrow, with the left parent drawn as a single line and the right parent as a double line). The breakpoint has probability density function f. Later we will add additional modelling assumptions (random mating, etc).

14 We can trace ancestral histories in the new setting
At a recombination event, choose the appropriate ancestor. Consider site 1 (position 0 in [0,1]). Each chromosome chooses its parent in the previous generation: a single parent with probability 1-r, or two parents with probability r (denoted by a double arrow, with the left parent drawn as a single line and the right parent as a double line). The breakpoint has probability density function f.

15 Now consider site S (position 1 in [0,1])
At a recombination event, choose the appropriate ancestor: always choose the right hand ancestor at a recombination event. Each chromosome chooses its parent in the previous generation: a single parent with probability 1-r, or two parents with probability r (denoted by a double arrow, with the left parent drawn as a single line and the right parent as a double line). The breakpoint has probability density function f.

16 Site S/2 (position 0.5 in [0,1])
At a recombination event, choose the appropriate ancestor: the ancestor choice depends on the position of the recombination event. Each chromosome chooses its parent in the previous generation: a single parent with probability 1-r, or two parents with probability r (denoted by a double arrow, with the left parent drawn as a single line and the right parent as a double line). The breakpoint has probability density function f.
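The parent-choice rule and the site-by-site tracing described on these slides can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function names are hypothetical, the breakpoint density f is assumed uniform on [0,1], and parents are assumed to be drawn uniformly from a population of fixed size.

```python
import random

def choose_parents(n_chrom, r, rng):
    """Pick parents for one chromosome, one generation back.

    Returns (left_parent, right_parent, breakpoint); breakpoint is None
    when a single parent is chosen (probability 1 - r).
    """
    if rng.random() < r:
        # Recombinant: two parents plus a breakpoint B drawn from f.
        # Assumption: f is uniform on [0, 1].
        return rng.randrange(n_chrom), rng.randrange(n_chrom), rng.random()
    parent = rng.randrange(n_chrom)
    return parent, parent, None

def simulate_history(n_chrom, n_generations, r, seed=0):
    """history[g][c] holds the parent choice of chromosome c, g generations back."""
    rng = random.Random(seed)
    return [[choose_parents(n_chrom, r, rng) for _ in range(n_chrom)]
            for _ in range(n_generations)]

def trace_site(history, start, x):
    """Follow the ancestors of chromosome `start` at position x in [0, 1]."""
    lineage = [start]
    for generation in history:
        left, right, breakpoint = generation[lineage[-1]]
        # Material to the left of the breakpoint comes from the left parent.
        if breakpoint is None or x < breakpoint:
            lineage.append(left)
        else:
            lineage.append(right)
    return lineage
```

For example, `trace_site(simulate_history(8, 30, 0.3), 0, 0.5)` returns the chain of ancestors of chromosome 0 at the midpoint; tracing the same chromosome at positions 0 and 1 will generally give different chains once a recombination has been passed.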

17 1.3 Marginal trees
With recombination, we can still draw a genealogical tree at each site. At a position x in [0,1], we define the marginal tree T(x) to be the genealogical tree at x. In general, T(x) depends on x. The TMRCA can also change along the sequence. Tree change points are a subset of the recombination positions. [Figure: marginal trees T(0), T(0.5) and T(1) against a time axis.] In humans, genealogical trees are typically hundreds of thousands of years deep (tens of thousands of generations). For a recombination event at x, T(x-) and T(x+) can be, but are not always, different (see problem sheet). Question: Is this the best way to summarise information about the history of the sample?

18 1.4 The ancestral recombination graph
Individual trees for each site are cumbersome. They are not sufficient in general to reconstruct all historical recombination events, which is problematic if recombination is the focus of interest. The ancestral recombination graph (ARG) solves this problem (Griffiths 1991, Griffiths and Marjoram 1997, Hudson 1983). It provides an efficient way to record the history of a sample with recombination, without losing information. This is a directed, acyclic graph of degree three. Nodes correspond to ancestors of the sample.

19 1.5 The ARG
Each tip corresponds to an individual chromosomal segment. Join edges when ancestors coalesce. Every ancestral recombination graph corresponds uniquely to an ancestral history of the sample. [Figure: ARG with a time axis.]

20 1.5 The ARG
Each tip corresponds to an individual chromosomal segment. Join edges when ancestors coalesce. Split edges at recombination events; the left branch contributes material to the left of the break. Every ancestral recombination graph corresponds uniquely to an ancestral history of the sample. [Figure: ARG with a time axis and recombination breakpoints 0.2 and 0.9.]

21 1.5 The ARG
Each tip corresponds to an individual chromosomal segment. Join edges when ancestors coalesce. Split edges at recombination events; the left branch contributes material to the left of the break. Eventually a most recent common ancestor (MRCA) will be reached. Every ancestral recombination graph corresponds uniquely to an ancestral history of the sample. [Figure: ARG with a time axis and recombination breakpoints 0.6, 0.7, 0.2 and 0.9.]

22 1.6 Example ARGs
Sample size n=4, single recombination event. Recombination events (a) can change the tree “topology”; (b) can leave the tree “topology” unchanged but alter the times in the tree; (c) can leave the tree completely unchanged. [Figure: panels (a), (b), (c).]

23 1.7 (Embedded) Marginal trees
[Figure: ARG with recombination breakpoints 0.6, 0.7, 0.2 and 0.9, and the marginal trees T(0), T(0.5) and T(1) against a time axis.] Marginal trees are recovered from the graph by taking the appropriate branch at each recombination event.
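To make "taking the appropriate branch" concrete, here is a small sketch that stores an ARG as a dictionary and reads off the marginal tree at a position x. The three-sample graph, the node names and the dictionary layout are all hypothetical choices for illustration; they are not the ARG drawn on the slide, and unary nodes are not collapsed.

```python
# Hypothetical three-sample ARG: the lineage of sample "c" recombines at 0.5.
# Each node stores either a single "parent" or a recombination split
# ("breakpoint", "left", "right"); the root has no parent.
ARG = {
    "a": {"parent": "x1"},
    "b": {"parent": "x2"},
    "c": {"breakpoint": 0.5, "left": "x1", "right": "x2"},
    "x1": {"parent": "root"},
    "x2": {"parent": "root"},
    "root": {},
}

def marginal_parent(arg, node, x):
    """Parent of `node` for the material at position x (None at the root)."""
    edge = arg[node]
    if "breakpoint" in edge:
        # Left branch carries material to the left of the break.
        return edge["left"] if x < edge["breakpoint"] else edge["right"]
    return edge.get("parent")

def marginal_tree(arg, samples, x):
    """Return the child -> parent map of the marginal tree T(x)."""
    tree = {}
    for sample in samples:
        node = sample
        while node is not None and node not in tree:
            parent = marginal_parent(arg, node, x)
            if parent is not None:
                tree[node] = parent
            node = parent
    return tree
```

Here `marginal_tree(ARG, ["a", "b", "c"], 0.2)` groups c with a, while at position 0.8 c is grouped with b, so this single recombination changes the tree topology.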

24 1.8 Embedded subgraphs
To obtain the ARG for a subregion, say [a,b], we take the ARG for [0,1] and remove recombination events in [0,a) or (b,1], and respectively the left and right edges ancestral to these recombination events. These events occur outside [a,b]. They therefore cannot affect the history of this subregion, so must be outside the subregion ARG. Essentially we “drop” irrelevant edges. [Figure: ARG with breakpoints 0.2, 0.7, 0.6 and 0.9, reduced to the subregion [0.4,0.8].]

25 2.1 Mutations in the ARG
Suppose a mutation occurs in some sample ancestor. Add a mark to the ARG, at the appropriate position in [0,1], at the place corresponding to that ancestor. The entire mutational history can be placed on the graph. [Figure: ARG with recombination breakpoints 0.6, 0.7, 0.2 and 0.9 and mutations marked at positions 0.05, 0.3, 0.35, 0.4, 0.5, 0.65, 0.72, 0.75 and 0.8, with the resulting data table for sequences 1 to 8.]

26 2.2 The effect of recombination on data
Suppose we are interested in performing inference on how much recombination there has been. We cannot directly observe the ARG. Instead, we need to indirectly infer recombination using mutation patterns in data. Later we will investigate in depth stochastic models of the effect of recombination. These can be used to obtain parametric estimates of recombination rate parameters. An alternative approach is to not impose a particular model, but simply try to count how many recombination events occurred in a sample history.
Advantages: simple, easy to interpret in terms of counts, robust, requires few assumptions; provides insight into the relationship between data and recombination history.
Disadvantages: hard to interpret results in terms of underlying recombination parameters; misses many recombination events; difficult to quantify uncertainty about how many events occurred.

27 2.3 Reminder of infinite sites model
Definition 2.3.1: Infinitely-many-sites model. Mutations occur at positions on the DNA sequences never before mutant. Every mutation occurring in the coalescent tree on an edge occurs in all genes subtended below the edge. If we assume the infinite sites model, then you have seen the following, the “4-gamete test”.
Proposition 2.3.2: Compatibility of mutations with the point mutations assumption. An n × s 0-1 matrix is compatible with a gene tree if and only if the pattern of rows (0,0), (0,1), (1,0), (1,1) does not occur in any two columns and four rows. If the ancestral type is known and always denoted by 0, the first row of the pattern can be removed from the condition.
Question: Is this result respected if recombination occurs?

28 Example: recombination causes violation of the 4-gamete test
[Figure: ARG for sequences 1 to 4 with a recombination at 0.2 and mutations at 0.15 and 0.75.] Note: only mutations on these two branches can violate the 4-gamete test, and this occurs if and only if the blue mutation occurs to the left of 0.2, and the black mutation to the right of 0.2.

29 2.4 Detecting recombination events (Hudson and Kaplan, 1985)
Lemma 2.4: The 4-gamete test. Suppose we have variation data for n individuals at s sites, represented as an n × s 0-1 matrix. Under the infinite sites model, if the pattern of rows (0,0), (0,1), (1,0), (1,1) occurs in two columns corresponding to positions x and y, then at least one recombination event must have occurred in the sample history, in the interval (x,y). If the ancestral type is known and denoted by 0 at x and y, then the first row of the pattern can be removed from the condition.
Proof: We prove the contrapositive. Suppose there are no recombination events between x and y. Then the ancestral recombination graph for the interval [x,y] is simply a coalescent tree. Hence, by Proposition 2.3.2, the above pattern cannot occur in the data.
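The 4-gamete test is easy to apply mechanically. The sketch below checks a pair of columns of a 0-1 data matrix for the four gametes; the function names and the small example matrix are hypothetical, and rows are assumed to be sequences with columns as ordered sites.

```python
from itertools import combinations

def four_gamete_violation(data, i, j, ancestral_known=False):
    """True if columns i and j (0-based) show the forbidden pattern.

    With ancestral_known=True the ancestral type is assumed to be 0 at both
    sites, so the gamete (0, 0) need not be observed in the sample.
    """
    gametes = {(row[i], row[j]) for row in data}
    needed = {(0, 1), (1, 0), (1, 1)}
    if not ancestral_known:
        needed.add((0, 0))
    return needed <= gametes

def incompatible_pairs(data, ancestral_known=False):
    """All site pairs (i, j), i < j, that require a recombination between them."""
    n_sites = len(data[0])
    return [(i, j) for i, j in combinations(range(n_sites), 2)
            if four_gamete_violation(data, i, j, ancestral_known)]

# Hypothetical data: 4 sequences (rows) typed at 3 sites (columns).
example = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
```

On this example only the first two sites show all four gametes, so under the infinite sites model at least one recombination must have occurred between them.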

30 2.5 Hudson’s RM (Hudson and Kaplan 1985)
Suppose we have sites 1,2,..,10 and the dataset: [Table: sequences 1 to 8 typed at positions 0.05, 0.3, 0.35, 0.4, 0.5, 0.65, 0.72, 0.75, 0.8 and 0.9.] How many recombination events?

31 2.5 Hudson’s RM (Hudson and Kaplan 1985)
Proposition 2.5: Hudson’s RM. Under the infinite sites model with recombination, suppose we have data for n sequences at s (ordered) segregating sites 1,2,..,s. Then the following recursive procedure gives a minimum number of recombination events in the history of the sample, based on the results of the four gamete test.
Step 1: For all pairs (i,j), construct a matrix R where Rij=1 if sites i and j show all 4 gametes, and 0 otherwise.
Step 2: Set i=1, l=2 and RM=0.
Step 3: If max{Rkl: k=i,..,l-1}=1 then increment RM by 1 and set i’=l. Otherwise, set i’=i.
Step 4: If l=s, terminate. Otherwise, set i=i’, l=l+1 and return to step 3.
Remark: The idea here is to go from left to right, putting in a recombination whenever one is required by the 4 gamete test, where that recombination must have happened to the right of the furthest right recombination placed so far.
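The left-to-right scan in Steps 1 to 4 can be sketched as follows. This is an illustrative implementation under the assumptions that the data are a 0-1 matrix with sequences as rows and ordered segregating sites as columns, and that the scan stops at the last site; the function name and the example matrix are hypothetical.

```python
def hudson_rm(data):
    """Hudson and Kaplan's RM for a 0-1 matrix (rows = sequences)."""
    n_sites = len(data[0])

    def shows_all_four_gametes(a, b):
        # With 0-1 data, four distinct pairs means all four gametes occur.
        return len({(row[a], row[b]) for row in data}) == 4

    rm = 0
    i = 0  # left-most site (0-based) not yet covered by a placed event
    for l in range(1, n_sites):
        # Step 3: is a recombination required between some k >= i and l?
        if any(shows_all_four_gametes(k, l) for k in range(i, l)):
            rm += 1
            i = l  # later events must lie to the right of this one
    return rm

# Hypothetical data: 6 sequences at 3 sites.
toy = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
```

For this matrix, sites 1 and 2 show all four gametes, so one event is placed between them; sites 1 and 3 are also incompatible, but that pair is already covered by the event placed, so `hudson_rm(toy)` returns 1.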

32 Application of the algorithm
All other Rij=0. The values of i, RM and l at the start of each round are:
i     RM    l
1     0     2
2     1     3
2     1     4
2     1     5
2     1     6
6     2     7
7     3     8
7     3     9
7     3     10
7     3     11
11    4     12
11    4     13
11    4     14
11    4     15
11    4     16
-     5     -

33 Proof of proposition 2.5
The result is trivial if Rij=0 for all i and j. Otherwise, let the true minimum number of events based on the 4-gamete test be W. Suppose wlog that RM is incremented by 1 at RM steps corresponding to values $l_1, l_2, \ldots, l_{R_M}$ of l. Setting $l_0=1$, at these steps i therefore takes values $l_0, l_1, \ldots, l_{R_M-1}$ respectively, because i is reassigned the current value of l at each increase in RM. We prove first that $W \ge R_M$, then that $W \le R_M$. For each i, by construction $R_{k l_i}=1$ for some k with $l_{i-1} \le k < l_i$. Then there must be recombination in the interval $(l_{i-1}, l_i)$ for each i and, as there are RM such non-overlapping intervals, $W \ge R_M$. To prove $W \le R_M$, suppose we place RM recombination events along the sequence by placing one event in each interval $(l_{i-1}, l_i)$, $i=1,2,\ldots,R_M$. Supposing for a contradiction that this did not provide a solution, there must exist p, q such that $R_{pq}=1$ but no event is placed inside the interval (p,q). In the qth round of the algorithm, l=q, and so since $R_{pq}$ did not produce an increase in RM, we must not have considered this bound: $q > i_q > p$, where $i_q$ is the value of i in that round. This implies $i_q > 1$ and hence $i_q = l_m$ for some m>0. Thus there is a recombination placed in the interval $(l_{m-1}, l_m)$. However $l_m > p$, so $(l_{m-1}, l_m)$ contains an event placed within (p,q), a contradiction.

34 Example: Drosophila data
Chromosome 4 in three Drosophila species. Is there evidence for recombination? None is seen in thousands of “crosses”. Arguello et al. (MBE 2009) sequenced 80 genes. Recombination is definitely present, at a low rate. There is more recombination in D. simulans than in the other species. This suggests deeper ARGs (larger population size) for this species.

35 2.6 Properties of RM
RM provides a simple, constructible measure of the influence of recombination on a sample of sequences. This has led to its use in large real datasets by researchers. RM relies on mutations in suitable places to detect recombination events, so if the mutation rate is not very high, it typically drastically underestimates the number of recombination events (Hudson and Kaplan, 1985). Under a coalescent model, the expectation of RM grows extremely slowly with sample size n: no faster than log(log(n)). In general, recombination events are much more challenging to detect directly, and study, than mutation events. Better bounds are also available, which extend the ideas used to construct RM (Myers and Griffiths 2003, Hein 1990, Song and Hein 2004, 2005, Bafna and Bansal 2006, Lyngsø, Song et al. 2008, Liu and Fu 2008, and more).

36 2.7 Haplotype patterns and recombination events
Consider the following toy dataset. How many recombination events are required?
Sequence  Site 1  Site 2  Site 3
1         0       0       0
2         0       1       0
3         1       0       0
4         1       1       0
5         0       0       1
6         1       0       1
RM=1. It is clear that under the infinite sites model, the first event back in time must be a recombination event. No matter which of the sequences we decide to recombine, after this event there will still be 5 unchanged sequences. (Exercise) No matter what choice we made, these 5 sequences still indicate recombination (4-gamete test). So we need at least one more recombination event in the history of these sequences, and RM could be improved to 2. How can we do better? One approach is to use haplotype information.

37 2.8 The haplotype bound
Proposition 2.8: The haplotype bound. Under the infinite sites model with recombination, suppose we have data for n sequences at s segregating sites 1,2,..,s. Suppose that the n x s data matrix for these sequences has H unique rows, or haplotypes. Then a lower bound on the number of recombination events in the history of the sample is H-s-1.
Proof: Consider the ancestral recombination graph representing the history of the sample. Beginning with the ancestral sequence at the TMRCA, we can view our sample haplotypes as being created forward in time. Since there are H haplotypes, only one of which can be the ancestral type, there must be at least H-1 further events in the history creating novel types. Each mutation or recombination event can create at most one novel type. Coalescence events simply duplicate existing types (forward in time). By the infinite sites assumption there are s mutation events, so if R is the number of recombination events we must have R+s>=H-1.
Remark: From the proof, if the ancestral type is known, we can add it to our collection of haplotypes. Note also that the four gamete test is just the special case s=2.
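The bound H-s-1 is straightforward to compute for the full data or for any subset of sites. The sketch below is illustrative only; the function name, the optional `ancestral` argument and the example matrix are hypothetical. As a design choice of this sketch, only columns that actually vary within the chosen subset are counted in s, and negative values are reported as 0.

```python
def haplotype_bound(data, sites=None, ancestral=None):
    """Lower bound H - s - 1 on recombination events for the chosen sites.

    data      : 0-1 matrix, rows = sequences, columns = ordered sites
    sites     : iterable of 0-based column indices (default: all columns)
    ancestral : optional ancestral sequence, added as an extra haplotype
    """
    sites = list(range(len(data[0]))) if sites is None else list(sites)
    haplotypes = {tuple(row[k] for k in sites) for row in data}
    if ancestral is not None:
        haplotypes.add(tuple(ancestral[k] for k in sites))
    # Count only columns that segregate among the haplotypes considered.
    segregating = sum(1 for idx in range(len(sites))
                      if len({h[idx] for h in haplotypes}) > 1)
    return max(0, len(haplotypes) - segregating - 1)

# Hypothetical data: 6 sequences at 3 sites.
toy = [
    [0, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
```

With H=6 distinct rows and s=3 sites, `haplotype_bound(toy)` gives 6-3-1=2, whereas restricting to any two of the sites gives at most 1, which is the four gamete test.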

38 Example (toy) dataset revisited
Consider the following toy dataset. How many recombination events are required?
Sequence  Site 1  Site 2  Site 3
1         0       0       0
2         0       1       0
3         1       0       0
4         1       1       0
5         0       0       1
6         1       0       1
H=6, s=3, giving R>=6-3-1=2. This is the right answer here: a history with 2 events is possible (hint: recombine sequences 4 and 6 first). Note that given a dataset with s sites, we can apply proposition 2.8 to any subset of t of the s sites, and obtain a bound on the number of events between the first and last members of the subset. This will result in a lower bound matrix Rij with non-negative integer entries. We want to be able to combine bounds once again, to produce an overall bound for a region.

39 Combining bounds
[Figure: local bounds 1, 1, 2, 2, 2, 1 over intervals of sites.] For this set of bounds, HM=5. We could keep searching site subsets. Typically performance can be good if we use e.g. only subsets up to size 5, up to some maximal distance apart. Clearly, software is needed to calculate the bound!

40 2.9 HM
Proposition 2.9: HM. Under the infinite sites model with recombination, suppose the haplotype minimum gives a local bound matrix R where Rij is the best haplotype bound between sites i and j. Define HMij to be the minimum number of recombination events between sites i and j satisfying this set of bounds. Then the following is true: $HM_{1j} = \max_{1 \le i < j} \{ HM_{1i} + R_{ij} \}$, with $HM_{11}=0$. This recursive system can be used to obtain HM1j given HM12, HM13, ..., HM1(j-1), and hence provides an efficient means of obtaining HM1s, the overall lower bound on recombination events.
Proof: Let the true minimum be W. Note that the above construction means that HM1s is a sum of Rij terms corresponding to non-overlapping intervals. Thus, obviously $W \ge HM_{1s}$. To prove the converse statement, we construct a minimal placement of recombination events as follows.

41 2.9 HM
Proof continued: Define a vector of recombination counts in the s-1 mutation intervals with rj, the number of events between mutant sites j-1 and j, given by rj=HM1j-HM1(j-1). (Take HM11=0.) Supposing for a contradiction this does not satisfy the full bound set R={Rij}, we may pick j to be the minimal such where <Rij events are placed within (i,j). By the recursive formula in the construction: $HM_{1j} \ge HM_{1i} + R_{ij}$. But then $\sum_{k=i+1}^{j} r_k = HM_{1j} - HM_{1i} \ge R_{ij}$, contradicting the fact that <Rij events are placed within (i,j). Note that the proof provides an explicit possible solution for where recombination events are placed. This is usually non-unique: this solution corresponds to putting events as far “right” as possible.
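The combination step can be sketched as a short dynamic programme. The recursion used here, $HM_{1j} = \max_{1 \le i < j}\{HM_{1i} + R_{ij}\}$ with $HM_{11}=0$, is an assumption inferred from the proof (it is what makes $r_j = HM_{1j} - HM_{1(j-1)}$ satisfy every local bound). The function name, the dictionary layout of R and the example bounds are hypothetical.

```python
def combine_bounds(R, n_sites):
    """Combine local lower bounds into an overall bound HM_{1s}.

    R maps 1-based site pairs (i, j), i < j, to a local lower bound on the
    number of recombination events in (i, j); missing pairs count as 0.
    Returns (HM_{1s}, placement), where placement[j] is the number of events
    placed between sites j-1 and j.
    """
    hm = {1: 0}
    for j in range(2, n_sites + 1):
        # Assumed recursion: best bound ending at j over all split points i.
        hm[j] = max(hm[i] + R.get((i, j), 0) for i in range(1, j))
    placement = {j: hm[j] - hm[j - 1] for j in range(2, n_sites + 1)}
    return hm[n_sites], placement

# Hypothetical local bounds for 3 sites: the pair (1, 3) needs two events.
local_bounds = {(1, 2): 1, (2, 3): 0, (1, 3): 2}
```

For these bounds the result is 2, with one event placed in each of the two intervals. Summing the placement over any interval (i,j) recovers at least R[(i,j)], which is the property the proof relies on.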

42 The benefits of using more information
The following charts show that the expectation of the haplotype bound (solid lines) can greatly exceed that of RM (dotted lines), especially as sample size becomes large. These expectations were calculated using the coalescent with recombination, which we will come to soon (Myers and Griffiths 2003).

43 Example: the haplotype bound in humans
The following is based on real human mutation data for 10,000 bases around the LPL gene. We can plot the recombination density between pairs of sites as an x, y colour plot. Question: Is there a “hotspot” for recombination here? Caveat: Apparent clustering of recombination might be due to stochastic variation in histories. We need to model this explicitly.

44 Data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005)
[Figure: D’ association measure and Rh.]

45 Example: Humans versus chimps
Humans and chimpanzees are 98.6% similar at aligned genomic bases. These are similar plots, for aligned regions of the human and chimpanzee genomes (Winckler et al. 2005). Further (model-based) analyses confirm that recombination rates are very different between humans and chimpanzees genome-wide (Winckler et al. 2005, Ptak et al. 2005, Myers et al. 2009).

46 Example: Malaria
[Figure: panels for chromosome 1 and chromosome 7, Africa and Asia.] Malaria parasites appear to have a similar uneven distribution of recombination sites along their genomes (Mu et al., Nature Genetics 2010).

47 2.10 Conclusions on recombination detection
Direct detection of recombination events offers a very useful approach to understanding the influence of recombination on data and discovering the distribution of events along sequences. More sophisticated approaches still have been developed in recent years (Song and Hein 2004, 2005, Bafna and Bansal 2006, Lyngsø, Song et al. 2008, Liu and Fu 2008, and more). These give improvements over HM, though the improvements are modest. All strict minima miss the large majority of recombination events. In organisms with repeat mutation, approaches need to be adapted (Liu and Fu, 2008) and the problem is even tougher. A model for populations with recombination is vital to: recover more of the information from data; perform inference on underlying recombination parameters; estimate uncertainty, make statements about rate variation, make statements about particular sample histories, allow for demographic histories, selection, ...

48 3.0 The Wright-Fisher model revisited
We incorporate recombination in the Wright-Fisher model. The population has constant size 2N. Generations are discrete, with the next generation formed from the previous one: with probability 1-r, an individual chooses a single parent uniformly at random; with probability r, it is recombinant and chooses two parents at random and a recombination breakpoint (from the probability density function f); it can also mutate, with probability m, and choose a site to mutate. In the figure, two parents are denoted by a double arrow, with the left parent drawn as a single line and the right parent as a double line.

49 3.1 The history of a sample
Consider a sample of size n from the population. We will define ρ=4Nr and θ=4Nm. Consider the limit as N→∞ while ρ, θ remain constant. At some time back, suppose there are j ancestors of the sample remaining and consider the events in the previous generation.

50 Consider the probabilities of different possible events while j>1 ancestors remain:
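A sketch of these one-generation probabilities, assuming the scaled parameters $\rho = 4Nr$ and $\theta = 4Nm$ (chosen so that the per-edge rates match the $\rho/2$ and $\theta/2$ used for recombination and mutation later in the course) and a population of 2N chromosomes:

```latex
% Assumption: \rho = 4Nr and \theta = 4Nm are held fixed as N \to \infty,
% so r and m are both of order 1/N.
\begin{align*}
P(\text{coalescence}) &= \binom{j}{2}\frac{1}{2N} + O(N^{-2}), \\
P(\text{recombination}) &= jr + O(N^{-2}) = \frac{j\rho}{4N} + O(N^{-2}), \\
P(\text{mutation}) &= jm + O(N^{-2}) = \frac{j\theta}{4N} + O(N^{-2}), \\
P(\text{two or more events}) &= O(N^{-2}).
\end{align*}
% Measuring time in units of 2N generations multiplies each leading term
% by 2N, giving the limiting rates
\[
\binom{j}{2}, \qquad \frac{j\rho}{2}, \qquad \frac{j\theta}{2}.
\]
```

Under these assumptions the chance of more than one event in a generation is negligible in the limit, which is why the limiting process has one event at a time.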

51 Now, as for the coalescent without recombination, we measure time in units of 2N generations, define t=T/2N, and consider event probabilities as N→∞ while t remains fixed. Consider the waiting time back until some event occurs, while there are j ancestors. In the limit, this waiting time is exponentially distributed. When an event occurs, it is a coalescence, a recombination or a mutation with probability proportional to the corresponding rate. In the limit, this fully defines the ancestry process. By obvious symmetry, at coalescences a random pair coalesce, and a random sequence recombines or mutates at these respective events. This defines the coalescent with recombination:

52 3.2 The coalescent with recombination
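Definition 3.2: The coalescent with recombination (Hudson 1983, Griffiths 1991, Griffiths and Marjoram 1997). The coalescent with recombination is a Markov process describing the history, backward in time, of a sample of n genes drawn from a population. While j ancestors remain, j>1, the time to the next event has an exponential distribution with rate parameter $j(j-1)/2 + j\rho/2 + j\theta/2$. After sampling the next event time, an event is chosen: a coalescence with probability $(j-1)/(j-1+\rho+\theta)$, a recombination with probability $\rho/(j-1+\rho+\theta)$, or a mutation with probability $\theta/(j-1+\rho+\theta)$. This gives a distribution on big graphs.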

53 3.2 The coalescent with recombination
Definition 3.2: The coalescent with recombination (Hudson 1983, Griffiths 1991, Griffiths and Marjoram 1997). At recombination events, the breakpoint is chosen using pdf f. In drawing the graph, coalescence events are represented as edge joins backward in time, recombination events as splits, and mutation events are marked as points on the edges. Given a particular mutation model (specified forward in time) we first choose the ancestor type, and then choose a new mutant according to the model at each mutation point, based in general on the type of the edge immediately above the mutation event. If we are not interested in recording mutations, or are investigating the genealogical relationships alone, we can simply set θ=0. We usually terminate the process the first time j=1. The first ancestor of the sample where j=1 is the grand most recent common ancestor of the sample.

54 3.2 The coalescent with recombination
[Figure: example ARG with recombination breakpoints, mutation positions, and waiting times W1, W3, ..., Wm marked.]

55 3.3 Properties of the coalescent with recombination
We have shown that the coalescent with recombination is the limit process (as N becomes large) which describes the history of a sample drawn from a constant size Wright-Fisher model. It also arises as a limit process in many other models, with continuous or discrete generations. ρ=0 corresponds to the standard coalescent. The number of ancestors j can be thought of as a random walk. The coalescence rate grows quadratically with j while the recombination rate grows only linearly with j. Thus eventually the random walk will hit j=1 with probability 1 (exercise sheet). The expected number of recombination events before this happens satisfies the recursion (Exercise; Ethier and Griffiths 1990)

56 3.4 Description in terms of Poisson processes
We can think of the coalescent with recombination in terms of independent Poisson processes on edges and pairs of edges This construction is helpful in theoretical calculations and obtaining subgraphs For this course, we only need to restate (these facts were also used in the earlier part of the course) two general properties of homogeneous Poisson processes on the real line. Here N(t) is the number of events before time t.

57 3.4 Description in terms of Poisson processes
Exactly as without recombination, we can fully construct the ancestral recombination graph using independent Poisson processes in reverse time: each of the j(j-1)/2 pairs of edges independently coalesces as a Poisson process with rate 1; each of the j edges mutates at rate θ/2; each of the j edges recombines at rate ρ/2. Events in the Poisson processes are “racing” each other. To prove this gives the correct graph, we simply need to show it yields the correct rates. By fact 3.4.2, while j ancestors remain, events occur as a Poisson process with total rate $j(j-1)/2 + j\rho/2 + j\theta/2$. The time to the first event therefore has the correct exponential distribution. When the event occurs, it is e.g. a coalescence (between a random pair of edges) with probability $(j-1)/(j-1+\rho+\theta)$.
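The "race" between Poisson processes can be sketched by simulating only the number of ancestors j in the big ARG. This is a simplified illustration, not a full ARG simulator: it records event times and types but not which edges are involved, the function name is hypothetical, and the symbols `rho` and `theta` are assumed to be the scaled recombination and mutation rates, so that each edge recombines at rate rho/2 and mutates at rate theta/2 while each pair coalesces at rate 1.

```python
import random

def simulate_ancestor_counts(n, rho, theta, seed=0):
    """Simulate (time, event, j_after) until a single ancestor remains."""
    rng = random.Random(seed)
    j, t = n, 0.0
    events = []
    while j > 1:
        coal = j * (j - 1) / 2      # one rate-1 process per pair of edges
        rec = j * rho / 2           # one rate-rho/2 process per edge
        mut = j * theta / 2         # one rate-theta/2 process per edge
        total = coal + rec + mut
        t += rng.expovariate(total)  # time until the first process fires
        u = rng.random() * total     # which process won the race
        if u < coal:
            kind = "coalescence"
            j -= 1
        elif u < coal + rec:
            kind = "recombination"
            j += 1
        else:
            kind = "mutation"
        events.append((t, kind, j))
    return events
```

Because coalescence adds rate quadratically in j while recombination adds it only linearly, the loop terminates with probability 1; every run satisfies (number of coalescences) minus (number of recombinations) equal to n-1.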

58 3.4 Description in terms of Poisson processes
[Figure: example ARG with recombination breakpoints, mutation positions, and waiting times W1, W2, W3, ..., Wm marked.]

59 3.5 Subgraphs
In 1.8, we saw that we can construct the ARG for a subregion [a,b] by ignoring all recombination (and mutation) events outside [a,b]. If recombination and mutation are uniform, we construct a graph by starting with n sequences, and backward in time introducing: recombination events at rate ρ(b-a)/2 per edge; mutation events at rate θ(b-a)/2 per edge; coalescence at rate 1 per pair of edges. Thus the ARG for a subregion is (of course) distributed according to the coalescent with recombination for the smaller region.
“Small ARG”: In certain settings, we can gain efficiency by only following the history of specific branches contributing to genetic variation, building a coalescent using the Poisson process rates. Edges (or recombinations producing edges) carrying no genetic material passed on to the sample, and edges carrying only material that has reached an MRCA, need not be followed. Similarly, mutations outside ancestral material need not be simulated. This graph can be produced directly (Hudson 1983). It can be much smaller than the “big ARG”, and is preferred for simulation for this reason.

60 Supplementary remark: small ARG in the coalescent
Simulate directly by having different rates on different lineages in the past. We can measure the coalescence, mutation and recombination rates. [Figure: ARG with breakpoints 0.6, 0.7, 0.2, 0.9, 0.85 and 0.8, highlighting a recombination that the small ARG does not include.] Simulation of the small graph is efficient (Hudson 1991): we avoid considering ancestors sharing no material with the sample.

61 3.6 Marginal trees revisited
[Figure: marginal trees T(0), T(0.5) and T(1) against a time axis.] Marginal trees are recovered from the graph by taking the appropriate branch at each recombination event. Note the marginal tree at x is the limit as δ tends to 0 of the subgraph on [x,x+δ]. In this subgraph, line pairs coalesce at rate 1, so while j ancestors remain the total coalescence rate is j(j-1)/2. Lines recombine at rate ρδ/2 per edge, so in the limit there is no recombination and the marginal tree at x is described by the usual coalescent. (Actually this is obvious, because we could make the tree at x based on the large size limit of a finite Wright-Fisher population directly, in which case recombination would not occur.)

62 3.7 Theoretical results for the coalescent with recombination (?)
The coalescent with recombination is much harder to derive exact results for than the coalescent These are mainly restricted to samples of size 2, or the “big ARG”, which contains some ancestors unrelated to the sample In other settings, we rely on Numerical recursions to solve Lower and upper bounding of solutions Analytic approximation of solutions We will see examples of these settings and approaches For additional analytical results, see Durrett, and Wakeley, and references therein (important papers include Hudson (1983), Hudson and Kaplan (1985), Ethier and Griffiths (1990), Griffiths and Marjoram (1997), Wiuf and Hein (1999) and others)

63 3.8 Mean and variance of the number of segregating sites
Assume the infinite sites model and a uniform mutation rate along [0,1].
Let us define S_n to be the number of mutation events in a sample of size n that occur in ancestral material and prior to the MRCA at their position.
Suppose the region consists of m discrete sites, each of which mutates at rate q/(2m), and between each adjacent pair of which recombination occurs at rate r/(2(m-1)). The continuous model is the limit as m→∞.
Define T_i to be the total tree length at site i. Then conditional on T_1, T_2, ..., T_m, the total number of mutations is a sum of independent Poisson random variables, so is Poisson with mean
$$\frac{q}{2m}\sum_{i=1}^{m} T_i .$$

64 3.8 Mean and variance of the number of segregating sites
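Thus if T_{ij} is the time while j ancestors remain in tree i, so that E(T_{ij}) = 2/(j(j-1)):
$$E(S_n) = \frac{q}{2m}\sum_{i=1}^{m} E(T_i) = \frac{q}{2m}\sum_{i=1}^{m}\sum_{j=2}^{n} j\,E(T_{ij}) = \frac{q}{2m}\sum_{i=1}^{m}\sum_{j=2}^{n}\frac{2}{j-1} = q\sum_{j=1}^{n-1}\frac{1}{j},$$
so the mean number of segregating sites is unchanged relative to the no recombination case.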

65 3.8 Mean and variance of the number of segregating sites
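For the variance, note
$$\mathrm{Var}(S_n) = E\big[\mathrm{Var}(S_n \mid T_1,\dots,T_m)\big] + \mathrm{Var}\big[E(S_n \mid T_1,\dots,T_m)\big] = \frac{q}{2m}\sum_{i=1}^{m}E(T_i) + \frac{q^2}{4m^2}\sum_{i=1}^{m}\sum_{k=1}^{m} f_n\!\left(\frac{r\,|i-k|}{m-1}\right),$$
where f_n(z) is the covariance in tree times between sites a distance z/2 recombination units apart.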

66 3.8 Mean and variance of the number of segregating sites
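We have, letting m→∞,
$$\mathrm{Var}(S_n) = q\sum_{j=1}^{n-1}\frac{1}{j} + \frac{q^2}{4}\int_0^1\!\!\int_0^1 f_n(r\,|x-y|)\,dx\,dy = q\sum_{j=1}^{n-1}\frac{1}{j} + \frac{q^2}{2}\int_0^1 (1-z)\,f_n(rz)\,dz .$$
It is clear that we expect f_n to decrease with r, and further f_n(z)→0 as z→∞, so
$$\mathrm{Var}(S_n) \to q\sum_{j=1}^{n-1}\frac{1}{j} \quad \text{as } r\to\infty .$$
The variance is reduced relative to the no recombination case. (Hudson 1983, Griffiths and Marjoram 1997)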
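To make the two extremes concrete, the sketch below computes the mean of S_n and the variance at the two ends of the recombination range. It assumes the standard no-recombination result Var(S_n) = q(1 + 1/2 + ... + 1/(n-1)) + q^2(1 + 1/4 + ... + 1/(n-1)^2), which is not stated on the slide, and takes the large-r limit to be the Poisson term alone. Function names are illustrative.

```python
def partial_sum(n, power=1):
    """sum_{j=1}^{n-1} 1 / j**power."""
    return sum(1.0 / j ** power for j in range(1, n))


def mean_segregating_sites(n, q):
    """E(S_n) = q * sum_{j=1}^{n-1} 1/j, the same with or without recombination."""
    return q * partial_sum(n)


def variance_limits(n, q):
    """Return (limit of Var(S_n) as r -> infinity, Var(S_n) at r = 0)."""
    poisson_part = q * partial_sum(n)
    # Assumed standard result: the r = 0 variance adds q^2 * sum 1/j^2.
    no_recombination = poisson_part + q ** 2 * partial_sum(n, power=2)
    return poisson_part, no_recombination
```

For any finite r the variance lies between the two returned values, which is the sense in which recombination reduces the variance.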

67 3.9 Mean and variance of the number of recombination events
Let R_n be the number of recombination events in a sample of size n that occur in ancestral material, and prior to the MRCA at their position. It was similarly shown (Hudson and Kaplan, 1985) that
$$E(R_n) = r\sum_{j=1}^{n-1}\frac{1}{j} .$$
Note that this expectation is different from the expected number of events, E_n, in the big ARG:
$$E_n = r\int_0^1 \frac{1-v^{n-1}}{1-v}\,e^{r(1-v)}\,dv .$$
This is because events in the big ARG can happen outside ancestral material. The difference is, though, bounded as n→∞ (problem sheet).
How can we calculate f_n(z)? This is actually only reasonable analytically for n=2.
Mention problem sheet and “small ARG”
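The two expectations can be compared numerically. The sketch below assumes the forms E(R_n) = r(1 + 1/2 + ... + 1/(n-1)) for events in ancestral material and E_n = r ∫ (1 - v^(n-1))/(1 - v) exp(r(1 - v)) dv over [0,1] for the big ARG; both are standard results quoted as assumptions here, and the function names are illustrative. The integral is evaluated with a simple midpoint rule.

```python
import math


def expected_small_arg_recombinations(n, r):
    """E(R_n) = r * sum_{j=1}^{n-1} 1/j (events within ancestral material)."""
    return r * sum(1.0 / j for j in range(1, n))


def expected_big_arg_recombinations(n, r, steps=2000):
    """Midpoint-rule value of r * int_0^1 (1 - v^(n-1))/(1 - v) * exp(r(1 - v)) dv."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        # (1 - v^(n-1)) / (1 - v) written as a finite geometric sum
        geometric = sum(v ** k for k in range(n - 1))
        total += geometric * math.exp(r * (1.0 - v))
    return r * total * h
```

For n = 2 the integral reduces to exp(r) - 1, which gives a quick check of the quadrature. The big-ARG value exceeds the small-ARG value for every r > 0, because exp(r(1 - v)) > 1 on the integration range.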

68 3.10 Covariance in ancestry times
[Figure: T(z)]
f_2(z) is defined as the covariance in total marginal tree lengths for two sites a distance z in recombination units apart.
We can focus on the small ARG subgraph for a region [0,1] with overall r=z. Let the coalescence times at 0 and 1 be T_1, T_2. The tree lengths are then 2T_1, 2T_2, so, using E(T_1) = E(T_2) = 1:
$$f_2(r) = \mathrm{Cov}(2T_1, 2T_2) = 4\,\big(E(T_1T_2) - E(T_1)E(T_2)\big) = 4\,\big(E(T_1T_2) - 1\big),$$
and we “simply” need E(T_1T_2) for sites a distance r apart.
We sketch in the supplement how this quantity is obtained, to illustrate the important approach of constructing equation systems.
Idea: ignoring mutations, condition on the first event back in time that occurs in the ARG for these two sequences. This is a recombination or coalescence. Repeat this.

69 f2(r)

70 3.10 Supplement I: ancestry time covariance
[Figure: T(z)]

71 3.10 Supplement I: ancestry time covariance
2. Note that the conditional expectation term corresponds to the expectation for a new state, immediately following a recombination event. By the Markov property of ARGs, this is the expected product if we started in this state (looking back in time).
Label the original state 1, and the new state following a recombination event 2.
Define E_1 to be the expectation we seek, and E_2 the corresponding expectation for the new state:
We need to consider additional potential states to form a complete system of equations. For any such state s, we can write the following, using the argument on the previous slide. If λ_s is the total event rate for state s:
[Figure: transition diagram for states 1, 2 and 3, with rates r and 1]

72 3.10 Supplement continued
[Figure: transition diagram for states 1, 2 and 3, with rates r and 1]
We can build a graph with vertices corresponding to particular states, and rates between states. Colour positions red if an MRCA is reached. Such states have E(T_1T_2)=0. This allows us to construct a system of equations:
[Figure: transition graph on states 1 to 6, with edge rates r, r/2, 4 and 1]
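The equation system described in this supplement can be sketched as follows, under stated assumptions. Only the three states in which neither site has reached its MRCA need an unknown; they are labelled A, B and C here, which is an illustrative labelling rather than the slide numbering. Conditioning on the first event, with waiting time tau at total rate lam, gives E[(tau + T1')(tau + T2')] = 2/lam^2 + E(T1' + T2')/lam + E(T1'T2'), where a site that has not yet reached its MRCA has remaining expected time 1. Exact rational arithmetic makes it easy to compare with the closed form (r^2 + 14r + 36)/(r^2 + 13r + 18), a standard published result that does not appear in this transcript.

```python
from fractions import Fraction


def expected_product(r):
    """E(T1*T2) for n = 2, by first-step analysis over three states.

    A: 2 lineages, each carrying both sites
    B: 3 lineages, one carrying both sites and one carrying each single site
    C: 4 lineages, two carrying each single site
    """
    r = Fraction(r)
    lam_a, lam_b, lam_c = 1 + r, 3 + r / 2, Fraction(6)

    # A: coalescence (rate 1) resolves both sites; recombination (rate r) leads to B.
    a0 = 2 / lam_a**2 + (r / lam_a) * 2 / lam_a
    a_b = r / lam_a
    # B: rate 1 back to A, rate 1 resolves site 1, rate 1 resolves site 2, rate r/2 to C.
    b0 = 2 / lam_b**2 + (2 * 1 + 1 + 1 + 2 * (r / 2)) / lam_b**2
    b_a, b_c = 1 / lam_b, (r / 2) / lam_b
    # C: rate 4 back to B, rate 1 resolves site 1, rate 1 resolves site 2.
    c0 = 2 / lam_c**2 + (2 * 4 + 1 + 1) / lam_c**2
    c_b = 4 / lam_c

    # Substitute E_A = a0 + a_b*E_B and E_C = c0 + c_b*E_B into the equation for E_B.
    e_b = (b0 + b_a * a0 + b_c * c0) / (1 - b_a * a_b - b_c * c_b)
    return a0 + a_b * e_b


def covariance_tree_lengths(r):
    """f_2(r) = 4 * (E(T1*T2) - 1)."""
    return 4 * (expected_product(r) - 1)
```

At r = 0 the two sites share one coalescence time, so E(T1T2) = 2 and f_2(0) = 4; as r grows the product tends to 1 and the covariance to 0.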

73 Note that the covariance decreases as the recombination rate increases.
A similar system of recursions can be calculated for n>2. In practice, the solution is extremely messy. Simulation is another approach to directly estimate the covariance in tree times (Hudson, 1983). Often, when n>2 we rely on bounding quantities of interest.
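As an illustration of the simulation route for n = 2, the sketch below simulates the two-site ancestry as a jump process and estimates Cov(T1, T2) by Monte Carlo. The state labels A, B, C (2, 3 and 4 lineages with neither site at its MRCA) and all function names are illustrative assumptions. Once one site reaches its MRCA, the other site is carried by two lineages that coalesce at rate 1. The estimate can be compared with the standard closed form (r + 18)/(r^2 + 13r + 18), which is quoted here as an assumption.

```python
import random


def simulate_pair_times(r, rng):
    """Draw (T1, T2), the coalescence times at the two sites, for a sample of size 2."""
    t = 0.0
    state = "A"
    while True:
        if state == "A":
            total = 1.0 + r
            t += rng.expovariate(total)
            if rng.random() * total < 1.0:
                return t, t                        # both sites coalesce together
            state = "B"                            # one lineage recombines
        else:
            total = 3.0 + r / 2.0 if state == "B" else 6.0
            t += rng.expovariate(total)
            u = rng.random() * total
            if u < 1.0:                            # site 1 reaches its MRCA first
                return t, t + rng.expovariate(1.0)
            if u < 2.0:                            # site 2 reaches its MRCA first
                return t + rng.expovariate(1.0), t
            if state == "B":
                state = "A" if u < 3.0 else "C"    # rejoin (rate 1) or recombine (rate r/2)
            else:
                state = "B"                        # one of the 4 mixed pairs coalesces


def estimate_covariance(r, reps, seed=0):
    """Monte Carlo estimate of Cov(T1, T2)."""
    rng = random.Random(seed)
    s1 = s2 = s12 = 0.0
    for _ in range(reps):
        t1, t2 = simulate_pair_times(r, rng)
        s1 += t1
        s2 += t2
        s12 += t1 * t2
    return s12 / reps - (s1 / reps) * (s2 / reps)
```

Multiplying the estimate by 4 gives the covariance in total tree lengths for n = 2.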

74 3.11 Number of distinct MRCAs
[Figure: T(z)]
As we saw in the previous example, the time to the most recent common ancestor of individual marginal trees can vary along a sequence with recombination.
How many different MRCAs do we expect along a sequence? The answer is: surprisingly few.
Consider a small interval [x,x+d]. With high probability there is at most one recombination event on the graph for this region (the probability of two or more is of order d^2).
For a recombination while j ancestors remain, what is the probability it changes the MRCA?

75 3.11 Number of distinct MRCAs
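One or other of the (coloured) recombinant edges must not coalesce with the other edges while >2 edges remain. The probability of this is combinatoric: just after the recombination there are j+1 edges, and a given edge avoids every coalescence until 2 edges remain with probability
$$\prod_{i=3}^{j+1}\left(1-\frac{2}{i}\right) = \frac{2}{j(j+1)} .$$
The two recombinant edges cannot both do so, so the probability that the MRCA changes is 4/(j(j+1)).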

76 3.11 Number of distinct MRCAs
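Thus, writing T_j for the time while j ancestors remain, with E(T_j) = 2/(j(j-1)), and using the probability 4/(j(j+1)) that a recombination while j ancestors remain changes the MRCA, the expected number of TMRCA changes in [x,x+d] is
$$\sum_{j=2}^{n} \frac{rd}{2}\, j\, E(T_j)\,\frac{4}{j(j+1)} = rd\sum_{j=2}^{n}\frac{4}{(j-1)\,j\,(j+1)} = rd\left(1-\frac{2}{n(n+1)}\right),$$
and the expected number in [0,1], letting d=1/m→0, is
$$r\left(1-\frac{2}{n(n+1)}\right).$$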
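The sum over j can be checked with exact arithmetic. The sketch below assumes that a recombination occurring while j ancestors remain changes the MRCA with probability 4/(j(j+1)), and that the expected number of recombinations while j ancestors remain is (r/2) j E(T_j) with E(T_j) = 2/(j(j-1)). The function name is illustrative.

```python
from fractions import Fraction


def expected_mrca_changes(n, r):
    """Expected number of TMRCA changes along [0,1] for a sample of size n."""
    total = Fraction(0)
    for j in range(2, n + 1):
        # j edges, each recombining at rate r/2, for expected time 2/(j(j-1))
        recombinations = Fraction(r) * j / 2 * Fraction(2, j * (j - 1))
        p_change = Fraction(4, j * (j + 1))   # assumed probability of an MRCA change
        total += recombinations * p_change
    return total
```

The result equals r(1 - 2/(n(n+1))), which stays below r however large the sample, consistent with the remark that surprisingly few distinct MRCAs are expected.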

77 4.0 Supplement II: Inference about recombination rate
Given variation data from a population, we seek to perform inference on the processes producing the data
One of the most important parameters in human biology is the recombination rate
Reflects the real biological process of recombination
Recombination is required for meiosis to take place
Recombination can cause disease when it goes wrong (by deleting, duplicating or inverting segments of the genome)
Recombination keeps populations healthy, by allowing elimination of deleterious mutations
Despite this, there is much we don’t know! The recombination rate
Can vary hugely along a sequence
Determines association between loci in the population
Is hard to measure directly, because recombination occurs on average only ~1 in 100,000,000 meioses between any pair of successive nucleotides in the genome
Can be measured indirectly, by parametric analysis of variation data
Researchers in Oxford, and elsewhere, have developed such parametric approaches (Li and Stephens, 2003; Ptak et al. 2005; Hudson 2001, McVean 2002, McVean et al. 2004)
One method uses the “composite likelihood”, which approximates the likelihood of the data given a (variable) recombination rate, then estimates this rate using the likelihood

78 Data for ENR131, Chromosome 2q, Chinese and Japanese population sample (The International HapMap Consortium, Nature 2005)

79 4.5 Findings using the “composite likelihood” (I)
One of the challenges in human genetics is that there is a very high volume of data
For example, the following is based on data for over 4 million binary mutations, typed in 270 humans from four populations
There is tremendous power in the data, but analysis methods must be sufficiently fast, requiring approximation
[Figure: recombination estimates for all of chromosome 12]
The inferred patterns of recombination are extremely uneven (>80% of recombination in 10-20% of sequence).
Over 30,000 hotspots identified genome-wide, via the composite likelihood (Myers et al. 2005).

80 4.5 Findings using the “composite likelihood” (II)
Downstream, one can use the places where recombination clusters, termed “hotspots”, to ask if there are features of DNA sequence that specify hotspot locations
None previously identified in any mammal, but this is powerful data
~30,000 hotspots used genome-wide, and their DNA sequence compared to the DNA sequence of “cold” regions where there is little or no recombination
It turns out there is a difference. A particular “word” in the DNA codes for there being a hotspot at a location (Myers et al. 2005, 2008) (the code is fuzzy):
...CTTCCGCCATGATTGTGAGGCCTCCCTAGCCACGTGGAACTGTGAGT...
Since then, researchers have been able to find a new part of the cellular machinery (a “protein”, PRDM9) that recognises this word, and turns on recombination in hotspots (Myers et al. 2009, Baudat et al. 2009, Parvanov et al. 2009)
PRDM9 is different in chimps, explaining their different hotspots, and has remarkable properties
So: there is a close relationship between underlying biology, and variation patterns in data

81 4.6 Recombination summary
Recombination is a powerful, fundamental force that has shaped both our current patterns of genetic variation, and our genomes themselves
The coalescent with recombination is the key model enabling us to understand the relationship between recombination and genealogical histories, and patterns in variation data
Inference under this model is challenging, but creative approaches have yielded workable solutions to this problem
Non-parametric and parametric approaches both have something to offer and often largely agree in findings

