Enumerating, Sampling and Counting: some illustrative cases in biology
[ I ] Enumerating discrete structures
[ II ] Sampling and searching
[ III ] Nested sampling for counting

Olivier Martin, Laboratoire de Physique Théorique et Modèles Statistiques and UMR de Génétique Végétale, University of Paris-Sud

[ I ]: Enumerating Discrete Structures
Illustrative case: trees describing pedigrees. O. Martin and F. Hospital, Genetics (2004)
- In a breeding program, one wants to (optimally) cross a collection of "parents" to produce an ideal genome, but the mixing of the genes (Mendelian genetics) is probabilistic and depends on the distances between them.
- General framework: each individual in the parental population has one good gene (resistance to one disease), and the "ideotype" must accumulate all of these in one genome.
- The crossing of two parents should pass on their good genes to at least one offspring.

Transmission of genes
- H(1)(2): s1 = 1, s2 = 2, s = 1,2
- H(3)(4): s1 = 3, s2 = 4, s = 3,4
- H(12)(34): s1 = 1,2, s2 = 3,4, s = 1,2,3,4
We impose that a gamete accumulates all the good genes of the two chromosomes of its parent.

Example of a simple pedigree
P1, P2, P3: founder parents; I*: ideotype

Representation of a pedigree
Pedigrees differ by:
- the tree structure
- the choice of parents
(Figure: three example pedigrees on founder parents P1 to P4, with leaf orders P1 P2 P3 P4, P1 P2 P3 P4 and P1 P4 P2 P3.)

Particular cases of pedigrees (figure drawn for n = 8)
- Regular pyramid: min height = log_2(n) = 3
- Cascade: max height = n - 1 = 7

Pedigree = binary leaf-labeled tree
(Figure: a pedigree on six founder parents. Leaves P1 to P6 at level 0; nodes H(1)(2), H(3)(4), H(5)(6), H(12)(34) and H(1234)(56) on levels 1 to 3.)

Questions
- How to count the number of distinct pedigrees?
- How to enumerate them by computer for further use?
- How to sample them uniformly?
- How to find the "optimal" pedigree, given that each pedigree has a cost?

Counting the number of pedigrees
n:    3, 4, 5, 8, 10, 20
A(n): 3, 15, 105, 135135, 3.4 × 10^7, 8.2 × 10^21
For n genes, one has A(n) = (2n - 3)!! pedigrees (by recurrence equations).
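The table can be checked with a minimal Python sketch. The function name `count_pedigrees` is ours, not from the talk; it simply evaluates the double factorial (2n - 3)!! as the product of the odd numbers up to 2n - 3.

```python
def count_pedigrees(n):
    """Number of pedigrees on n genes, A(n) = (2n - 3)!!, for n >= 2."""
    if n < 2:
        raise ValueError("need at least two founder parents")
    result = 1
    for odd in range(1, 2 * n - 2, 2):  # 1, 3, ..., 2n - 3
        result *= odd
    return result


for n in (3, 4, 5, 8, 10, 20):
    print(n, count_pedigrees(n))
```

One standard recurrence consistent with this formula is A(n) = (2n - 3) A(n - 1): a new leaf can be attached on any of the 2n - 4 edges of a rooted binary tree with n - 1 leaves, or above its root. The talk does not say which recurrence the authors used.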

Enumeration of all pedigrees
A pedigree cumulating n genes is obtained by fusing two sub-pedigrees:
- one cumulating p genes
- one cumulating (n - p) genes
(Figure: two sub-pedigrees, with p genes and n - p genes, joined at the root.)

An algorithm for constructing all pedigrees
Suppose all sub-pedigrees of height at most h are known; one can then generate all those of height h+1:
- Examine all pairs of sub-pedigrees {P1, P2} of heights h1 = h and h2 ≤ h.
- If P1 and P2 have no good gene in common, fuse them to form a sub-pedigree P of height h+1.
- If P cumulates all good genes, keep it as a complete pedigree; otherwise add it to the list of sub-pedigrees of height h+1.
Repeat for the next height until h+1 = n-1.
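The height-by-height construction can be sketched in Python as follows. This is an illustration under our own assumptions, not the authors' code: genes are labeled 1..n, a leaf is represented by its gene label, a fused sub-pedigree by a pair `(taller_subtree, other_subtree)`, and the name `enumerate_pedigrees` is hypothetical.

```python
from itertools import combinations


def enumerate_pedigrees(n):
    """Return all complete pedigrees on genes 1..n, built height by height."""
    full = frozenset(range(1, n + 1))
    # by_height[h] holds the incomplete sub-pedigrees of height h
    # as (gene set, tree) pairs; height 0 is the founder parents.
    by_height = [[(frozenset([g]), g) for g in range(1, n + 1)]]
    complete = []
    for h in range(n - 1):  # build height h + 1, up to n - 1
        top = by_height[h]
        lower = [p for level in by_height[:h] for p in level]
        # Unordered pairs with h1 = h and h2 <= h, each taken once.
        pairs = list(combinations(top, 2))
        pairs += [(a, b) for a in top for b in lower]
        new_level = []
        for (genes_a, tree_a), (genes_b, tree_b) in pairs:
            if genes_a & genes_b:
                continue  # a good gene in common: do not fuse
            genes = genes_a | genes_b
            tree = (tree_a, tree_b)
            if genes == full:
                complete.append(tree)
            else:
                new_level.append((genes, tree))
        by_height.append(new_level)
    return complete


print(len(enumerate_pedigrees(4)))  # number of pedigrees found for 4 genes
```

Because every tree of height h+1 has a unique unordered pair of root subtrees, each pedigree is produced exactly once, so the length of the returned list can be compared with (2n - 3)!!.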

Working of the algorithm (figure sequence: sub-pedigrees are built level by level, from h=0 to h=1, h=2 and h=3)

Example: cascade with 4 genes

Optimal pedigrees: search by pruning the enumeration (branch and bound)
- Of all the ways to produce a given combination of good genes, keep only the best sub-pedigree.
- Enumeration: one can treat up to 14 genes. Branch and bound: up to 22 genes.
- Case of "adjacent" cascades: dynamic programming determines the optimal pedigree in O(n^2) operations.
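The idea of keeping only the best sub-pedigree for each combination of good genes can be illustrated with a memoized recursion over gene sets. This is a sketch under loud assumptions, not the authors' branch and bound: the talk does not specify the cost, so `cross_cost(left, right)` is a hypothetical user-supplied function, and we assume that costs add up over the crosses of a pedigree.

```python
from functools import lru_cache
from itertools import combinations


def best_pedigree(n, cross_cost):
    """Return (cost, tree) of a cheapest pedigree on genes 1..n.

    cross_cost(left, right) is a hypothetical cost of crossing two
    sub-pedigrees carrying the gene sets left and right (an assumption).
    """

    @lru_cache(maxsize=None)
    def best(genes):
        # Best (cost, tree) among all sub-pedigrees cumulating `genes`.
        if len(genes) == 1:
            (gene,) = genes
            return 0.0, gene
        first, *rest = sorted(genes)
        best_cost, best_tree = float("inf"), None
        # Keep the smallest gene on the left so each split is tried once.
        for k in range(len(rest)):
            for extra in combinations(rest, k):
                left = frozenset((first, *extra))
                right = genes - left
                cost_l, tree_l = best(left)
                cost_r, tree_r = best(right)
                cost = cost_l + cost_r + cross_cost(left, right)
                if cost < best_cost:
                    best_cost, best_tree = cost, (tree_l, tree_r)
        return best_cost, best_tree

    return best(frozenset(range(1, n + 1)))


# Toy cost (an assumption): crossing is harder when more genes are involved.
cost, tree = best_pedigree(4, lambda a, b: (len(a) + len(b)) ** 2)
print(cost, tree)
```

With this toy cost the regular pyramid beats the cascade for 4 genes. The recursion visits every gene subset, so it scales exponentially in n; it only illustrates the pruning principle.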

[ II ]: Sampling and searching
This problem is ubiquitous:
- Physics: equilibrium configurations
- Operations research: feasible solutions of CSP
- Statistics: estimating p-values

The royal road: Markov chain Monte Carlo
- To obtain samples with a given probability distribution or measure, use the Metropolis algorithm (1953).
- Simple, and very effective if there are no bottlenecks.
- If the measure is fragmented, one needs large "moves", but such moves almost always fail.
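For readers unfamiliar with the Metropolis rule, here is a minimal Python sketch on a toy discrete space. The setting (states on a ring, a symmetric proposal to a neighboring state, the name `metropolis`) is our choice for illustration and is not taken from the talk.

```python
import random


def metropolis(weights, steps, rng):
    """Visit states 0..len(weights)-1 on a ring with probability
    proportional to weights; return the visit frequencies."""
    k = len(weights)
    state = 0
    counts = [0] * k
    for _ in range(steps):
        # Symmetric proposal: step to the left or right neighbor.
        proposal = (state + rng.choice((-1, 1))) % k
        # Accept with probability min(1, w(proposal) / w(state)).
        if rng.random() * weights[state] < weights[proposal]:
            state = proposal
        counts[state] += 1
    return [c / steps for c in counts]


freqs = metropolis([1, 2, 3, 2, 1], 200_000, random.Random(0))
print(freqs)  # should be close to the normalized weights
```

If some weights were zero in a way that cut the ring into separate pieces, the chain would stay in the piece where it started; this is the fragmentation problem mentioned on the slide.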

The case of biological networks: some computational challenges
(1) Generate a genotype of given phenotype (oriented search)
(2) Sample uniformly genotypes of a given phenotype: use symmetries to reduce exponentially the space size
(3) Determine the connectivity of the neutral network: do guided search to go from one random genotype to another
(4) Sample uniformly a connected component of the neutral network: use random walks
(5) Sample uniformly the surface of a "ball" around a point: use Metropolis with asymmetric rates
(6) Get the infinite population limit of a population under Darwinian selection: use variance reduction and 1/N extrapolation

Viable genotypes are rare
S. Ciliberti, O. Martin and A. Wagner, PLoS Comput. Biol. (2007)
If one allows for M interactions (M non-zero entries of W) between N genes, what fraction of the genotypes (regulatory networks) are viable?
By smart sampling: illustration when M = 0.25 N^2

Showing connectivity properties of biological networks
We want to check with a high level of confidence that a certain space S is connected. We do this in three steps:
(1) Use the Metropolis MC algorithm to produce random pairs of points (P1, P2) in the space S.
(2) Generate an "equilibrium" cloud of points in S around P1 by a biased Monte Carlo and store these.
(3) Produce a MC chain of points in S, starting from P2, using for instance the same Monte Carlo rule as above; check for collisions with the stored set. If a collision arises, P1 is connected to P2.
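Steps (2) and (3) can be sketched on a toy space as follows. Everything specific here is an assumption made for illustration: points are bit strings stored as integers, a move flips one bit, `is_viable` is a hypothetical membership test for S, and a plain (unbiased) walk confined to S replaces the biased Monte Carlo of the slide. The pair (P1, P2) is supplied by the caller rather than drawn by Metropolis.

```python
import random


def random_walk(start, is_viable, length, steps, rng):
    """Yield the points of a single-bit-flip walk that never leaves S."""
    point = start
    for _ in range(steps):
        candidate = point ^ (1 << rng.randrange(length))
        if is_viable(candidate):  # moves leaving S are rejected
            point = candidate
        yield point


def appear_connected(p1, p2, is_viable, length, cloud_steps, chain_steps, rng):
    """Return True if a chain started at p2 collides with the cloud around p1."""
    cloud = {p1}
    cloud.update(random_walk(p1, is_viable, length, cloud_steps, rng))
    if p2 in cloud:
        return True
    chain = random_walk(p2, is_viable, length, chain_steps, rng)
    return any(point in cloud for point in chain)


def at_least_three_ones(x):
    # Toy viability rule (an assumption): at least three bits set.
    return bin(x).count("1") >= 3


print(appear_connected(0b111, 0b1110000000, at_least_three_ones,
                       10, 500, 20_000, random.Random(0)))
```

A collision proves that P1 and P2 are connected. The absence of a collision proves nothing by itself; it only lowers confidence in connectivity as the chains get longer.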

The viable genotypes form a connected network
Very few viable networks are not in the giant connected component, and the few such networks are usually isolated.
Example: for M = 0.25 N^2, the fraction of viable networks not belonging to the giant component is:
- 2.3 × 10^-3 at N = 8
- 1.7 × 10^-3 at N = 12
- 1.4 × 10^-3 at N = 20

Structure in the neighborhood of a viable genotype

Neutral network topology S. Ciliberti, O. Martin and A. Wagner, PNAS (2007)

Constructive samplers
- When the measure is fragmented, resort to creating samples ab initio and use weights.
- One needs to "guide" the construction, otherwise the weights have huge variance.
- Some cases are "easy" (Sinclair et al.): polynomial randomized approximation scheme.
- Some difficult cases have been treated (PERM of Grassberger), but it is an art.
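To make "creating samples ab initio and using weights" concrete, here is a minimal Python sketch on a textbook toy problem that is not from the talk: growing self-avoiding walks on the square lattice step by step, with a weight equal to the product of the number of free neighbors at each step (Rosenbluth-style weighting, which is simpler than PERM: there is no pruning or enrichment). The function names are ours.

```python
import random

MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def grow_walk(n_steps, rng):
    """Grow one self-avoiding walk from the origin; return (walk, weight)."""
    walk = [(0, 0)]
    visited = {(0, 0)}
    weight = 1.0
    for _ in range(n_steps):
        x, y = walk[-1]
        free = [(x + dx, y + dy) for dx, dy in MOVES
                if (x + dx, y + dy) not in visited]
        if not free:
            return walk, 0.0  # dead end: the sample gets zero weight
        weight *= len(free)  # corrects for the non-uniform construction
        site = rng.choice(free)
        walk.append(site)
        visited.add(site)
    return walk, weight


def estimate_count(n_steps, samples, rng):
    """The mean weight estimates the number of n-step self-avoiding walks."""
    return sum(grow_walk(n_steps, rng)[1] for _ in range(samples)) / samples


print(estimate_count(4, 20_000, random.Random(1)))  # exact count is 100
```

For long walks the weights spread over many orders of magnitude, which is the huge-variance problem noted on the slide and the reason guided schemes such as PERM were developed.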

Other samplers
- Choose at random sufficiently small sub-regions and apply branch and bound in each to get configurations (very slow).
- Perform nested sampling (multiple measures interpolating to the desired one).
- Accept an incorrect distribution and just get "some" configurations by guided stochastic search; this is OK in the context of search or "design".

The mutational robustness and our measure Q have a strong association
What makes a regulatory network robust, and how can one "design" functional networks?
Q is a "quality" factor which measures the synergy of the W_ij S_j.

[ III ]: Nested sampling for counting
Sometimes it is not enough to sample feasible solutions; one may want to know their number or frequency:
- Physics: entropy
- Statistics: small p-values
- Operations research: size of the set of feasible solutions of CSP
- Biology: computing neutral network sizes

Nested sampling
In a discrete space, we want to sample configurations having an unusual property, forming a fraction of, say, 1 in a trillion.
- Randomly sampling the full space won't do.
- Often Monte Carlo won't work because the desired sub-space is fragmented.
- Introduce a family of measures interpolating between the full space and the desired sub-space, and use exchange Monte Carlo on the replicas.

Example: cardinality of the "neutral" network in RNA modeling
T. Jorg, O. Martin and A. Wagner, submitted to BMC Bioinformatics
- Discrete space of sequences; only a tiny fraction have the correct folding.
- Changing the sequence just a little sometimes changes the folding a lot, so the space is fragmented.
- A simple choice for the measures: increasing distances to the target fold. At very short distances the measure is fragmented, but the use of larger distances restores connectivity, thereby allowing the use of the Metropolis approach.
- Even with this simple choice, one can efficiently sample the space of interest uniformly in spite of its rarity.
- Extra bonus: one can both sample and count stochastically, in contrast to standard Monte Carlo.
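The counting idea can be illustrated on a toy problem where the answer is known. The sketch below is our own simplification, not the method of the cited work: the space is all bit strings of a given length, the "distance to the target" is the Hamming distance to one fixed string, and the nested sets are S_d = {x : distance(x, target) ≤ d}. The rare fraction is written as a telescoping product, |S_0| / |S_L| = Π_{d=1..L} |S_{d-1}| / |S_d|, and each ratio is estimated by a Metropolis walk that is uniform on S_d. We use one independent chain per level instead of exchange Monte Carlo on the replicas, which is enough here because the toy sets are not fragmented.

```python
import random


def estimate_target_fraction(length, steps_per_level, rng):
    """Estimate |S_0| / |S_length| (exactly 2**-length in this toy case)."""
    fraction = 1.0
    burn_in = steps_per_level // 10
    for d in range(length, 0, -1):
        # Start at the target, which belongs to every nested set.
        differs = [False] * length  # bits that differ from the target
        dist = 0
        hits = 0
        for step in range(burn_in + steps_per_level):
            i = rng.randrange(length)
            new_dist = dist - 1 if differs[i] else dist + 1
            if new_dist <= d:  # reject moves that leave S_d
                differs[i] = not differs[i]
                dist = new_dist
            if step >= burn_in and dist <= d - 1:
                hits += 1  # current point also lies in S_{d-1}
        fraction *= hits / steps_per_level
    return fraction


length = 12
estimate = estimate_target_fraction(length, 20_000, random.Random(0))
print(estimate * 2 ** length)  # estimated size of the target set; exact value is 1
```

Direct uniform sampling would need on the order of 2^length draws to hit the target even once, whereas the product of ratios is obtained from a number of steps that grows only with the number of levels.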

Some conclusions
- In the most favourable cases, one can enumerate, sample, search (design/optimize) and count.
- Sophisticated algorithmic approaches based on Markov chains allow one to sample even in intricate spaces, though at a significant computational cost.
- The use of nested sampling allows for approximate counting in many realistic cases.
- Except for enumeration, these techniques are perfectly applicable to continuous spaces.
