Inference of Complex Genealogical Histories In Populations and Application in Mapping Complex Traits Yufeng Wu Dept. of Computer Science and Engineering.

Presentation transcript:

Inference of Complex Genealogical Histories In Populations and Application in Mapping Complex Traits Yufeng Wu Dept. of Computer Science and Engineering University of Connecticut DIMACS 2008

Genealogy: Evolutionary History of Genomic Sequences
Tells how sequences in a population are related.
Helps to explain diseases: a disease mutation occurs on a branch, and all descendants of that branch carry the mutation.
The genealogy is unknown. We only have SNP haplotypes (binary sequences).
Problem: inference of genealogy for “unrelated” haplotypes. Not easy, partly due to recombination.
Figure labels: sequences in current population; diseased (case); healthy (control); disease mutation.

Recombination
One of the principal genetic forces shaping sequence variation within species.
Two equal-length sequences generate a third, new sequence of the same length in the genealogy.
Spatial order is important: different parts of the genome are inherited from different ancestors.
Figure labels: prefix; suffix; breakpoint.

Ancestral Recombination Graph (ARG)
Assumption: at most one mutation per site.
Figure labels: S1 = 00, S2 = 01, S3 = 10, S4 = 10; mutations; S1 = 00, S2 = 01, S3 = 10, S4 = ; recombination.

What is the Use of an ARG?
Local trees: the evolutionary history of the different genomic regions between recombination breakpoints.
One may look at the ARG directly. But for noisy data there is another way of using ARGs: an ARG represents a set of local trees!
Figure labels: data; local tree near site 3.

At which Local Tree Did Disease Mutations Occur?
Clear separation of cases and controls: not expected for complex diseases.
Figure labels: case; control; possible disease mutation.

How to infer ARGs?
But we do not know the true ARG! Goal: infer ARGs from haplotypes.
First practical ARG association mapping method (Minichiello and Durbin, 2006)
– Uses plausible ARGs: heuristic
– Less complex disease model: implicitly assumes one disease mutation with major effects
My results (Wu, RECOMB 2007)
– Generate ARGs with a provable property, and work on a well-defined complex disease model
– Focus on parsimonious history

Simulation Results (Wu, 2007)
Comparison: TMARG (minARGs), TMARG (near minARGs), LATAG (Z. P.), MARGARITA (M. D.).
TMARG (my program) and MARGARITA are much faster than LATAG.
TMARG/MARGARITA: sample ARGs, decompose them into local trees, and look for association signals.
LATAG: infers local trees at focal points.
Figure: average mapping error for 50 simulated datasets from Zollner and Pritchard.

Preliminary Results: GAW16 Data
Caution: more investigation needed.
GAW16 data from the North American Rheumatoid Arthritis Consortium (NARAC): 868 cases and 1194 controls. Chromosome one: SNPs.
Running TMARG on large-scale data
– Break into non-overlapping windows
– Run fastPHASE (Scheet and Stephens, 2006) to obtain haplotypes
– Run TMARG with Chi-square mode
? SNP rs reported in Begovich et al., 2004 and Carlton et al., 2005
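As a rough illustration of what a chi-square association score on a local tree could look like, the sketch below computes the Pearson chi-square statistic for a 2x2 table of clade membership against case/control status. This is an assumption made for illustration only: the slides do not define TMARG's Chi-square mode, and the function name `clade_chi_square` and the toy sample labels are hypothetical.

```python
def clade_chi_square(clade, cases, controls):
    """Pearson chi-square for a 2x2 table: in/out of clade vs. case/control.

    Hypothetical helper for illustration; not TMARG's actual statistic.
    All three arguments are sets of sequence labels.
    """
    a = len(clade & cases)       # cases under the branch
    b = len(clade & controls)    # controls under the branch
    c = len(cases - clade)       # cases elsewhere in the tree
    d = len(controls - clade)    # controls elsewhere in the tree
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:               # an empty row or column carries no signal
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Toy data (made up): eight sequences, four cases and four controls.
cases = {"A", "C", "E", "G"}
controls = {"B", "D", "F", "H"}
clade = {"C", "E", "G"}          # leaves below one branch of a local tree
stat = clade_chi_square(clade, cases, controls)
print(round(stat, 3))
```

A branch whose clade is enriched for cases gets a larger score, so scanning all branches of all local trees gives one possible way to look for association signals.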

A Related Problem: Inference of Local Tree Topologies Directly (Wu, 2008, submitted)

Inference of Local Tree Topologies
Recall: an ARG represents a set of local trees.
Question: given SNP haplotypes, infer local tree topologies (one tree for each SNP site, ignoring branch lengths).
– Hein (1990, 1993), Song and Hein (2003, 2005): enumerate all possible tree topologies at each site
– Parsimony-based

Local Tree Topologies
Key technical difficulty: enumerating all tree topologies
– Brute-force enumeration of local tree topologies: not feasible when the number of sequences > 9
Trivial solution: create a tree for a SNP containing the single split induced by the SNP.
– Always correct (assuming one mutation per site)
– But not very informative: need more refined trees!
Example: A: 0, B: 0, C: 1, D: 0, E: 1, F: 0, G: 1, H: 0 gives the split C E G | A B D F H.
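The trivial solution can be sketched in a few lines: the split induced by a SNP simply groups the sequences by their allele at that site. The function name `split_from_snp` and the dictionary representation are assumptions made for this sketch; the data are the eight sequences A to H from the slide.

```python
def split_from_snp(alleles):
    """Return the split (bipartition) induced by one binary SNP.

    alleles: dict mapping sequence name -> allele (0 or 1) at this site.
    Hypothetical helper for illustration.
    """
    ones = frozenset(name for name, a in alleles.items() if a == 1)
    zeros = frozenset(name for name, a in alleles.items() if a == 0)
    return ones, zeros


# The example from the slide.
snp = {"A": 0, "B": 0, "C": 1, "D": 0, "E": 1, "F": 0, "G": 1, "H": 0}
ones, zeros = split_from_snp(snp)
print(sorted(ones), "|", sorted(zeros))
```

A tree containing only this split has a single internal edge, which is why it is always correct under the one-mutation-per-site assumption but carries little information.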

How to do better? Neighboring Local Trees are Similar!
Nearby SNP sites provide hints!
– Nearby local trees are often topologically similar
– Recombination often alters only small parts of the trees
Key idea: reconstruct local trees by combining information from multiple nearby SNPs.

RENT: REfining Neighboring Trees
Maintain for each SNP site a (possibly non-binary) tree topology
– Initialize to a tree containing the split induced by the SNP
Gradually refine the trees by adding new splits to them
– Splits found by a set of rules (later)
– Splits added early may be more reliable
Stop when the trees are binary or enough information is recovered.

A Little Background: Compatibility
Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
Easily extended to splits. A split s is incompatible with tree T if s is incompatible with any one split in T. Two trees are compatible if their splits are pairwise compatible.
Figure: matrix M over sequences a, b, c, d, e. Sites 1 and 2 are compatible, but 1 and 3 are incompatible.
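The compatibility test above (often called the four-gamete test) is easy to state in code. The sketch below is a minimal version, assuming binary columns given as tuples with one entry per sequence and a tree represented simply as a list of its splits. The matrix is made up for illustration, since the slide's own matrix M did not survive in the transcript. It is chosen so that sites 1 and 2 are compatible and sites 1 and 3 are not.

```python
ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}


def sites_compatible(col_p, col_q):
    """Two binary columns are compatible unless all four gametes appear."""
    return set(zip(col_p, col_q)) != ALL_GAMETES


def split_compatible_with_tree(split, tree_splits):
    """A split is compatible with a tree if it is compatible with every split in it."""
    return all(sites_compatible(split, t) for t in tree_splits)


# Hypothetical data: rows are sequences a..e, listed column by column.
site1 = (0, 0, 1, 1, 1)
site2 = (0, 0, 0, 1, 1)
site3 = (0, 1, 0, 1, 0)

print(sites_compatible(site1, site2))  # three distinct gametes only
print(sites_compatible(site1, site3))  # all four gametes present
print(split_compatible_with_tree(site2, [site1, site3]))
```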

Fully-Compatible Region: Simple Case
A region of consecutive SNP sites where these SNPs are pairwise compatible.
– May indicate that no topology-altering recombination occurred within the region
Rule: for site s, add the split of any site in such a region containing s to the tree at s.
– Compatibility is a very strong property and is unlikely to arise by chance.
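One way to find a fully-compatible region around a site is to grow an interval outward while every newly added site stays compatible with all sites already inside. The sketch below assumes this greedy strategy (left first, then right), which is an illustrative choice and not necessarily how RENT delimits regions. The function names and the four-site example are hypothetical.

```python
def sites_compatible(col_p, col_q):
    """Four-gamete test on two binary columns."""
    return len(set(zip(col_p, col_q))) < 4


def compatible_region(cols, s):
    """Greedily grow [lo, hi] around site s, keeping all sites pairwise compatible.

    cols: list of binary columns (one tuple per SNP site).
    Returns inclusive 0-based bounds (lo, hi). Extending left before right
    is an arbitrary choice made for this sketch.
    """
    lo = hi = s
    while lo > 0 and all(sites_compatible(cols[lo - 1], cols[j])
                         for j in range(lo, hi + 1)):
        lo -= 1
    while hi + 1 < len(cols) and all(sites_compatible(cols[hi + 1], cols[j])
                                     for j in range(lo, hi + 1)):
        hi += 1
    return lo, hi


# Hypothetical columns over five sequences: sites 0-1 and sites 2-3 form
# two fully-compatible regions separated by an incompatibility.
cols = [
    (0, 0, 1, 1, 1),
    (0, 0, 0, 1, 1),
    (0, 1, 0, 1, 0),
    (0, 1, 0, 0, 0),
]
print(compatible_region(cols, 0))
print(compatible_region(cols, 3))
```

Under the rule on the slide, the splits of all sites inside the returned interval would then be added to the tree at site s.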

Split Propagation: More General Rule
Three consecutive sites 1, 2 and 3. Sites 1 and 2 are incompatible. Does site 3 matter for the tree at site 1?
– Trees at sites 1 and 2 are different.
– Suppose site 3 is compatible with sites 1 and 2. Then?
– Site 3 may indicate a shared subtree in both trees at sites 1 and 2.
Rule: a split propagates in both directions until reaching an incompatible tree.
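The propagation rule can be sketched as a walk left and right from the originating site, adding the split to each neighboring tree until a tree with an incompatible split is reached. The representation below (each tree as a set of splits, each split as a binary tuple) and the function name `propagate` are assumptions for illustration. The three-site example mirrors the slide: sites 1 and 2 are incompatible, and site 3 is compatible with both.

```python
def sites_compatible(p, q):
    """Four-gamete test on two binary splits."""
    return len(set(zip(p, q))) < 4


def propagate(trees, origin, split):
    """Add `split` to neighboring trees in both directions.

    trees: list of sets of splits, one set per SNP site (modified in place).
    Propagation in a direction stops at the first tree that contains a
    split incompatible with `split`. Returns the indices that were reached.
    """
    reached = []
    for step in (-1, 1):
        j = origin + step
        while 0 <= j < len(trees) and all(sites_compatible(split, t)
                                          for t in trees[j]):
            trees[j].add(split)
            reached.append(j)
            j += step
    return sorted(reached)


# Hypothetical splits over six sequences a..f.
s1 = (0, 0, 1, 1, 0, 0)
s2 = (0, 1, 0, 1, 0, 0)   # incompatible with s1
s3 = (0, 0, 0, 0, 1, 1)   # compatible with both: a shared subtree {e, f}
trees = [{s1}, {s2}, {s3}]

print(propagate(trees, 2, s3))  # site 3's split reaches sites 2 and 1
print(propagate(trees, 0, s1))  # blocked immediately by the tree at site 2
```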

One Subtree-Prune-Regraft (SPR) Event
Recombination: simulated by SPR.
– The rest of the two trees (without the pruned subtrees) remains the same
Rule: find a compatible subtree T_s in neighboring trees T1 and T2, such that the rest of T1 and T2 (with T_s removed) are compatible. Then jointly refine T1 - T_s and T2 - T_s before adding back T_s.
More complex rules are possible.
Figure label: subtree to prune.

Simulation
Hudson’s program MS (with known coalescent local tree topologies): 100 datasets for each setting.
– Handles much larger data, and performs better than or similarly to Song and Hein’s method on small data.
Test: local tree topology recovery, scored by Song and Hein’s shared-split measure.
Figure panels: θ = 15; ρ = 50.

Acknowledgement
More information available at:
I want to thank
– Dan Gusfield
– Yun S. Song
– Charles Langley
– Dan Brown
– And the National Science Foundation and UConn Research Foundation