Parsimony population haplotyping


Parsimony population haplotyping
Giuseppe Lancia, University of Udine
Romeo Rizzi, Cristina Pinotti, University of Trento

Polymorphisms
A polymorphism is a feature that is
- common to everybody
- not identical in everybody
- limited to just a few possible variants (alleles)
E.g. think of eye color, or of blood type for a feature not visible from outside.

Single Nucleotide Polymorphism (SNP)
At DNA level, a polymorphism is a sequence of nucleotides varying in a population. The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP).

atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggattagttagggcacaggacgtac
atcggcttagttagggcacaggacgtac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggattagttagggcacaggacggac
atcggcttagttagggcacaggacggac

HOMOZYGOUS: same allele on both chromosomes
HETEROZYGOUS: different alleles
HAPLOTYPE: chromosome content at SNP sites

Reading only the two SNP sites of the 14 sequences gives the haplotypes:
ct ag cg at at at ct ag ag cg ag ag ag cg

GENOTYPE: “union” of 2 haplotypes (O = homozygous site, followed by its allele; E = heterozygous site)
ct + cg = OcE
at + ag = OaE
at + at = OaOt
ct + ag = EE
ag + cg = EOg
ag + ag = OaOg
ag + cg = EOg

CHANGE OF SYMBOLS: each SNP takes only two values in a population. Call them 0 and 1. Also, call 2 the fact that a site is heterozygous.
HAPLOTYPE: string over 0,1
GENOTYPE: string over 0,1,2

RULES:
0 + 0 = 0
1 + 1 = 1
0 + 1 = 2
1 + 0 = 2

With c = 0, a = 1 at the first site and g = 0, t = 1 at the second, the seven individuals become:
01 + 00 = 02
11 + 10 = 12
11 + 11 = 11
01 + 10 = 22
10 + 00 = 20
10 + 10 = 10
10 + 00 = 20
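The combination rules can be written as a few lines of Python. This is a minimal sketch; the function name `combine` is an illustrative choice and does not come from the slides.

```python
def combine(h1: str, h2: str) -> str:
    """Return the genotype obtained from two haplotypes over {0, 1}."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of SNP sites")
    # Equal alleles give a homozygous site (0 or 1); different alleles give 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(combine("01", "00"))  # 02
print(combine("11", "10"))  # 12
print(combine("01", "10"))  # 22
```

Note that the operation is symmetric, so the order of the two haplotypes does not matter.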

The haplotype reconstruction problem (from genotypes – which are much cheaper to obtain than haplotypes)

The “biological” problem…
[Figure: animated build-up of a population of 3-SNP haplotypes, e.g. 011, 000, 101, 010, 001, 111]

We observe GENOTYPES
022 022 111 221 012 211 222 012 221 011 022 222 020 012 212 212 222 000 010

PROBLEM: given input GENOTYPE data
INPUT: G = {11221, 22221, 11011, 21221, 00011}
OUTPUT: H = {11011, 11101, 00011, 01101}

Each genotype is explained by two haplotypes:
21221 = 11011 + 01101
11221 = 11011 + 11101
11011 = 11011 + 11011
22221 = 00011 + 11101
00011 = 00011 + 00011

OBJ: The cardinality of H is MINIMUM (Parsimony, aka Occam’s razor)
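The example can be checked mechanically. The sketch below verifies that H explains every genotype in G, then finds the minimum cardinality by exhaustive search over the haplotypes compatible with at least one genotype. The search is exponential and only meant for toy inputs; all function names are illustrative assumptions.

```python
from itertools import combinations, product


def explains(h1, h2, g):
    """True if haplotypes h1 and h2 combine into genotype g."""
    return all(
        (a == b == s) if s in "01" else (a != b)
        for a, b, s in zip(h1, h2, g)
    )


def compatible_haplotypes(g):
    """All haplotypes that could be one half of an explanation of g."""
    choices = [("0", "1") if s == "2" else (s,) for s in g]
    return {"".join(c) for c in product(*choices)}


def is_explained(g, H):
    """True if some pair from H (a haplotype may pair with itself) explains g."""
    return any(explains(h1, h2, g) for h1 in H for h2 in H)


def min_haplotype_set(G):
    """Smallest explaining set, found by brute force (toy sizes only)."""
    candidates = sorted(set().union(*(compatible_haplotypes(g) for g in G)))
    for size in range(1, len(candidates) + 1):
        for H in combinations(candidates, size):
            if all(is_explained(g, H) for g in G):
                return set(H)
    return set()


G = ["11221", "22221", "11011", "21221", "00011"]
H = {"11011", "11101", "00011", "01101"}
print(all(is_explained(g, H) for g in G))  # True
print(len(min_haplotype_set(G)))           # 4
```

Restricting the search to compatible haplotypes loses nothing, since a haplotype that fits no genotype can never take part in an explanation.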

Other objectives for reconstruction:
- Clark’s inference rule (Gusfield, JCB 2001)
- Solution fits a perfect phylogeny (Eskin, Halperin, Karp, JBCB 2003) (Bafna, Gusfield, Lancia, Yooseph, JCB 2003)

MENU:
- Prove problem is difficult
- Give an exact algorithm (ILP)
- Give approximation algorithms
- Dessert

1. The problem is APX-Hard Reduction from VERTEX-COVER on graphs G=(V,E) for which (thanks to a theorem by Nemhauser and Trotter, 1975)

Example graph G = (V,E) with vertices A, B, C, D, E and edges AB, BC, AE, DE, AD.
Build one genotype per edge and one per vertex, over the columns A B C D E *:

     A B C D E *
AB   2 2 1 1 1 2
BC   1 2 2 1 1 2
AE   2 1 1 1 2 2
DE   1 1 1 2 2 2
AD   2 1 1 2 1 2
A    0 1 1 1 1 0
B    1 0 1 1 1 0
C    1 1 0 1 1 0
D    1 1 1 0 1 0
E    1 1 1 1 0 0

G = (V,E) has a node cover X of size k ⟺ there is a set H of |V| + k haplotypes that explain all genotypes

For the node cover X = {A, B, E}, H consists of the five vertex rows A to E (taken as haplotypes) plus:

     A B C D E *
A’   0 1 1 1 1 1
B’   1 0 1 1 1 1
E’   1 1 1 1 0 1

It can be shown that a (1 + ε)-approximation for Haplotyping would imply a (1 + 3ε)-approximation for Vertex Cover
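The construction in the table can be reproduced with a short script. This is a sketch of the instance as it appears on the slide, not of the full APX-hardness proof; the function names are illustrative assumptions.

```python
def explains(h1, h2, g):
    """True if haplotypes h1 and h2 combine into genotype g."""
    return all(
        (a == b == s) if s in "01" else (a != b)
        for a, b, s in zip(h1, h2, g)
    )


def build_instance(vertices, edges):
    """Genotypes of the reduction: one per edge and one per vertex.

    Columns are the vertices in order, followed by the extra '*' column.
    """
    genotypes = {}
    for u, v in edges:
        row = ["2" if w in (u, v) else "1" for w in vertices] + ["2"]
        genotypes[u + v] = "".join(row)
    for u in vertices:
        row = ["0" if w == u else "1" for w in vertices] + ["0"]
        genotypes[u] = "".join(row)
    return genotypes


def haplotypes_from_cover(vertices, cover):
    """|V| forced haplotypes plus one primed haplotype per cover vertex."""
    H = set()
    for u in vertices:
        base = "".join("0" if w == u else "1" for w in vertices)
        H.add(base + "0")      # forced by the homozygous vertex genotype
        if u in cover:
            H.add(base + "1")  # the primed copy, e.g. A'
    return H


vertices = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("A", "E"), ("D", "E"), ("A", "D")]
G = build_instance(vertices, edges)
H = haplotypes_from_cover(vertices, {"A", "B", "E"})

print(G["AB"])  # 221112
print(len(H))   # 8
print(all(any(explains(a, b, g) for a in H for b in H) for g in G.values()))  # True
```

Each edge genotype uv is explained by pairing the primed haplotype of one endpoint with the unprimed haplotype of the other, which is why at least one endpoint of every edge must be in the cover.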

2. An exact algorithm based on Integer Linear Programming

Expand your input G in all possible ways
220: 010 + 100, 000 + 110
022: 000 + 011, 001 + 010
120: 100 + 110

This is the set of haplotypes obtained: H(G) = {000, 001, 010, 011, 100, 110}

Define a 0-1 variable for each h in H(G) and a 0-1 variable for each pair of haplotypes that explains a g in G (the set PAIR(g))
OBJ: minimize the number of haplotypes selected
Provided that: every g in G is explained by a selected pair of PAIR(g)
and that: a pair is selected only if both of its haplotypes are selected
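The expansion step can be sketched in Python as follows. The names `pairs` and `expand` are illustrative stand-ins for PAIR(g) and H(G).

```python
from itertools import product


def pairs(g):
    """PAIR(g): unordered haplotype pairs that explain genotype g."""
    het = [i for i, s in enumerate(g) if s == "2"]
    result = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i] = b                         # one chromosome gets b
            h2[i] = "1" if b == "0" else "0"  # the other gets the opposite
        result.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return result


def expand(G):
    """H(G): every haplotype appearing in some pair of some genotype."""
    return {h for g in G for pair in pairs(g) for h in pair}


G = ["220", "022", "120"]
for g in G:
    print(g, sorted(pairs(g)))
print(sorted(expand(G)))  # ['000', '001', '010', '011', '100', '110']
```

A genotype with j heterozygous sites yields 2^(j-1) unordered pairs for j >= 1, and a single pair (g, g) when it has none, so the expansion grows exponentially in the number of 2s.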

The resulting Integer Program: minimize
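The formula on this slide is not preserved in the transcript. A standard formulation consistent with the preceding slides is sketched below. The symbol names x_h (haplotype h is selected) and y_{h1,h2} (pair is selected) are assumptions, and the covering constraint is written with >= although the original may have used equality.

```latex
\begin{align*}
\min \quad & \sum_{h \in H(G)} x_h \\
\text{s.t.} \quad
  & \sum_{(h_1,h_2) \in \mathrm{PAIR}(g)} y_{h_1,h_2} \ge 1
      && \forall g \in G \\
  & y_{h_1,h_2} \le x_{h_1}, \quad y_{h_1,h_2} \le x_{h_2}
      && \forall (h_1,h_2) \in \mathrm{PAIR}(g),\ \forall g \in G \\
  & x_h \in \{0,1\}, \quad y_{h_1,h_2} \in \{0,1\}
\end{align*}
```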

- The ILP problem can be solved by Branch and Bound, within a time depending on many factors
- This IP model can have a lot of variables and constraints
- Some tricks can be used to reduce the number of variables and constraints (do not define variables for haplotypes that apply only to one pair)

- Simulator by R. Hudson (coalescent theory) to simulate haplotypes (with level of recombination r = 0, 4, 16, 40)
- 50 individuals, 10 and 30 SNP sites (using ILOG CPLEX)
- Compare with PHASE
- At levels r <= 16, results same as PHASE (correctness depended on r; for r = 0 both are 98-100% correct)
- For 50 individuals, 10 sites and r = 40, correctness in 75-95%

15 instances on 30 sites, r = 0, size of ILP very variable:
- From 300 vars (0.03 secs) to 135,000 vars (2.5 mins) to 10^6 vars (no optimal found within 30 mins)
- Most had 10,000 vars, solved in under 2 mins
- Solved 13/15, accuracy 80-96%. PHASE took much more time to achieve no better accuracy.

REDUCTION VARS:
50 indiv, 30 SNPs, r = 4, vars: 28,580 → 5,418
50 indiv, 40 SNPs, r = 16, vars: 548,352 → 129,812
Increasing r makes the problem simpler (but the model less accurate)

3. An approximation algorithm based on Integer Linear Programming and rounding

LINEAR PROGRAMMING RELAXATION:
OPT := min of the Integer Program
LP := min of the same program, with each 0-1 variable relaxed to the interval [0, 1]
Clearly, LP <= OPT

LP ROUNDING TO INTEGER:
Assume each genotype has at most k sites “2”
Then each g gives rise to at most 2^k haplotypes and at most 2^(k-1) pairs, so for constant k the above LP has polynomial size and can be solved in POLYNOMIAL TIME

Let x* be the optimal (possibly fractional) LP-solution
For each h in H(G), take h in solution S iff its value in x* is at least 1/2^(k-1)
|S| <= 2^(k-1) LP <= 2^(k-1) OPT

Solution is feasible, since, for each g, the values of the pairs in PAIR(g) sum to at least 1 over at most 2^(k-1) pairs, and hence at least one of these pairs has value at least 1/2^(k-1). This implies also that both haplotypes of that pair have value at least 1/2^(k-1), and so both are in S.
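The rounding step alone can be sketched without an LP solver. In the example below, `x_star` is a hand-written fractional point for the instance G = {220, 022, 120} with k = 2. It is not the output of a solver and is not claimed to be LP-optimal. The threshold 1/2^(k-1) is inferred from the stated bound, and all names are illustrative assumptions.

```python
def explains(h1, h2, g):
    """True if haplotypes h1 and h2 combine into genotype g."""
    return all(
        (a == b == s) if s in "01" else (a != b)
        for a, b, s in zip(h1, h2, g)
    )


def round_lp(x_star, k):
    """Keep every haplotype whose fractional value is at least 1/2^(k-1)."""
    threshold = 1 / 2 ** (k - 1)
    return {h for h, value in x_star.items() if value >= threshold}


def is_feasible(S, G):
    """True if every genotype is explained by some pair of haplotypes in S."""
    return all(any(explains(a, b, g) for a in S for b in S) for g in G)


G = ["220", "022", "120"]
k = 2  # at most two heterozygous sites per genotype

# Assumed fractional point: each of the two pairs of 220 and of 022 gets
# value 0.5, and the single pair of 120 gets value 1.
x_star = {"000": 0.5, "001": 0.5, "010": 0.5, "011": 0.5,
          "100": 1.0, "110": 1.0}

S = round_lp(x_star, k)
lp_value = sum(x_star.values())

print(sorted(S))
print(is_feasible(S, G))
print(len(S) <= 2 ** (k - 1) * lp_value)
```

Here every haplotype reaches the threshold 1/2, so S has 6 elements, within the guaranteed factor 2^(k-1) = 2 of the fractional value 4.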

Summarizing: there is a 2^(k-1)-approximation algorithm for the case in which each genotype has at most k heterozygous sites
We also have a probabilistic 2^(k+2)-approximation algorithm which does not use Linear Programming
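The counting argument behind the 2^(k-1) ratio fits in one line. In this sketch, x*_h denotes the LP value of haplotype h and S the rounded set. The threshold 2^{-(k-1)} is inferred from the stated bound rather than read off the slides.

```latex
\[
S = \bigl\{\, h \in H(G) : x^*_h \ge 2^{-(k-1)} \,\bigr\}
\quad\Longrightarrow\quad
|S| \;=\; \sum_{h \in S} 1
\;\le\; \sum_{h \in S} 2^{k-1} x^*_h
\;\le\; 2^{k-1} \sum_{h \in H(G)} x^*_h
\;=\; 2^{k-1}\,\mathrm{LP}
\;\le\; 2^{k-1}\,\mathrm{OPT}.
\]
```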

TO DO
- Better exact algorithms (e.g. Combinatorial Branch and Bound)
- Better approximation algorithms (not depending on k, or with better dependence on k. BTW, any greedy algorithm is a -approximation for n genotypes)

BYE, EVERYBODY!