Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina Giuseppe Lancia Università di Udine.

Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina Giuseppe Lancia Università di Udine

MOLECULAR BIOLOGY Human Genome Project (1990): read and understand human (and other species) genome (DNA) Scientific and practical applications (medicine, agriculture, forensic, ….) multi-million project involving several countries

COMPUTER SCIENCE To meet the goal we need computers and programs to deal with problems such as -Huge data sets (billions of informations to be analized) -Data Interpretation (contradictory, erroneous or inconsistent data) - Data sharing and networks (online genomic data banks)

Computational (molecular) Biology Combinatorics, Discrete Math Combinatorial Optimization Integer Programming Complexity theory (Approximations and Hardness) Graph theory (but also Stringology, Data Bases, Neural Networks....) “Optimization problems arising in the analysis, interpretation and management of large sets of genomic data”

FIRST PHASE COMPLETION : HUMAN GENOME SEQUENCE - 2001 World Consortium universities and labs Celera Genomics (Craig Venter)

FIRST PHASE COMPLETION : HUMAN GENOME SEQUENCE - 2001

Computational Biology born around ’80-’90 Algorithmic approaches (e.g. Dynamic Programming for alignment) C omputational complexity (e.g. NP-hardness of folding) String-related problems, Information retrieval, Genomic data base…. ……  mostly computer scientists dominated the field

Some NP-hard problems in C.B. are OPTIMIZATION PROBLEMS These can be solved via mathematical programming techniques INTEGER LINEAR PROGRAMMING LAGRANGIAN RELAXATION SEMIDEFINITE PROGRAMMING QUADRATIC PROGRAMMING  O.R. people (and O.R. techniques) entered the field

The most important approach of this type is probably Integer Linear Programming It allows the solution of NP-hard problems via specialized Branch and Bound algorithms The lower bound comes from the Linear Programming relaxation of the model (and can be computed in polynomial time)

The I.P. approach

1.Define integer (usually binary) variables The I.P. approach

1.Define integer (usually binary) variables The I.P. approach 2. Define linear constraints that feasible solutions must satisfy

1.Define integer (usually binary) variables The I.P. approach 2. Define linear constraints that feasible solutions must satisfy 3. Define linear objective function that optimal solution must mini(maxi)mize

1.Define integer (usually binary) variables The I.P. approach 2. Define linear constraints that feasible solutions must satisfy 3. Define linear objective function that optimal solution must mini(maxi)mize 4. Solve by Branch and Bound. The LP relaxation (remove integrality requirements on variables) gives the bound

1.Define integer (usually binary) variables The I.P. approach 2. Define linear constraints that feasible solutions must satisfy 3. Define linear objective function that optimal solution must mini(maxi)mize 4. Solve by Branch and Bound. The LP relaxation (remove integrality requirements on variables) gives the bound There can be an exponential number of variables. We need to make them implicit  BRANCH-AND-PRICE

1.Define integer (usually binary) variables The I.P. approach 2. Define linear constraints that feasible solutions must satisfy 3. Define linear objective function that optimal solution must mini(maxi)mize 4. Solve by Branch and Bound. The LP relaxation (remove integrality requirements on variables) gives the bound There can be an exponential number of variables. We need to make them implicit  BRANCH-AND-PRICE There can be an exponential number of constraints. We need to make them implicit  BRANCH-AND-CUT

Some Integer Programming models in C.B. Haplotyping Clark’s rule (Gusfield ) Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower) Fragment assembly (Lancia+Bafna+Schwartz+Istrail) Protein folding Energy potentials (Wagner+Meller+Elber) Folding ab initio (Carr+Hart+Greenberg) Threading (RAPTOR, Xu+Li+Kim+Xu) Docking (Doye+Leari+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller ) Fold comparison (Carr+Lancia+Istrail+Walenz,Caprara+Lancia) Sequence Alignment and consensus Lenhof+Vingron+Reinert Althaus+Caprara+Lenhof+Reinert Fischetti+Lancia+Serafini Meneses+Lu+Oliveira+Pardalos Physical Mapping Alizadeh+Karp+Weisser+Zweig Genome Rearrangements Caprara+Lancia

Haplotyping Clark’s rule (Gusfield ) Parsimony (Gusfield, Lancia+Pinotti+Rizzi, Brown+Harrower) Fragment assembly (Lancia+Bafna+Schwartz+Istrail) Protein folding Energy potentials (Wagner+Meller+Elber) Folding ab initio (Carr+Hart+Greenberg) Threading (RAPTOR, Xu+Li+Kim+Xu) Docking (Doye+Leari+Locatelli+Schoen, Althaus+Kohlbacher+Lenhof+Muller ) Fold comparison (Carr+Lancia+Istrail+Walenz,Caprara+Lancia) Sequence Alignment and consensus Lenhof+Vingron+Reinert Althaus+Caprara+Lenhof+Reinert Fischetti+Lancia+Serafini Meneses+Lu+Oliveira+Pardalos Physical Mapping Alizadeh+Karp+Weisser+Zweig Genome Rearrangements Caprara+Lancia We’ll see some examples

BIOLOGY 101

genome A genome is a long string over the DNA alphabet {A,C,G,T} In man it is some 3.000.000.000 letters DNA is responsible for our diversity as well as our similarity Small changes in a genome can make a big difference, like from... to...

CELL Nucleus Chromosomes (pairs) TCATCGA AGTAGCT Eukariotic diploid organisms

THE CENTRAN DOGMA OF MOLECULAR BIOLOGY attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta introns exons attag|ctatatcgttgatg|tatatgcta|cga|aatta A GENE

THE CENTRAN DOGMA OF MOLECULAR BIOLOGY attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta introns exons attag|ctatatcgttgatg|tatatgcta|cga|aatta att agc tat atc gtt gat gta tat gct acg aaa tta codon triplets R N C A S S F C W Y Q V A GENE amino acids: a PROTEIN

THE CENTRAN DOGMA OF MOLECULAR BIOLOGY attagcatggatagccgtatatcgttgatgctggataggtatatgctagatcgatggcaatta introns exons attag|ctatatcgttgatg|tatatgcta|cga|aatta att agc tat atc gtt gat gta tat gct acg aaa tta codon triplets R N C A S S F C W Y Q V A GENE amino acids: a PROTEIN The protein folds to a 3D shape to perform its function CENTRAL DOGMA: 1 gene  1 protein

From DNA to strings (i.e. reading our genome): Shotgun sequencing

amplification

fragmentation From DNA to strings (i.e. reading our genome): Shotgun sequencing amplification

TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA fragmentation sequencing From DNA to strings (i.e. reading our genome): Shotgun sequencing amplification

TGAGCCTAG GATTT GCCTAG CTATCTT ATAGATA GAGATTTCTAGAAATC ACTGA TAGAGATTTC TCCTAAAGAT CGCATAGATA fragmentation sequencing From DNA to strings (i.e. reading our genome): Shotgun sequencing amplification assembly

We need to merge the fragments in order to retrieve the original DNA sequence -50,000,000 fragments -1000 char each… We better use computers! assembly ACTGCATTAGCGAGTTATAGATCGAGTAGAGATATCGCGGGG

- Take 10 copies of «Corriere della Sera» - Cut each in tiny pieces (1cm 2 ) - Put the pieces in a bag and shuffle - Grab five handful of pieces and throw them away Problem: retrieve the newspaper from the remaining pieces Let’s make the problem harder: - Ultra-tiny pieces (1mm 2 ) Difficulties : Ripeated words (e.g., «ha», «dopo», «quando», «governo»…) - It is not the CDS, but the encyclopedia Treccani (20 volumes) - It is written in chinese!! Still thr problem would be easier than sequencing the human genome Understanding the problem

Assembly Repeated words and missing words create problems to be solved by sophisticated programs, based on statistics and mathematical models. The basic underlying problem (notwithstanding the above complications) is called Shortest Superstring Problem (SSP)

Shortest superstring: Given a set of strings s 1,.., s n, find a string s that contains each s i as a substring..

cane nerone onere rene neon reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg Shortest superstring:

cane nerone onere rene neon reneroneoncaneonere -> 19 caneronereneonere ->17 caneroneonerene -> 15 acct cattgt gtgcca cctg cattgtgccacctg The space of potential (non-redundant) solutions has size O(n!) The problem is NP-Hard There is an effective greedy algorithm Shortest superstring:

cane nerone onere rene neon Greedy algorithm: -Merge the two strings with max overlap. Replace them with their fusion. -Repeat until only one string is left.

cane nerone onere rene neon neronere Greedy algorithm: -Merge the two strings with max overlap. Replace them with their fusion. -Repeat until only one string is left.

cane nerone onere rene neon neronere neronerene Greedy algorithm: -Merge the two strings with max overlap. Replace them with their fusion. -Repeat until only one string is left.

cane nerone onere rene neon neronere neronerene neronereneon Greedy algorithm: -Merge the two strings with max overlap. Replace them with their fusion. -Repeat until only one string is left.

cane nerone onere rene neon neronere neronerene neronereneon caneronereneon Greedy algorithm: -Merge the two strings with max overlap. Replace them with their fusion. -Repeat until only one string is left.

In this case it found the optimum, but there is no guarantee There’s, however, a guarantee that v(GREEDY) <= 2.5v(OPT) (i.e., it is a 2.5-APPROX ALGORITHM (Sweedyk)) OPEN PROBLEM (Conjecture holding from > 20 years): Prove that GREEDY is a 2-APPROX ALGORITHM For more info see http://fileadmin.cs.lth.se/cs/Personal/Andrzej_Lingas/superstring.pdf Greedy algorithm:

Sequence Alignments Sequences evolve and change E.g.: deletion, insertion, mutation attcgattgat  attcggat deletion attcgattgat  attcggat insertion attcg  atgcg mutation Given two genomic sequences (e..g., man and mouse) we want to compare them

Languages evolve as well Mater Madre Mother Matre Mutter

Mater Madre Mother Matre Mutter Good alignment: M A T - E R M O T H E R Languages evolve as well

Mater Madre Mother Matre Mutter Good alignment: M A T - E R M O T H E R Bad alignment: - - M A T - E R M O T H - E R - Languages evolve as well

Mater Madre Mother Matre Mutter Good alignment: M A T - E R M O T H E R Bad alignment: - - M A T - E R M O T H - E R - Multiple alignment: M A T - E R - M O T H E R - M A D - - R E M A T - - R E M U T T E R - Languages evolve as well

Mater Madre Mother Matre Mutter Good alignment: M A T - E R M O T H E R Bad alignment: - - M A T - E R M O T H - E R - Multiple alignment: M A T - E R - M O T H E R - M A D - - R E M A T - - R E M U T T E R - If DNA sequences: A C T - G G - A C T C G G - A G T - - C T C C T - - G T A - T T C G - Languages evolve as well

Mater Madre Mother Matre Mutter Good alignment: M A T - E R M O T H E R Bad alignment: - - M A T - E R M O T H - E R - To see if an alignment is good: score 0 identical (or similar) letters (e.g. T-D, B-P, S-F) and score 1 different or deleted letters. Cost: 1Cost: 8 Languages evolve as well

Mater Madre Mother Matre Mutter Good alignment: M A T - E R M O T H E R Bad alignment: - - M A T - E R M O T H - E R - To see if an alignment is good: score 0 identical (or similar) letters (e.g. T-D, B-P, S-F) and score 1 different or deleted letters. Cost: 1Cost: 8 There is an exponential number of possible alignments of two strings, but finding the optimal one is easy (i.e. polynomial). We’ll see it next. For many strings, there is no polynomial algorithm (if P =/= NP), but there are effective heuristics and approximation algorithms. Languages evolve as well

The basic problem The basic problem : pairwise sequence alignment: Seq. 1: ATCGGCTTGTTATTG Seq. 2: ATGGGATTTAATTGCCC ATCGGCTTGTTA-TTG--- ATGGGAT--TTAATTGCCC Alignment:

The basic problem The basic problem : pairwise sequence alignment: Seq. 1: ATCGGCTTGTTATTG Seq. 2: ATGGGATTTAATTGCCC ATCGGCTTGTTA-TTG--- ATGGGAT--TTAATTGCCC Alignment: Cost c(x,y) for each pair x, y in {A, T, C, G, -} Problem Problem: find the arrangement minimizing  i c(AL[1,i], AL[2,i]) This value is the edit distance d(S’, S”)

Simple Dynamic Programming Recursion: d(S’[1…n], S”[1..m]) = min { d(S’[1..n-1],S”[1…m]) + c(S’[n],-) d(S’[1..n],S”[1…m-1]) + c(-, S”[m]) d(S’[1..n-1],S”[1…m-1]) + c(S’[n],S”[m]) d(ATCCG, ATTC) = min { d(ATCC, ATTC) + c(G,-) d(ATCCG, ATT) + c(-, C) d(ATCC, ATT) + c(G,C) Cost: O(n m) (Smith & Waterman, ’86)

Multiple Sequence Alignment This can be extended to Multiple Sequence Alignment Useful for: Finding conserved patterns (functionally relevant) Clustering genomic sequences (e.g. protein families) Evolutionary studies (e.g. intra-species comparisons) …. ATCCGTTAGTTA---GATTG--- ATT--TTAGTTA---GATTGCAC ATTCGTTAGTTAGTAGATTGCA- TTTCGTTA--TA-TACATTGCAC

Multiple Sequence Alignment This can be extended to Multiple Sequence Alignment Useful for: Finding conserved patterns (functionally relevant) Clustering genomic sequences (e.g. protein families) Evolutionary studies (e.g. intra-species comparisons) …. Unfortunately…it’s NP-hard (Wang & Jiang, ’94) (Dynamic Programming costs O(2 n ) for k sequences of length n) kk ATCCGTTAGTTA---GATTG--- ATT--TTAGTTA---GATTGCAC ATTCGTTAGTTAGTAGATTGCA- TTTCGTTA--TA-TACATTGCAC The distance in the alignment A is denoted d A (S’,S”). We wish to minimize SP(A) =  {S’,S”} d A (S’,S”) (sum-of-pairs objective)

a:humanafrican cox1 ;----accatccatcaatctgcgatccccctccaataccaaccataccactcctggccttccagtt b:humprobeuro cox1;gtacaccatccatcaa--tgcgatccccctccaataccaaccataccactcctggccttccagtt c:gorilla cox1;acatattacccatcaacttataatctctttccagcatcagtcatgctattcccgaccccttagtt d:gibbon cox1;gcgtattactcgctggcccatacctcttcctagaccccaactgtgtcgcctctgattccctagtt e:chimpanzee cox1;acacactacctatcgacttacaattccctcccaacaccagtcatgctgttcccgacccctcaact f:pigmychimp cox1;---cactacccattgacttacaatcccttctcagcactaatcacgctgttcccgatccctcaact g:baboon cox1;atgcaccgtttactagctcataccttccccta--ccctgatcgcatcatctttaatcccctaacc h:orangutan cox1;gcgcgctgtccatcatcccacacttctctcccaacattggctataccgtccttgactccctcatc i:sumorang cox1;gcgcgccgtccgtcatcccacacctctctcccaacatcaactataccacccctaactccctcgtc agccccccccctagctgcctcatctccacccgcaccctcttctcccccccccaaacctcctctttac--- agccccccccctagctgcctcatctccacccgcaccctcttctcccccccccaaacctcctctttacttt aatcctccttccaatcgccctacctccacccgcactttctttctctccccccaaattcctttctcacctt gaccctcctttcggccgttctcccccctctcgcatcccaccttcccccttttaaactctccccttactcc Aacgtcccccctgacca--ccattttcacccgcactcccttcctttcctcccgaacttttttcttatctt aatgtcccccctggccactccactttcacccgcattcccttcctttcctcctgaacttttttctcatctt ggtccccccctcaaccaccctcccccatttaatatctcaccctctcctttccaagcttccccctcatccc Aactcttacctcaatcgtctta— tca-tcaacgcccccctcccccattattatgttccttctccccccc aactcttaccttaaccatctcatttcatccagtgcccccctccctcaccatcatatcctctttccccccc Example: Same protein, different species

Phylogenetic Tree Sometimes the sequences can be related via a Phylogenetic Tree A tree can be used to align sequences. Here is how:

Theorem 1: given a fully labeled tree, there is a multiple alignment A(T) such that d A(T) (S(i),S(j)) = d(S(i), S(j)) for all edges (i,j) of T Proof (by 1 drawing): ATGGC TTGGAC ATG-GC TTGGAC-- -TTGGAC AT-GG-C -TTG-GAC-- AT-G-G-C--

A 2-approximation algorithm for multiple alignment and a good heuristic borrow concepts from Network Design 3 2 4 10 59 2 7 6 3 The Routing Cost of a tree is the sum of distances in the tree between all pairs X Y E.g. rc(X,Y) = 3 + 5 + 9 + 2 = 19. rc(T) = 1616 1

If we have k sequences, and a tree connecting them, we can align them using Theorem 1 tccat attcgg ttg ttatcg ttga attcg ttgat tattc ctt tcct ctggat aatgt The cost of edges {S’,S”} is the edit distance d(S’S”)

If we have k sequences, and a tree connecting them, we can align them using Theorem 1 tccat attcgg ttg ttatcg ttga attcg ttgat tattc ctt tcct ctggat aatgt By triangle inequality, d A (S’,S”)   {i,j}in path S’,..,S” d A (i,j) d A (ttatcg, ctt)  d A (ttatcg, ttga) + d A (ttga, attcg) + d A (attcg,ctt) = = d(ttatcg, ttga) + d(ttga, attcg) + d(attcg,ctt) Hence SP(A(T))  rc(T) Stars already yield a 2-approximation (next). If T* minimizes rc(T), we can expect a low SP(A(T*)). Finding T* is NP-hard, but “easier” than solving min SP The cost of edges {S’,S”} is the edit distance d(S’S”)

Some particular trees (STARS) give a 2-approximation to SP c

ILP for Minimum Routing Cost Tree ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini) Given G=(V,E), edge lengths d. Use 0-1 variables for each path

ILP for Minimum Routing Cost Tree ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini) Given G=(V,E), edge lengths d. Use 0-1 variables for each path (i)

ILP for Minimum Routing Cost Tree ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini) Given G=(V,E), edge lengths d. Use 0-1 variables for each path (i) (ii)

ILP for Minimum Routing Cost Tree ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini) Given G=(V,E), edge lengths d. Use 0-1 variables for each path (i) (ii) (iii)

ILP for Minimum Routing Cost Tree ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini) Given G=(V,E), edge lengths d. Use 0-1 variables for each path (i) (ii) (iii) Pricing becomes O(n^2) shortest path problems with lengths d e + v e ( v e dual vars for (ii))

ILP for Minimum Routing Cost Tree ILP for Minimum Routing Cost Tree (Fischetti, L, Serafini) Given G=(V,E), edge lengths d. Use 0-1 variables for each path (i) (ii) (iii) Best tree has RC 10% less than RC of best star “ “ “ SP 6% “ “ SP “ “, SP of star is experimentally 10-15% of optimal SP of best tree is 4-10% of optimal

Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina Giuseppe Lancia Università di Udine.

Similar presentations

Presentation on theme: "Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina Giuseppe Lancia Università di Udine."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina Giuseppe Lancia Università di Udine.

Similar presentations

Presentation on theme: "Modelli e Metodi di Ottimizzazione per la Biologia Computazionale e la Medicina Giuseppe Lancia Università di Udine."— Presentation transcript:

Similar presentations

About project

Feedback