The Thirties Algorithms: formalization… Turing, Church, Kleene, Gödel, Post… An algorithm is an unambiguous, ordered, finite sequence of steps, each effectively executable in finite time, producing a result in finite time
The forties again Claude Shannon- The Birth of Information Theory …and data compression The guy is quite a character, please visit: https://www.youtube.com/watch?v=G5rJJgt_5mg
the fab sixties Not only algorithms but Fast algorithms and formal methodologies for their design and analysis: Knuth, Tarjan, Hopcroft
The seventies NP-Completeness: Not all problems seem to admit an efficient algorithmic solution… and Computational Biology has plenty of examples. Edmonds, Cook, Karp
Discrete Algorithms Discrete mathematical objects: good models to represent computational problems Example: Graphs
Discrete Algorithms Discrete mathematical objects: Efficient organization of information Example : Trees
Discrete Algorithms How to establish the performance of an algorithm: Models of computers, Hardware, etc. Here: The “real thing”: The Turing Machine or equivalent
Discrete Algorithms and Bioinformatics - Do we need more algorithms? PubMed search: 21471 papers [1991, 2014] Scopus: 101178 papers (Biochemistry, Genetics, Molecular Biology) We need GOOD ALGORITHMS
Discrete Algorithms and Bioinformatics Good Algorithms Fast and memory efficient, i.e., process growing amounts of data in “reasonable” time and little space Accurate, i.e., able to identify useful biological information in terms of function and/or structure
Discrete Algorithms and Bioinformatics Good Algorithms Accurate (evaluation): - THE BIOLOGIST: a physical person - A Statistician, i.e., statistical analysis: surprising or unexpected “events” are related to “biologically useful” information. Examples: BLAST, transcription factor binding sites - Benchmark data sets: they offer solutions, validated by experts, that one can compare against. Examples: CASP, DREAM, MSA. NOT AVAILABLE IN MANY CRUCIAL DOMAINS
Discrete Algorithms and Bioinformatics Good Algorithms The statistician: care must be exercised… (ahi ahi ahi, no Alpitour) Towards sound epistemological foundations of statistical methods for high-dimensional biology, Mehta et al., Nature Genetics 2004. Exponential growth of statistical methods for microarray analysis. For many of them, it is unclear what they do and why they are needed: they are defined as Questionable
Discrete Algorithms and Bioinformatics - Good Algorithms (Time, Space) Let’s Take a Global Look [figure: Processor Power (MIPS), External Disk Capacity (MB), Sequencing Capacity (kb per day)] Transmission costs are not counted
Discrete Algorithms and Bioinformatics - On the future of genomic data [Kahn11] and good algorithms (time, space) - A “Meteorological map on the Data Flood”: [96, 02] [02, 06] [06, 08] [08, -]
Discrete Algorithms and Bioinformatics Questions: 1. How long does it take for a “foundational advance” in algorithmic theory to be perceived as such in bioinformatics and be applied, as a proof of principle or as the basis for a tool? 2. Is such a delay related to the “meteorological map” outlined earlier?
Algorithmic Theory and Bio Impact - Four small case studies: - Suffix trees in Computational Biology - Data Compression of biological sequences - Genome scale sequence alignment - Compressive Genomics
Suffix Trees and Comp. Bio. Suffix Tree for the sequence banana$
Suffix Trees and Comp. Bio. Why Useful Searching Word Statistics Data Compression Etc, etc
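To make the searching and word-statistics uses concrete, here is a minimal sketch. It builds an uncompacted suffix trie (not a true suffix tree) naively in quadratic time, so it only illustrates the idea; the function names `build_suffix_trie` and `count_occurrences` are hypothetical, chosen for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2) time and space)."""
    root = {"count": 0, "children": {}}
    for start in range(len(text)):
        node = root
        node["count"] += 1
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            # count = number of suffixes passing through this node,
            # i.e. number of occurrences of the string spelled from the root
            node["count"] += 1
    return root


def count_occurrences(trie, pattern):
    """Word statistics: how many times pattern occurs in the indexed text."""
    node = trie
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return 0  # pattern does not occur
    return node["count"]


trie = build_suffix_trie("banana$")
print(count_occurrences(trie, "ana"))
print(count_occurrences(trie, "nab"))
```

Each query walks at most one edge per pattern character, so the search time depends on the pattern length rather than on the text length, which is the property that makes suffix trees attractive for large sequences.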
Suffix Trees and Comp. Bio. A brief history: Weiner 75 McCreight 76 Manber and Myers 93 - Suffix arrays Ukkonen 95 Gusfield 97: Algorithms on strings, trees and sequences: Computer Science and Computational Biology, Cambridge University Press Gusfield and Stoye 98 Etc., etc.
Suffix Trees and Comp. Bio. Compressed suffix arrays and Self-Indexes Ferragina and Manzini 2000 Grossi and Vitter 2000 Proof of Principle in Comp. Biology: index with a 2G footprint for the Human Genome Sadakane and Shibuya 2001 Lippert 2002
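The search mechanism behind this family of self-indexes can be sketched as a backward search over the Burrows-Wheeler transform. This is an illustration only: it is not compressed, the suffix array is built by naive sorting, and occurrence counts are computed by scanning, whereas real indexes use rank structures. The function names are hypothetical.

```python
from collections import Counter


def bwt_via_suffix_array(text):
    """Return (suffix array, BWT) of text; text is assumed to end with a unique '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the final '$'
    return sa, bwt


def backward_search(bwt, pattern):
    """Count occurrences of pattern in the text, using only its BWT."""
    counts = Counter(bwt)
    first = {}  # first[c] = number of characters in the text smaller than c
    total = 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    lo, hi = 0, len(bwt)  # current range of suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + bwt[:lo].count(c)  # naive rank; real indexes answer this in O(1)
        hi = first[c] + bwt[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


sa, bwt = bwt_via_suffix_array("banana$")
print(bwt)
print(backward_search(bwt, "ana"))
```

The point of the design is that the text itself is never consulted during the search: the transformed string plus small counting tables are enough, and the transformed string is what gets compressed.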
Suffix Trees and Comp. Bio. Compressed suffix arrays and Self-Indexes Ferragina and Navarro 2005 The Pizza&Chili corpus: highly tuned collections of implementations ready for download and use Välimäki et al. 2007 Experimental study for CSA as a genome scale sequence analysis tool
Suffix Trees and Comp. Bio. Compressed suffix arrays and Self-Indexes Vyverman et al. 2012: prospects and limitations of full-text indexes in genome analysis Essential for: Read mapping, e.g. Bowtie Short read error correction, genome assembly
Genome scale alignments - MUMmer 1 and 2 - Delcher et al. 1999, 2002 - LAGAN and Multi-LAGAN - Brudno et al. 2003 - Suffix trees: Weiner 75, McCreight 76, Manber and Myers 93, Ukkonen 95 - Sparse Dynamic Programming: H77, HS77, AG87, EGGI92
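As an illustration of sparse dynamic programming, the sketch below computes the length of a longest common subsequence by visiting only the matching position pairs instead of the full quadratic table. It follows the match-list idea usually attributed to Hunt and Szymanski, under the assumption that this is what HS77 refers to; the function name is hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict


def lcs_length_sparse(a, b):
    """Length of a longest common subsequence, touching only matching pairs (i, j)."""
    positions = defaultdict(list)  # character -> increasing list of positions in b
    for j, ch in enumerate(b):
        positions[ch].append(j)

    # tails[k] = smallest position in b at which a common subsequence
    # of length k + 1 can end, given the prefix of a processed so far
    tails = []
    for ch in a:
        # Decreasing order of j keeps two matches of the same a[i]
        # from being chained together.
        for j in reversed(positions.get(ch, [])):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)


print(lcs_length_sparse("ABCBDAB", "BDCABA"))
```

The running time is proportional to the number of matches times a logarithmic factor, which pays off when matches are sparse, for example when the "characters" being compared are long exact matches between two genomes.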
Data Compression [figure: algorithm A maps input X to output Y, lossless or lossy; hopefully |Y| < |X|]
Data Compression Data Compression in Computational Biology, Giancarlo, Scaturro, Utro, 2009 Compressive Sequence Analysis, Giancarlo, Rombo, Utro 2014 General compression-Rich history, 1948…
Data Compression Compression of biological sequences, Grumbach and Tahi 1993 Period 1993-2007: “only” 17 new methods specialized to biological sequences Period 2008-2013: 36…and counting new methods specialized to NGS data and large genomic sequence collections- a couple of fundamentally new ideas are present: problem to be studied
Compressive genomics In a nutshell: Algorithm A solves problem P on input x=AAAAAAAAAAACCCCCCCCGGGGGG Algorithm A’ solves problem P on input x’= (A,11); (C,8); (G,6) OUTPUT IS THE SAME
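A minimal sketch of this idea, taking GC content as a stand-in for problem P (an assumption made for this example; the slide does not name a specific problem). The function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(seq):
    """Compress a sequence into (symbol, run length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(seq)]


def gc_content_plain(seq):
    """Algorithm A: works on the uncompressed input, one symbol at a time."""
    return sum(ch in "GC" for ch in seq) / len(seq)


def gc_content_compressed(runs):
    """Algorithm A': works directly on the compressed input, one run at a time."""
    total = sum(n for _, n in runs)
    gc = sum(n for ch, n in runs if ch in "GC")
    return gc / total


x = "A" * 11 + "C" * 8 + "G" * 6
x_compressed = run_length_encode(x)
print(x_compressed)
print(gc_content_plain(x) == gc_content_compressed(x_compressed))
```

A' never decompresses its input, so its running time depends on the number of runs (3 here) rather than on the sequence length (25 here).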
Compressive Genomics Protein DataBase BLAST Searches on a compressed DataBase, Berger et al. 2012, 2013 Compressed Indexing and DNA Local Alignment, Lam et al. 2008 String Matching over compressed text, Amir et al. 1994 A sub-quadratic sequence alignment algorithm over compressed text, Crochemore et al. 2003
Discrete Algorithms and Bioinformatics - A data deluge…ehm, universal - Remedies Part 1: Historia Magistra Vitae (history is the teacher of life) - A. Apostolico and M. Crochemore, String pattern matching for a deluge survival kit, 2002 - B. Berger, J. Peng, M. Singh, Computational solutions for omics data, 2013
Discrete Algorithms and Bioinformatics - A data deluge…ehm, universal - Remedies Part 2: - Algorithmic foundational work to: Break the Big Data Wall!!!
Discrete Algorithms - New algorithmic design paradigms External Memory algorithms: Input data reside on disk and are too big to fit in memory. Aggarwal and Vitter 1988. An area that has reached full maturity; Comp. Bio. may be reasonably happy with it. ReCoil, Yanovsky 2011: Compression of embarrassingly large DNA sequence collections. Bauer et al. 2012: Lightweight LCP construction for Next Generation Sequencing Datasets
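A classic instance of the external-memory paradigm is merge sort over runs stored on disk. The sketch below is a simplified illustration, not the method of any of the cited papers: it assumes the items are newline-free strings (for example reads), and `chunk_size` is a hypothetical stand-in for the amount of data that fits in memory.

```python
import heapq
import os
import tempfile


def external_sort(items, chunk_size):
    """Yield items in sorted order, holding at most chunk_size of them in memory per run."""
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and write it to disk as one sorted run.
        if not chunk:
            return
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(s + "\n" for s in chunk)
        run_paths.append(path)
        chunk.clear()

    for s in items:
        chunk.append(s)
        if len(chunk) >= chunk_size:
            flush()
    flush()

    files = [open(p) for p in run_paths]
    try:
        # One lazy stream per run; the merge keeps one pending item per run in memory.
        streams = [(line.rstrip("\n") for line in f) for f in files]
        yield from heapq.merge(*streams)
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)


reads = ["GATTACA", "ACGT", "TTAGG", "ACGA", "CCCG", "ACGT"]
print(list(external_sort(reads, chunk_size=2)))
```

The data is read and written sequentially in large blocks, which is the access pattern the external-memory model rewards.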
Discrete Algorithms and Bioinformatics - New algorithmic design paradigms Algorithms on Data Streams: the volume of data is so large that one cannot even store it. Data is produced “in a stream” and cannot be kept in memory. M. Henzinger, P. Raghavan, S. Rajagopalan 1999. Probably not very good for Comp. Bio.
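To give a flavour of the streaming model, here is a sketch of the Misra-Gries frequent-items summary, a standard one-pass technique swapped in for illustration (it is not taken from the cited paper). It reads each item once and keeps at most k - 1 counters, whatever the stream length; the function name is hypothetical.

```python
def misra_gries(stream, k):
    """One pass, at most k - 1 counters.

    Every item occurring more than n / k times in a stream of length n
    is guaranteed to be among the returned keys (counts are underestimates).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


summary = misra_gries("ACGTAAAAAAAA", k=2)
print(summary)
```

The summary may also contain items that are not frequent; a second pass over the data, when one is possible, can verify the candidates.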
Discrete Algorithms - New algorithmic design paradigms - Succinct data structures: storing data in small space - G.J. Jacobson, 1988 - Promising for Comp. Bio. Full Text Self-Indexes. Bloom Filters: Pell et al., 2012: 40-fold reduction in memory requirement for metagenome assembly. Bloom Filters were invented in 1970
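A Bloom filter storing the k-mers of a sequence can be sketched as follows. This is a minimal illustration, not the implementation used by Pell et al.: deriving the hash functions from seeded SHA-256 digests and the sizes used in the usage lines are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, tunable false-positive rate."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # One bit position per hash function, derived from a seeded digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


bf = BloomFilter(num_bits=1024, num_hashes=3)
for kmer in kmers("GATTACAGATTACA", 4):
    bf.add(kmer)
print("GATT" in bf)
```

The filter stores only bits, never the k-mers themselves, which is where the memory saving comes from; the price is that a query for an absent k-mer may occasionally answer yes.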
Discrete Algorithms - New algorithmic design paradigms - Synopsis Data Structures: Only a “relevant summary” of the data is kept - Gibbons and Matias, 1998. No use yet in Comp. Bio., but very promising because of their success in DataBase System design: Iceberg Queries
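One way to picture a synopsis answering an iceberg query (report the items whose frequency exceeds a threshold) is a Count-Min sketch. It is swapped in here as a well-known example of a small summary and is not the structure of the cited paper; the table sizes and the seeded SHA-256 hashing are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates never fall below the true count."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a row-specific hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))


def iceberg_candidates(sketch, candidates, threshold):
    """Items whose estimated frequency reaches the threshold (may include false positives)."""
    return [c for c in candidates if sketch.estimate(c) >= threshold]


sketch = CountMinSketch(width=64, depth=4)
words = ["ACG"] * 50 + ["TTA"] * 3 + ["GGC"] * 2
for w in words:
    sketch.add(w)
print(iceberg_candidates(sketch, ["ACG", "TTA", "GGC"], threshold=10))
```

The memory used is fixed by `width` and `depth` alone, independent of how many items are summarized.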
Discrete Algorithms - New algorithmic design paradigms - Approximation algorithms: well known for hard problems, e.g. TSP, genome assembly. New: use them for “resource bounded” problems in order to obtain approximations with performance guarantees. Already in use in Comp. Bio. WITHOUT the performance guarantee part…
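As an example of an approximation algorithm for a hard problem, the sketch below applies the greedy merging strategy to the shortest common superstring problem, a simplified textbook model of genome assembly (an assumption of this example; real assemblers are far more involved). Constant-factor bounds on the output length have been proved for this greedy strategy. The function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(reads):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    unique = sorted(set(reads))  # sorted only to make the result deterministic
    # Reads contained in another read add nothing to the superstring.
    frags = [r for r in unique if not any(r != s and r in s for s in unique)]
    while len(frags) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ACGTAC", "GTACGA", "ACGATT"]
print(greedy_superstring(reads))
```

The output always contains every read as a substring; what the approximation guarantee bounds is how much longer than the optimum it can be.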
Conclusions - Since the late 80’s, a solid bridge has been built between Algorithmic Research and Bioinformatics and Comp. Bio. - Algorithmic Research seems to be asking the right questions in foundational terms for “BIG DATA” - Biology is a privileged testbed, with a turning point in attention around 1997 - The claim that algorithmic research “does not listen to Comp. Bio. needs” is an urban legend. Having fun learning about algorithmic theory? We do, learning about biology!!!