From Kolmogorov and Shannon to Bioinformatics and Grid Computing Raffaele Giancarlo Dipartimento di Matematica, Università di Palermo.


Aim: Give a flavour of fundamental novel discoveries about indexing and compression: a string, and any compact encoding of it, is the best index for itself. Give a flavour of some fundamental novel discoveries about distance functions and classification, particularly relevant for Bioinformatics. On the way, mention uses of suffix trees, suffix arrays, the Burrows-Wheeler Transform, Move-to-Front… In 30 min., an incredibly long journey: from Kolmogorov and Shannon to Grid Computing. References: available on-line.

What do we mean by “Indexing”? Types of data: raw sequences of characters or bytes (DNA sequences, audio-video files, executables). Types of query: character-based queries (arbitrary substring). Indexing approaches: full-text indexes (suffix array, suffix tree, …).

What do we mean by “Compression”? Any algorithm that squeezes data: lossless or lossy. Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServers x330MXT: same performance as a PC with double the memory, but at half the cost. Moral: it is more economical to store data in compressed form than uncompressed; CPU speed nowadays makes (de)compression “costless”!!

What do we mean by “Classification”? Any tool that can group “related” objects together, e.g. the unaligned mitochondrial genomes. NCBI Classification.

Compression and Indexing: two sides of the same coin! Do we witness a paradoxical situation? An index injects redundant data, in order to speed up the pattern searches. Compression removes redundancy, in order to squeeze the space occupancy. NO, new results proved a mutual reinforcement behaviour! Better indexes can be designed by exploiting compression techniques (in terms of space occupancy). Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio). Classification is the “third side” of the coin: Kolmogorov Complexity, Information Theory, Compression and Indexing.

Our journey, today... Index design (Weiner ’73): Suffix Array (1990). Compressor design (Shannon ’48): Burrows-Wheeler Transform (1994). Compressed Index: space close to gzip, bzip; query time close to O(|P|). Compression Booster: a tool to transform a poor compressor into a better compression algorithm. Universal Distances and Classification: Kolmogorov.

First Lap… in record time!!! Investigate: Indexing ideas → Compressor design (Booster).

Key Idea 1: Suffix Tree [Weiner 73, McCreight 76, Ukkonen 92]. String: mississippi# [Figure: suffix tree of mississippi#; edges are labeled with substrings such as ssi, ppi# and ssippi#, and leaves with the starting positions 1 to 12 of the suffixes]
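To make the suffix tree idea concrete, here is a minimal sketch of its uncompacted cousin, a suffix trie. It is not one of the linear-time constructions cited on the slide: building it this way takes quadratic time and space. The function names and the use of the `None` key to mark a leaf are illustrative choices.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie.

    Assumes text ends with a unique terminator such as '#', so every
    suffix ends at its own leaf. The leaf stores the 1-based starting
    position of the suffix under the key None.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[None] = start + 1
    return root


def occurrences(trie, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    # Every leaf below this node is a suffix having pattern as a prefix.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key is None:
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("mississippi#")
print(occurrences(trie, "ssi"))
```

Walking down from the root costs one step per pattern character, which is the behaviour the suffix tree preserves while using only linear space.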

Key Idea 2: Burrows-Wheeler Compression (1994). Let us be given a string s = mississippi#
Rotations of s: mississippi#, ississippi#m, ssissippi#mi, sissippi#mis, issippi#miss, ssippi#missi, sippi#missis, ippi#mississ, ppi#mississi, pi#mississip, i#mississipp, #mississippi
Sort the rows (the last column is bwt(s)):
#mississipp i
i#mississip p
ippi#missis s
issippi#mis s
ississippi# m
mississippi #
pi#mississi p
ppi#mississ i
sippi#missi s
sissippi#mi s
ssippi#miss i
ssissippi#m i
bwt(s) = ipssm#pissii
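The construction on this slide can be sketched directly: build every rotation, sort the rows, read off the last column. This is a naive illustration (quadratic space), not how production compressors compute the transform; the function name `bwt` is an assumption of this sketch.

```python
def bwt(s):
    """Naive Burrows-Wheeler Transform via sorted cyclic rotations.

    Rows are compared with Python's default string ordering, so the
    terminator '#' sorts before the lowercase letters.
    """
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last character of every sorted row.
    return "".join(row[-1] for row in rotations)


print(bwt("mississippi#"))
```

Practical implementations usually derive the same column from a suffix array instead of materializing all rotations.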

Burrows and Wheeler Compression. Why it works: BWT creates a locally homogeneous string: abaababa → bbbaaaaa. MTF (Move-to-Front) transforms a string such as bbbaaaaa into a globally homogeneous sequence of integers. The final string is “easy” to compress. Experimentally: compressibility is proportional to the % of zeros.
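A small sketch of Move-to-Front may help here. It assumes the symbol list starts out as the alphabet in sorted order (the slide does not state the initial order), and the helper `zero_fraction` is a hypothetical name for the "% of zeros" measure mentioned above.

```python
def move_to_front(text, alphabet):
    """Encode text as a list of ranks, moving each seen symbol to the front."""
    symbols = list(alphabet)  # assumed initial order: as given by the caller
    ranks = []
    for ch in text:
        rank = symbols.index(ch)
        ranks.append(rank)
        # Move the symbol just seen to the front of the list.
        symbols.insert(0, symbols.pop(rank))
    return ranks


def zero_fraction(ranks):
    """Fraction of zeros in an MTF output."""
    return ranks.count(0) / len(ranks)


print(move_to_front("bbbaaaaa", "ab"))  # after BWT
print(move_to_front("abaababa", "ab"))  # before BWT
```

Under these assumptions the transformed string bbbaaaaa yields mostly zeros, while the original abaababa yields mostly ones, which illustrates why runs of equal symbols matter.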

Boosting [Ferragina, Giancarlo, Manzini, Sciortino, 03,04,05]. The technique takes a poor compressor A and turns it into a compressor A_boost with a better performance guarantee. [Figure: A compresses s into c; the Booster, using A, compresses s into c’] The better A is, the better A_boost is. The more compressible s is, the better A_boost is. Qualitatively, it can be shown: c’ is shorter than c, if s is compressible; Time(A_boost) = Time(A), i.e. no slowdown; A is used as a black box.

Second Lap… even faster: Compressed Indexes. We investigated: Index ideas → Compression design. Let’s now turn to the other direction: Compression ideas → Index design.

Suffix Array vs. BW-transform (T = mississippi#)
SA  Rotated text  L
12  #mississipp  i
11  i#mississip  p
8   ippi#missis  s
5   issippi#mis  s
2   ississippi#  m
1   mississippi  #
10  pi#mississi  p
9   ppi#mississ  i
7   sippi#missi  s
4   sissippi#mi  s
6   ssippi#miss  i
3   ssissippi#m  i
L includes SA and T. Can we search within L?
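The relation between the two columns can be checked with a short sketch: L is the character preceding each suffix in suffix-array order. The code uses 0-based positions (the slides use 1-based ones), builds the suffix array by naive sorting, and searches it by binary search; all function names are illustrative.

```python
def suffix_array(text):
    """Naive suffix array: start positions (0-based) of the sorted suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text, sa):
    """L[i] is the character just before suffix sa[i], wrapping around at 0."""
    return "".join(text[i - 1] for i in sa)


def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that have pattern as a prefix."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "mississippi#"
sa = suffix_array(text)
print(sa)
print(bwt_from_suffix_array(text, sa))
print(find_occurrences(text, sa, "ssi"))
```

The binary search needs both T and SA in memory; the compressed index removes that requirement by searching within L alone.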

A compressed index [Ferragina-Manzini, IEEE FOCS 2000]. In practice, the index is very appealing: space close to the best known compressors, i.e. bzip; query time of a few milliseconds on hundreds of MBs. The theoretical result: query complexity O(p + occ log^ε N) time; space occupancy O(N H_k(T)) + o(N) bits, where H_k(T) is the k-th order empirical entropy.
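The core of searching within L is the backward search, sketched below for counting only. This is a simplified illustration, not the data structure of the cited paper: the occurrence counts are recomputed by scanning L, whereas a real compressed index answers them in constant time from compressed tables. The function name and the half-open row range are choices of this sketch.

```python
from collections import Counter


def backward_count(L, pattern):
    """Count occurrences of pattern in the text whose BWT is L."""
    counts = Counter(L)
    # first_row[c] = number of characters in L strictly smaller than c,
    # i.e. the first sorted row that starts with c.
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def occ(c, i):
        # Occurrences of c in L[0:i]; naive scan for clarity.
        return L[:i].count(c)

    lo, hi = 0, len(L)  # half-open range of rows prefixed by the current suffix
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


print(backward_count("ipssm#pissii", "ssi"))
```

The loop runs once per pattern character, which is where the O(p) term of the query bound comes from; locating the occ positions is the part that costs extra.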

Third Lap… Universal Distances and Classification

Large Data Sets. Classification of sequences on a genome-wide scale: distances based on alignments are either not applicable or too slow; fast and reliable alignment-free methods are badly needed. Classification of proteins, both for function and structure: lagging behind sequence data.

Proteins and Their String Representations: amino acid sequence (FASTA format); atomic coordinates (Atom lines).

Protein Representations: Topologic Models (Top Diagrams)

Kolmogorov Complexity. The Kolmogorov Complexity K(x) of a string x is defined as the length of the shortest binary program that produces x. The conditional Kolmogorov Complexity K(x|y) represents the minimum amount of information required to generate x by an effective computation when y is given as an input to the computation. The Kolmogorov Complexity K(x,y) of a pair of objects x and y is the length of the shortest binary program that produces x and y and a way to tell them apart.

Universal Similarity Metric (USM). Problem: USM(x,y) is based on Kolmogorov Complexity, which is non-computable in the Turing sense. Solution: K(x) can be approximated via data compression by using its relationship with Shannon Information Theory. USM is a methodology rather than a formula quantifying the similarity of two strings.

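Approximations of USM. K(x) can be approximated by C(x), K(x,y) by C(xy) and K(x|y*) by C(xy) - C(x), where C(·) denotes the length of the output of a compression algorithm. We obtain three approximations to USM (UCD, NCD and CD). [Formulas shown on the slide are not preserved in the transcript]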
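As a rough illustration of compression-based dissimilarity, the sketch below computes two measures in their commonly cited forms: the normalized compression distance, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and the compression dissimilarity, C(xy) / (C(x) + C(y)). These may differ in detail from the formulas on the slide, which are not available here, and UCD is omitted for that reason. Using zlib as the compressor C is an assumption made only to keep the example self-contained.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """C(x): length in bytes of the zlib-compressed input (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance in its commonly cited form."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def cd(x: bytes, y: bytes) -> float:
    """Compression dissimilarity in its commonly cited form."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


a = b"mississippi" * 100
b = b"ACGTTGCAAGCT" * 100
print(ncd(a, a), ncd(a, b))
print(cd(a, a), cd(a, b))
```

Smaller values suggest more shared structure. In practice the choice of compressor matters a great deal, which is one of the questions the experiments address.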

Experiments [Ferragina, Giancarlo, Greco, Manzini, Valiente, 2007]. Experimental setup: five benchmark datasets of proteins (several alternative representations); a benchmark dataset of genomic sequences (complete unaligned mitochondrial genomes); twenty-five compression algorithms; three dissimilarity functions based on USM. Two sets of experiments to compare USM with methods both based on alignments and not: via ROC analysis; via UPGMA and NJ.

An example: unaligned complete mitochondrial DNA genomes

Results and Conclusions. Useful guidelines for the use of the USM methodology for biological investigation: which compressor to use; which among UCD, NCD and CD to use; which data representation is best; etc.

Software: Kolmogorov Library. Sequential processing is too slow even for relatively small data sets: the classification of 278 files (1.5Mb) takes 12 hours on a state-of-the-art PC, but half an hour on the Grid. Soon available as a Grid-aware Web Service on the COMETA Portal.

Advertisement 2: 20th Edition of the Lipari International Summer School for Computer Scientists. TOPIC: Algorithms, Science and Engineering. See the Lipari School Website.