13th Portugaliae Genetica, IPATIMUP, Porto, 19 March 2010
New applications of alignment-free methods for biological sequence analysis and comparison
Susana Vinga
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

The -omics era

Bioinformatics
Multidisciplinary area
–Experimental (manipulation and measurement)
–Computational (mining and modeling)
–Computer science, dynamic systems, statistics, graph theory, molecular biology, biochemistry, physiology, …
Bioinformatics & Computational Biology: alignment-free methodologies based on vector maps
Genome analysis, Sequence analysis, Structural bioinformatics, Gene expression, Genetics and population analysis, Systems biology, Databases and ontologies, Phylogenetics, Data and text mining
http://bioinformatics.oupjournal.org

Outline
Motivation & Introduction
–Biological sequence analysis and comparison: alignment-based and alignment-free strategies
–Modeling strategies, problem definition and concepts
Methods & Examples (global and local analysis)
–Resolution-based methods: L-tuple composition
–Iterated function systems: CGR/USM and genomic signatures
–Information theory: Rényi entropy of DNA
–Markov chain models: statistical significance of motifs
–Entropic profiles
Summary & Conclusions

DNA GenBank
http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
How to classify, analyze, and integrate this increasing amount of complex information?

Sequence alignment
The fundamental idea of alignment is that sequences that share the same substrings might have the same function or be related by homology.

Alignment-based algorithms
Needleman-Wunsch (1970), Smith-Waterman (1981)
Build up the best alignment by using optimal alignments of smaller subsequences
Example of dynamic programming (Richard Bellman, 1953):
–A divide-and-conquer strategy:
–Break the problem into smaller subproblems.
–Solve the smaller problems optimally.
–Use the sub-problem solutions to construct an optimal solution for the original problem.
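The dynamic-programming idea on this slide can be illustrated with a minimal sketch of Needleman-Wunsch global alignment scoring. The scoring values (match=1, mismatch=-1, gap=-1) and the function name are assumptions chosen for illustration; they do not come from the presentation.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of a and b with a linear gap penalty.

    The scoring parameters are illustrative assumptions.
    """
    # prev[j] is the best score for aligning the first i-1 symbols of a
    # with the first j symbols of b (row 0: b[:j] against gaps only).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Reuse the three smaller subproblems: substitution, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Each cell reuses the optimal scores of three smaller subproblems, which is the construction described in the bullets above. Only the score is returned; recovering the alignment itself would need a traceback over the full table.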

Alignment-based algorithms

Multiple alignment: Clustal

BLAST
Times Cited (ISI): 27,556

Alignment-free methods review (2003)

Vector-valued functions of biological sequences
GAATTCT AATCTCC CTCTCAA CCCTACA GTACCCA → (f_1, …, f_n)

Words in sequences: L-tuple composition
Count “words” in the sequence
Example: for the DNA sequence X=GTGTGA, extract and count (overlapping) 3-tuples
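The counting step for the example sequence can be sketched as follows. The function name `ltuple_counts` is hypothetical.

```python
from collections import Counter


def ltuple_counts(seq, L):
    """Count overlapping L-tuples (words of length L) in seq."""
    # Slide a window of width L one symbol at a time.
    return Counter(seq[i:i + L] for i in range(len(seq) - L + 1))


counts = ltuple_counts("GTGTGA", 3)
# The four overlapping windows are GTG, TGT, GTG, TGA.
```

Dividing each count by the number of windows (here 4) gives the frequency vector that the dissimilarity measures operate on.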

Metrics and dissimilarities
–Euclidean
–Cosine
–Minkowski, with city-block as the special case m=1
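The measures named on this slide can be sketched for two L-tuple frequency vectors. The slide's formulas did not survive in the transcript, so the definitions below are the standard textbook ones; in particular, taking the cosine dissimilarity as one minus the cosine of the angle is an assumption, as are all function names.

```python
import math
from collections import Counter
from itertools import product


def frequency_vector(seq, L, alphabet="ACGT"):
    """Relative frequencies of all |alphabet|**L words, in lexicographic order."""
    windows = max(len(seq) - L + 1, 1)
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    return [counts["".join(w)] / windows for w in product(alphabet, repeat=L)]


def minkowski(f, g, m):
    """Minkowski distance of order m."""
    return sum(abs(a - b) ** m for a, b in zip(f, g)) ** (1.0 / m)


def euclidean(f, g):
    return minkowski(f, g, 2)


def city_block(f, g):
    return minkowski(f, g, 1)


def cosine_dissimilarity(f, g):
    """One minus the cosine of the angle between f and g (assumed form)."""
    dot = sum(a * b for a, b in zip(f, g))
    norm = math.sqrt(sum(a * a for a in f)) * math.sqrt(sum(b * b for b in g))
    return 1.0 - dot / norm
```

For example, `euclidean(frequency_vector(x, 3), frequency_vector(y, 3))` compares two DNA sequences `x` and `y` through their 64-dimensional 3-tuple profiles without aligning them.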

Dissimilarities
–L-tuple
–Resolution-free
Successful applications
Vinga (2007) Editors: T.D. Pham et al, pp. 71-105.

W-metric
Based on 1-tuple frequencies and PAM/BLOSUM weights
Vinga et al. Bioinformatics, 2004. 20(2): p. 206-215.

Transforming L-tuples
Processing frequency vectors: normalization, filtering, and feature selection, using algebraic and statistical tools
Normalize by expected frequencies (Pietrokovski et al., 1990)
–according to (L-1)-tuple and (L-2)-tuple frequencies: contrast L-vocabulary (CV)
Oligonucleotide bias (Rocha et al., 1998)
–over- and under-represented L-tuples in genomic datasets might indicate positive/negative selection, studied in B. subtilis
Shortest unique substrings (Haubold et al., 2005)
–Occur only once and cannot be further reduced in length without losing the property of uniqueness. Caenorhabditis elegans, human and mouse genomes
UniMarkers (Chen et al., 2002)
–fixed-length unique sequence markers that might be used to assign the genomic positions of SNP sites. UMs appear only once in the genome, which allows SNPs to be located much faster than with alignment-based methods.
Create synteny maps (Liao et al., 2004)

Human beta globin genes

HUMHBB classification

10 European Languages

Natural languages
http://commons.wikimedia.org/wiki/File:Languages_of_Europe.png

Phylogenetic inference
Genome trees and the nature of genome evolution (Snel et al., Annual Rev Microbiol, 2005)
–Alignment-free genome trees: reconstruction methods use a statistic of the entire genomic DNA, or of all encoded proteins, to derive a distance between genomes that is then used to cluster them. “The fact that these alignment-free methods do not incorporate so much standard molecular evolutionary methodology and proven powerful evolutionary concepts raises interesting questions, especially because they perform reasonably well”
–Kolmogorov complexity (Li et al. Bioinformatics 2001) and Lempel-Ziv complexity (Otu et al. Bioinformatics 2003): mitochondria
–Qi J et al. (2004); Volkovich Z et al. (2010); Zheng XQ et al. (2009)
–Haubold et al. (2009): estimate mutation distances
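To give a feel for complexity-based distances, the following is a minimal sketch of the normalized compression distance (NCD), which approximates Kolmogorov complexity with an off-the-shelf compressor. This is a related, generic technique and not the exact distance used in the papers cited above; the use of zlib and all names are assumptions.

```python
import random
import zlib


def compressed_size(s):
    """Size in bytes of the zlib-compressed string (a crude complexity proxy)."""
    return len(zlib.compress(s.encode("ascii"), 9))


def ncd(x, y):
    """Normalized compression distance between two sequences."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Two unrelated pseudo-random DNA sequences (fixed seed for reproducibility).
rng = random.Random(0)
x = "".join(rng.choice("ACGT") for _ in range(500))
y = "".join(rng.choice("ACGT") for _ in range(500))

d_self = ncd(x, x)    # x helps compress itself, so this is small
d_other = ncd(x, y)   # unrelated sequences share little, so this is larger
```

A matrix of such pairwise distances can then be fed to any standard clustering or tree-building method.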

Hybrid approaches
Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
Sperisen, P. & Pagni, M. (2005). JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture. BMC Bioinformatics, 6, 216.

Iterated Function Systems (IFS)
–Chaos Game Representation (CGR) for DNA
–Higher-order generalization: Universal Sequence Maps (USM)
–Represent discrete sequences in a continuous map (bijection)
–Relation with Markov chains and suffix properties
Jeffrey, H. J. (1990). Chaos game representation of gene structure. Nucleic Acids Res, 18(8):2163–2170.
Almeida, J. S. and Vinga, S. (2002). Universal sequence map (USM) of arbitrary discrete sequences. BMC Bioinformatics, 3(1):6.

CGR/USM Algorithm
[Figure: unit square with corners (0,0), (0,1), (1,0), (1,1), one per nucleotide A, C, G, T; the successive prefixes of X=ATGCGAGTGT... (A, AT, ATG, ATGC, ...) are plotted as points]
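The iteration behind the figure can be sketched as below: each new point is placed halfway between the previous point and the corner of the current symbol. The corner assignment (A, C, G, T to (0,0), (0,1), (1,1), (1,0)) and the starting point (0.5, 0.5) are assumptions, since the transcript does not preserve which corner carries which nucleotide.

```python
# Assumed corner layout; the slide's actual assignment is not recoverable.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR coordinates of every prefix of seq."""
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        # Move halfway from the current point towards the symbol's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


trajectory = cgr_points("ATGCGAGTGT")
```

The last element of `trajectory` encodes the whole sequence, with the most recent symbols determining the most significant binary digits of each coordinate.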

CGR/USM Example
[Figure: CGR map with corners labelled A, C, G, T]

Suffix property
Example sequence: AAAGTAGGATAGTT
–Generalization of Markov chain models
–Transition probability tables: order k → 2^(k+1) divisions per axis
–A given L-tuple (e.g. AGT) is always mapped to the same region: fractal properties
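The suffix property can be checked numerically with a small sketch: two sequences that end in the same L-tuple land in the same cell of a 2^L by 2^L grid, whatever precedes that suffix. The corner layout and all names are assumptions made for illustration.

```python
# Assumed corner layout, as in any CGR sketch; not taken from the slides.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_last_point(seq, start=(0.5, 0.5)):
    """CGR coordinate reached after reading the whole sequence."""
    x, y = start
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y


def grid_cell(point, L):
    """Index of the cell containing point when each axis is cut into 2**L parts."""
    n = 2 ** L
    return int(point[0] * n), int(point[1] * n)


# Both sequences end in AGT, so they share a cell at resolution L=3.
cell_1 = grid_cell(cgr_last_point("AAAGTAGGATAGT"), 3)
cell_2 = grid_cell(cgr_last_point("CCGAGT"), 3)
# A different 3-suffix (AGG) falls in another cell.
cell_3 = grid_cell(cgr_last_point("CCGAGG"), 3)
```

Counting points per cell therefore reproduces the L-tuple counts, which is why the map generalizes Markov transition tables.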

Genomic signatures GENSTYLE Deschavanne et al. 1999 Mol Biol Evol Pervasive patterns even for ‘short’ DNA segments

Generalizations
How to maximally “fill” the space?
–Sierpinski triangle (n=3)
–CGR (n=4)
–…
Very academic…
Almeida and Vinga BMC Bioinformatics 2009 10:100

Information Theory
Definition: Shannon’s entropy
–N states with probabilities p_i, i=1,...,N
–L-tuple frequencies
Measures the randomness level of a given system (or predictability, complexity, ...)
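As a minimal sketch, Shannon's entropy H = -Σ p_i log2 p_i can be computed directly from L-tuple frequencies. Measuring in bits (base-2 logarithm) and the function names are choices made here, not taken from the slides.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def ltuple_entropy(seq, L):
    """Entropy of the empirical overlapping L-tuple distribution of seq."""
    counts = Counter(seq[i:i + L] for i in range(len(seq) - L + 1))
    total = sum(counts.values())
    return shannon_entropy(c / total for c in counts.values())
```

A sequence using all four nucleotides equally often reaches the 1-tuple maximum of 2 bits, while a homopolymer has entropy 0.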

Rényi entropy of order α of probability distributions (continuous and discrete)
Generalization of Shannon

Example: N=2 (coin)
A not necessarily fair coin…
Maximum entropy when p=0.5: highest unpredictability
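The coin example can be reproduced with a short sketch of the discrete Rényi entropy, H_α = ln(Σ p_i^α) / (1 - α), with the Shannon entropy as the α → 1 limit. Natural logarithms and the function name are choices made for illustration.

```python
import math


def renyi_entropy(probabilities, alpha):
    """Discrete Rényi entropy of order alpha (natural log); alpha=1 gives Shannon."""
    probs = [p for p in probabilities if p > 0]
    if alpha == 1:
        # Limit alpha -> 1 recovers Shannon's entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)


def coin_entropy(p, alpha):
    """Entropy of a coin that lands heads with probability p."""
    return renyi_entropy([p, 1.0 - p], alpha)
```

For every order α the fair coin (p=0.5) gives the maximum value ln 2, and any bias lowers the entropy.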

Global Rényi entropy of DNA sequences
1. Represent the DNA sequence: Chaos Game Representation/Universal Sequence Maps (CGR/USM)
2. Estimate the probability density function (pdf): Parzen’s window method using normal (Gaussian) kernels with different variances σ²
3. Calculate the entropy of the estimated pdf: Rényi continuous entropy of DNA (simplification)
4. Compare with a random model

Algebraic simplification
When using Gaussian kernels, the Rényi quadratic entropy (α=2) simplifies to a closed form.
–Exact results
–No numerical integration
Vinga & Almeida 2004 J Theor Biol
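The slide's formula is not preserved in the transcript. The sketch below uses the standard closed form for a Parzen estimate with isotropic Gaussian kernels, H_2 = -ln[(1/N²) Σ_i Σ_j G(x_i - x_j; 2σ²)], which follows from the fact that the integral of a product of two Gaussians is a Gaussian with summed variances. Treating the density on the whole plane rather than restricting it to the unit square is an assumption, as are the names.

```python
import math


def renyi_quadratic_entropy(points, sigma):
    """Rényi quadratic entropy of a 2D Parzen estimate with Gaussian kernels.

    Closed form: no numerical integration, only a double sum over point pairs.
    """
    n = len(points)
    var = 2.0 * sigma ** 2                 # variances add when kernels are convolved
    norm = 1.0 / (2.0 * math.pi * var)     # 2D isotropic Gaussian normalization
    total = 0.0
    for xi, yi in points:
        for xj, yj in points:
            d2 = (xi - xj) ** 2 + (yi - yj) ** 2
            total += norm * math.exp(-d2 / (2.0 * var))
    return -math.log(total / n ** 2)


clustered = [(0.5, 0.5)] * 4
spread = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
```

Points that pile up (over-represented suffixes) lower the entropy, while points spread over the map raise it. The double sum is quadratic in the number of points, so long sequences would need a faster scheme than this direct loop.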

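Results – pdf estimation
[Figure: estimated pdf over the CGR map; the over-represented -ATC- motif appears as a high-density region, as expected from the suffix property]

Global Rényi entropy
[Figure: global Rényi entropy for sequences of length N=2000, ordered from less random to more random]

Random sequences - medians
[Figure: median entropy of random sequences; axes: ln N, length N]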

Local sequence information
Statistical significance of motifs
SMILE program: Structured Motifs Inference and Evaluation (Sagot et al.) and RISO (Carvalho et al. 2006)
–Exact algorithm to find motifs or models in sequence sets
–Calculates statistical significance of patterns found based on permutation tests and Markov chain models
Flexible input parameters. Example: * * * * * * (* * *) * * * * * (* *)
–BOX 1 = 6-9
–BOX 2 = 5-7
–SPACER = 15-19
–ERROR for each box and total

Application
Inference of conserved motifs
Organisms: E. coli, B. subtilis, H. pylori
Several promoter families:
–TTAAGC [19-23] TATAAT
–TTTTAA [10-14] TATAAT
–TTGACA [15-19] TATAAT
Vanet et al. 2000 JMB
High statistical significance → Biological significance?!?
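A structured motif of the form "box 1, spacer of variable length, box 2" can be located with a simple exact-match sketch based on a regular expression. This only illustrates the pattern notation above; it is not the SMILE or RISO algorithm, which infer motifs, allow errors in each box, and evaluate significance. The function name is hypothetical.

```python
import re


def find_structured_motif(seq, box1, box2, min_spacer, max_spacer):
    """Return (start, match) for each exact box1-[spacer]-box2 occurrence."""
    # A lookahead lets matches that start at overlapping positions all be reported.
    pattern = re.compile(
        "(?=(%s[ACGT]{%d,%d}%s))"
        % (re.escape(box1), min_spacer, max_spacer, re.escape(box2))
    )
    return [(m.start(1), m.group(1)) for m in pattern.finditer(seq)]


# Toy promoter: TTGACA, a 17-nt spacer, then TATAAT.
promoter = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
hits = find_structured_motif(promoter, "TTGACA", "TATAAT", 15, 19)
```

With the family TTGACA [15-19] TATAAT, a spacer of 17 nucleotides is accepted and one of 12 is rejected.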

Motifs
Xu, M. L. and Z. C. Su (2010). "A Novel Alignment-Free Method for Comparing Transcription Factor Binding Site Motifs." PLoS ONE 5(1).
Comin, M. and D. Verzotto (2010). "Classification of protein sequences by means of irredundant patterns." BMC Bioinformatics 11.
Casimiro, A.C., Vinga, S., Freitas, A.T., and Oliveira, A.L. (2008) An analysis of the positional distribution of DNA motifs in promoter regions and its biological relevance. BMC Bioinformatics 9, 89.

Distribution of motifs
–Analyze the position where the motif occurs
–Test for uniformity
Motifs that are both overrepresented and non-uniformly distributed might be biologically significant
S. cerevisiae promoter sequences; TFs Aft2p, Dal80p, Gln3p, Met4p and Gat1p
Casimiro, A.C., et al. (2008) BMC Bioinformatics 9, 89.

Uniformity tests
Small samples → choose the “best” uniformity test
–Chi-Square, Kolmogorov-Smirnov and bootstrap Chi-Square
–Optimize number of bins
–Specificity and ROC curves
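The chi-square variant can be sketched as follows: motif start positions within a promoter region are binned, and the observed bin counts are compared with the counts expected under a uniform distribution. The bin number, region length, and function name are illustrative assumptions, not the settings of the cited study.

```python
def chi_square_uniformity(positions, region_length, n_bins):
    """Chi-square statistic of binned positions against a uniform expectation."""
    observed = [0] * n_bins
    for p in positions:
        # Map position 0..region_length-1 to a bin; clamp the right edge.
        observed[min(p * n_bins // region_length, n_bins - 1)] += 1
    expected = len(positions) / n_bins
    return sum((o - expected) ** 2 / expected for o in observed)


uniform_stat = chi_square_uniformity(list(range(100)), 100, 10)
clustered_stat = chi_square_uniformity([0] * 100, 100, 10)
# With 10 bins there are 9 degrees of freedom; the 5% critical value is 16.919,
# so the clustered positions would be declared non-uniform.
```

For small samples the chi-square approximation is unreliable, which is why the slide also lists Kolmogorov-Smirnov and bootstrap alternatives.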

Applications of entropy to extract local information
–Linguistic complexity
–Coding vs. non-coding comparisons
–S. cerevisiae, H. influenzae
Low entropy = high repeatability
Troyanskaya et al. 2002 Bioinformatics
Crochemore & Vérin 1999 Comput Chem

Entropic Profiles
Local information plots that indicate overall conservation of motifs in genomes
Obtained by unfolding the probability density function used in Rényi global entropy estimation using CGR
A new tool implementation, based on new data structures and algorithmic simplifications, allows whole genomes to be processed in a few minutes.
http://kdbio.inesc-id.pt/software/ep/
Vinga and Almeida (2007) BMC Bioinformatics, 8:393
Fernandes et al. (2009) BMC Research Notes 2:72.

New kernel function applied to CGR
L: resolution; φ: smoothing
Almeida and Vinga Algorithms for Molecular Biology 2006 1:18

Properties
–CGR properties are maintained
–Domain is respected
–Estimation of pdf is also straightforward
Almeida and Vinga Algorithms for Molecular Biology 2006 1:18

EP algorithm
Calculation: simplified to suffix counts
Entropic profile for the i-th symbol s_i, with coordinate x_i
–L is the length resolution
–φ is a smoothing parameter
–Number of motifs (s_{i-k+1} … s_i) in the whole sequence
Vinga and Almeida (2007) BMC Bioinformatics
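The formula itself is missing from the transcript. The sketch below shows one way a local density at position i can be built purely from suffix counts, weighting the count of each suffix of length k by (4φ)^k and normalizing by Σ φ^k. This specific weighting and normalization are assumptions made for illustration and should be checked against Vinga and Almeida (2007); the function names are hypothetical.

```python
def count_overlapping(seq, word):
    """Number of (possibly overlapping) occurrences of word in seq."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i))


def local_density(seq, i, L, phi, alphabet_size=4):
    """Suffix-count-based density at position i (0-based); assumed form."""
    n = len(seq)
    numerator = 1.0
    for k in range(1, L + 1):
        if i - k + 1 < 0:
            break  # no suffix of length k ends at position i
        suffix = seq[i - k + 1:i + 1]
        numerator += (alphabet_size * phi) ** k * count_overlapping(seq, suffix) / n
    denominator = sum(phi ** k for k in range(L + 1))
    return numerator / denominator
```

Scanning i over the sequence and standardizing the resulting values (subtracting the mean and dividing by the standard deviation) would give a profile whose peaks mark over-represented suffixes. This naive version recounts every suffix from scratch; a practical tool would rely on a suffix-based data structure.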

Whole genome case-studies
For L>6 the Chi site motif emerges
[Figure panels: entropic profile, input parameters, statistical significance]

Whole genome case-studies
Maximum at L=8 (motif length); EP max > 7 std

Position study: Escherichia coli genome
Corresponds to a Chi site (Crossover Hotspot Instigator), 5’-GCTGGTGG-3’, a key region that modulates the exonuclease activity of RecBCD, an enzyme that is necessary for chromosomal dsDNA repair and integration of exogenous dsDNA.
Relevant and statistically significant segments can be detected in an unsupervised way by spanning the parameter space to find local maxima.

Motif study: Haemophilus influenzae genome
USS+ highly overrepresented; EP max > 10
[Figure: histogram]
Analysis of the motif that represents a USS+ (Uptake Signal Sequence), 5’-AAGTGCCGGT-3’. USSs are involved in natural competence, which is a genetically controlled form of horizontal gene transfer in some bacterial species.

EP conclusions
–Entropic profiles (EP) provide local information about global features of DNA
–Excellent performance for sequences up to 2 Gbp (time and memory)
–Whole-genome tests corroborate the strengths of this approach in detecting biologically meaningful DNA segments, related to the detection of local scales and the over- or under-representation of suffixes/motifs

Alignment-free algorithms
Advantages
–General approach with rich collection of methods
–Robust to shuffling and recombination events
–Applicable even when less conservation is present
–All genome information can be considered
–Computationally less intensive
Disadvantages
–Methods are less developed and integrated
–Limited detailed local information: hard to identify point mutations, indels
–Less discriminating for querying databases and genome searches
–Symbol order is sometimes neglected

Summary and conclusions
Global and local sequence analysis
–Metrics, CGR, information theory, statistics
Alignment-free techniques can provide new tools to classify, analyze, and integrate biological sequence data
“All Models Are Wrong But Some Are Useful” George E.P. Box

Acknowledgments
–Prof. Jonas S. Almeida
–KDBIO Group @ INESC-ID
–Biomathematics Group @ ITQB
–Project DynaMo (PTDC/EEA-ACR/69530/2006) from FCT
Thank you!
http://kdbio.inesc-id.pt/~svinga
