Large scale genomes comparisons: Bioinformatics aspects (Introduction). Fredj Tekaia, Institut Pasteur. EMBO Bioinformatic and Comparative Genome Analysis Course, Institut Pasteur, Paris, June 27 - July 9, 2011
Starting from genomes (whole sequence, whole gene sequences or whole protein sequences of given species), what do large-scale genome comparisons include?
Large-scale genome comparisons: Comparing a genome (in terms of whole sequence, whole set of predicted genes or whole set of predicted proteins) to itself (intra-species comparisons) or to another genome (inter-species comparisons).
Plan: Completely sequenced genomes; Large scale genome comparisons; Results mining: clusters of orthologs and analyses; References.
Large scale genome comparisons:
- Duplication;
- Conservation;
- Specificity (species-specific genes, proteins);
- Paralogues, orthologues;
- Families (clusters) of paralogues, of orthologues;
- Genome organisation (duplicated, conserved genes);
- Search for shared motifs in proteins of the same cluster;
- Protein conservation profiles;
- Selection pressure analyses (synonymous, non-synonymous substitutions, ...).
Comparative genomics: analysis and comparison of genomes from different species. It helps to understand the similarities and differences between genomes, their evolution and the evolution of their genes. Intra-genomic comparisons help to understand the degree of duplication (genome regions; genes) and gene organization. Inter-genomic comparisons help to understand the degree of similarity between genomes and the degree of conservation between genes, and to determine syntenic regions, i.e. regions conserved in different species.
Synteny. [Figure: genes 1a to 6a of Organism A and genes 2b, 3b, 4b, 7b, 8b, 9b of Organism B, with a block of synteny marked between the two organisms.]
Speciation - Duplication. [Figure: gene tree along a time axis showing duplication and speciation events for an ancestral gene G (copies G1 and G2) in species A and B, with leaves A-G1, A-G2, B-G1, B-G2 1 and B-G2 2. Labelled relationships: orthologs, inparalogs, outparalogs. Labelled events: speciation, duplication, loss of genes.] Can these events be predicted by comparing genomes?
Orthologs / Paralogs. How to detect orthologous genes? - Easy way: reciprocal best hit (RBH). [Figure: genes 1a, 2.1a, 2.2a, 3a of Organism A matched to genes 1b, 2.1b, 2.2b, 3b of Organism B.]
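The reciprocal best hit idea can be sketched in a few lines of Python. This is a minimal illustration, not the course's own implementation: the input format (a dictionary of pairwise scores), the function names and the scores are all assumptions made for the example, and the gene names are borrowed from the slide.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: {(query_id, subject_id): similarity_score},
# e.g. parsed from an all-against-all protein comparison.
Hits = Dict[Tuple[str, str], float]

def best_hits(hits: Hits) -> Dict[str, str]:
    """For each query, keep the subject with the highest score.
    On ties, the first hit encountered is kept (a simplification)."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, subject), score in hits.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b: Hits, b_vs_a: Hits) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)

# Made-up scores for the genes named on the slide.
a_vs_b = {("1a", "1b"): 200, ("2.1a", "2.1b"): 180, ("2.1a", "2.2b"): 90,
          ("2.2a", "2.2b"): 170, ("2.2a", "2.1b"): 95, ("3a", "3b"): 150}
b_vs_a = {("1b", "1a"): 200, ("2.1b", "2.1a"): 180, ("2.2b", "2.2a"): 170,
          ("2.2b", "2.1a"): 90, ("3b", "3a"): 150}
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

With these toy scores each gene of Organism A pairs with its counterpart in Organism B. RBH only reports one-to-one pairs, which is one reason tree-based and synteny-based checks are considered more rigorous.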
Orthologs / Paralogs. - More rigorous: make a phylogenetic tree of this gene family. [Figure: tree with leaves 1a, 1b, 2.1a, 2.1b, 2.2a, 2.2b, 3a, 3b.] - More rigorous: look at synteny conservation. [Figure: genes 1a, 2a, 3a of Organism A compared with genes 1b, 2.1b, 2.2b, 3b, 4b, 5b of Organism B.]
Large scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion. [Figure: ancestor species genome. Evolutionary processes include Expansion* (duplication, genesis), Exchange* (HGT), Deletion* (loss), Phylogeny* and selection*.]
Gene duplications are traditionally considered a major evolutionary source of new protein functions. Understanding how duplications happened and how important this evolutionary process is remains a key goal of genome analysis. > Some examples
Kellis et al. Nature, 2004 S. cerevisiae genome Colours reveal Duplications
Kellis et al. Nature, 2004. Speciation; Duplication; Deletion. Actual content of the 2 copies; Reconstruction of the ancestral organization.
Nature Reviews Genetics 3; (2002); SPLITTING PAIRS: THE DIVERGING FATES OF DUPLICATED GENES
Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206. Original version Actual version
Genome duplication. a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on their Ks value being below or higher than 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes. Jaillon et al. Nature 431,
Inter-genome comparisons: base composition, codons, amino acids, ...; degree of conservation between genomes; orthologue determination; families (clusters) of orthologues; gene dictionary; gene conservation profiles; genome tree construction; multiple genome alignments.
Search for similarity
Methods: It is important to know how the algorithms that allow sequence comparisons work. There are many comparison methods. Among the most used: BLAST; FASTA; Smith-Waterman algorithm (dynamic programming method); HMM (Hidden Markov Model).
Sequence Comparisons
V I T K L G T C V G S      V I T K L G T C V G S
V I S . . . T Q V G S      V . S K . G T Q V . S
Identity, Similarity, Homology
Comparison of 2 sequences aims at finding the optimal alignment: the one that best shows which regions are most similar and which are less similar. In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology. What is needed: a score that evaluates matches, mismatches and gaps, and a method that evaluates the numerous possible alignments.
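As an illustration of "a score plus a method", here is a minimal sketch of the Smith-Waterman local alignment score computed by dynamic programming. The scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions chosen for the example; real protein comparisons would use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (score only, no traceback).

    Assumed toy scoring: +match for identical characters, mismatch otherwise,
    and a linear penalty `gap` per gapped position.
    """
    prev = [0] * (len(b) + 1)   # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best

identical = smith_waterman_score("ACGT", "ACGT")   # four matches
unrelated = smith_waterman_score("AAA", "TTT")     # nothing aligns locally
with_gap = smith_waterman_score("ACGT", "ACT")
```

The floor at 0 is what makes the alignment local: a poorly scoring prefix is simply dropped instead of being carried along, as it would be in a global alignment.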
Identity refers to the occurrence of identical nucleotides or amino acids in the same position in aligned sequences. Identity is objective and well defined. Identity can be quantified as a percentage, i.e. the number of identical matches divided by the length of the aligned region.
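A short sketch of that percentage, assuming the two sequences are already aligned, have equal length, and use "-" for gaps. The function name and the second example sequence are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical positions divided by the length of the aligned region, in %.

    Assumes both strings come from the same alignment (equal length) and that
    '-' marks a gap; a gap never counts as an identical match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)

# Hypothetical second sequence differing at two of eleven positions.
identity = percent_identity("VITKLGTCVGS", "VISKLGTQVGS")
```

Note that the denominator matters: dividing by the aligned length, by the shorter sequence or by the longer sequence gives different percentages, so the convention should always be stated.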
Similarity: Sequence similarity takes approximate matches into account, and is meaningful only when such substitutions are scored according to some measure of «difference», with conservative substitutions assigned more favorable scores than non-conservative ones (substitution matrices). Given a number of parameters (alphabet, scoring matrix, filtering procedure, etc.), the similarity of an aligned region is defined by a score calculated on that region. The score depends on the chosen parameters. Contrary to homology, expressions like significant or weak similarity are often used.
Homology: Sequence homology means common ancestry and underlies sequence conservation. Homology can be inferred, under suitable conditions, from sequence similarity. The main objective of sequence similarity searches is to infer homology between sequences. Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like significant or weak homology are meaningless!). Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.
Local Alignment Global Alignment
Compare one query sequence to a BLAST formatted database
Amino acid scoring schemes (substitution matrices). All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids. As a result: what a local alignment program produces depends strongly upon the scores it uses; a scheme may implicitly represent a particular theory of evolution; the choice of a matrix can strongly influence the outcome of an analysis. The scores in the matrix are integer values which assign a positive score to identical or similar character pairs, and a negative value to dissimilar character pairs: $S_{ij} = \ln\left(q_{ij} / (p_i p_j)\right) / u$, where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $u$ is a statistical parameter.
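The log-odds formula can be checked numerically with a small sketch. The frequencies below are invented for illustration, and the scale u = ln(2)/2 is an assumption chosen so that scores come out in half-bit units.

```python
import math

def log_odds_score(q_ij: float, p_i: float, p_j: float, u: float) -> float:
    """S_ij = ln(q_ij / (p_i * p_j)) / u, before rounding to an integer."""
    return math.log(q_ij / (p_i * p_j)) / u

HALF_BIT = math.log(2) / 2   # assumed scale: scores expressed in half bits

# Invented frequencies: a pair seen twice as often as expected by chance...
favoured = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1, u=HALF_BIT)
# ...and a pair seen four times less often than expected by chance.
disfavoured = log_odds_score(q_ij=0.0025, p_i=0.1, p_j=0.1, u=HALF_BIT)

matrix_entries = (round(favoured), round(disfavoured))
```

A pair observed more often than chance predicts gets a positive score and a pair observed less often gets a negative one, which is exactly the sign convention described for the matrix entries.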
Examples of substitution matrices. PAM250 substitution matrix, scale = ln(2)/3; lowest score = -8, highest score = 17. [Table: matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z X *; the scale value, expected score, entropy and matrix entries are not preserved in this transcript.]
BLOSUM62 Clustered Scoring Matrix in 1/2 Bit Units. Cluster percentage: >= 62; lowest score = -4, highest score = 11. [Table: matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V B Z X *; matrix entries are not preserved in this transcript.]
PAM matrices (Dayhoff et al. (1978)). PAM stands for “point accepted mutation”. 1 PAM corresponds to 1 amino acid change per 100 residues (1 PAM ~ 1% divergence); patterns at longer distances are predicted by extrapolation.
Assumptions: replacements are independent of surrounding residues; sequences being compared are of average composition; all sites are equally mutable.
Sources of error: small, globular proteins were used to derive PAM matrices (departure from average composition); errors in PAM1 are magnified up to PAM250; the model does not account for conserved blocks or motifs.
Strategy: PAM40 for short, highly similar alignments; PAM120 for average similarity; PAM250 for longer, weaker local alignments.
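The extrapolation step amounts to multiplying the 1-PAM mutation probability matrix by itself. The sketch below shows this on a toy two-letter alphabet with invented probabilities; the real matrices are 20 x 20 and are converted to log-odds scores afterwards, which is not shown here.

```python
from typing import List

Matrix = List[List[float]]

def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def extrapolate(pam1: Matrix, n: int) -> Matrix:
    """Mutation probabilities at n PAM: pam1 multiplied by itself n times (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result

# Toy two-letter alphabet (assumption): 1% chance of change per position at 1 PAM.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam2 = extrapolate(TOY_PAM1, 2)
toy_pam250 = extrapolate(TOY_PAM1, 250)
```

Because every product reuses the same 1-PAM estimates, any error in them is compounded at each step, which is the "errors in PAM1 are magnified up to PAM250" caveat.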
BLOSUM matrices (Henikoff, S., and Henikoff, J. G. (1992)). BlosumX denotes a matrix obtained from alignments of clustered sequence segments with more than X% identity. Examples: - Blosum62 is obtained from clustered sequences with identity greater than 62%. - Blosum80 is obtained from clustered sequences with identity greater than 80%.
Which substitution matrix to choose?
Less divergent <-----> More divergent
Blosum80    Blosum62    Blosum45
PAM10       PAM120      PAM250
Position Specific Scoring Matrix (PSSM) - Conserved motifs are identified and an amino acid profile matrix is calculated for each motif. - This matrix (n positions x 20 amino acids) represents the relative amino acid probabilities at specific positions and is characteristic of a protein family. - Such matrices are used by profile database searching programs (including PSI-BLAST and HMM-based programs).
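Example of a PSSM determined by the PSI-BLAST program. [Table: columns A R N D C Q E G H I L K M F P S T W Y V; one row per query position, starting at position 1, for the residues M S S S S G L K Q Q G L A Q K K K F Q L E F D I P L; matrix entries are not preserved in this transcript.]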
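To make the "n x 20" profile concrete, here is a simplified sketch that builds a log-odds PSSM from an ungapped motif alignment and scores a segment against it. It is not the PSI-BLAST procedure: the uniform background, the flat pseudocount and the toy motif are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_pssm(motif_alignment: List[str],
               pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """One row per motif position, one log2-odds score per amino acid.

    Simplifications (assumptions): equal-length ungapped sequences, uniform
    background frequency 1/20, and the same pseudocount for every residue.
    """
    background = 1.0 / len(AMINO_ACIDS)
    total = len(motif_alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(motif_alignment[0])):
        counts = Counter(seq[pos] for seq in motif_alignment)
        row = {aa: math.log2(((counts[aa] + pseudocount) / total) / background)
               for aa in AMINO_ACIDS}
        pssm.append(row)
    return pssm

def score_segment(pssm: List[Dict[str, float]], segment: str) -> float:
    """Sum of position-specific scores for a segment as long as the motif."""
    return sum(row[aa] for row, aa in zip(pssm, segment))

toy_motif = ["VITK", "VISK", "VLTK", "VITR"]   # made-up motif instances
toy_pssm = build_pssm(toy_motif)
```

Unlike a substitution matrix, the same residue gets a different score at each position, so a segment matching the conserved columns (here `"VITK"`) scores higher than an unrelated one.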
BLAST algorithm: (1) Query sequence: build a list of high-scoring words of length w. A query sequence of length L contains a maximum of L-w+1 words (w=3); list the words that score at least T using a substitution matrix (Blosum62 or PAM250, ...). (2) Compare the word list to the database (DB) sequences and identify exact matches. (3) For each word match, extend the alignment in both directions to find alignments with scores > S: Maximal Segment Pairs (MSPs), also called HSPs.
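Steps (1) and (2) can be sketched as follows. This is only an illustration of the word-list idea, not BLAST's actual implementation: `toy_score` is a stand-in for a real substitution matrix, and the threshold T, the database sequence and all function names are assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

ALPHABET = "ARNDCQEGHILKMFPSTWYV"

def toy_score(x: str, y: str) -> int:
    """Stand-in for a substitution matrix (assumption): +5 identical, -1 otherwise."""
    return 5 if x == y else -1

def query_words(query: str, w: int = 3) -> List[str]:
    """All L-w+1 overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]

def neighborhood(word: str, T: int) -> List[str]:
    """Every word of the same length scoring at least T against `word`."""
    return ["".join(cand) for cand in product(ALPHABET, repeat=len(word))
            if sum(toy_score(a, b) for a, b in zip(word, cand)) >= T]

def seed_hits(query: str, db_seq: str, w: int = 3, T: int = 9) -> List[Tuple[int, int]]:
    """(query_position, db_position) pairs where a listed word occurs exactly in db_seq."""
    positions: Dict[str, List[int]] = {}
    for j in range(len(db_seq) - w + 1):
        positions.setdefault(db_seq[j:j + w], []).append(j)
    hits = []
    for i, word in enumerate(query_words(query, w)):
        for neighbor in neighborhood(word, T):
            hits.extend((i, j) for j in positions.get(neighbor, []))
    return sorted(hits)

words = query_words("VITKLGTCVGS")                # 11 - 3 + 1 words
hits = seed_hits("VITKLGTCVGS", "AAVISKLGTQ")     # hypothetical database sequence
```

With this toy scoring, T = 9 accepts the word itself plus every word differing at one position. Each seed hit would then be extended in both directions (step 3) and kept only if the extended alignment scores above S.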
E-values: Statistics of HSP scores are characterized by two parameters, K and $\lambda$. The expected number of HSPs with score at least S is given by $E = K m n e^{-\lambda S}$ (Karlin & Altschul, 1990), where m and n are sequence lengths and E is the E-value for the score S.
Bit scores: $S' = (\lambda S - \ln K)/\ln 2$. The E-value corresponding to a given bit score is $E = m n 2^{-S'}$ (note mn).
P-values: The probability of finding exactly a HSPs with score >= S is given by $P(a) = e^{-E} E^{a}/a!$ (Poisson distribution), where E is the E-value of S given by the above equation. The probability of finding zero HSPs with score >= S is $P(0) = e^{-E}$, so the probability of finding at least one such HSP is $P = 1 - e^{-E}$.
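These relations are easy to check numerically. In the sketch below, `lam` stands for the Karlin-Altschul parameter written λ, and the values of K, `lam`, m, n and S are illustrative assumptions, not parameters taken from the slides.

```python
import math

def e_value(S: float, m: int, n: int, K: float, lam: float) -> float:
    """Expected number of HSPs with score >= S: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * S)

def bit_score(S: float, K: float, lam: float) -> float:
    """Normalized score S' = (lam*S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)

def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E = m*n*2^(-S'): K and lam are already absorbed into the bit score."""
    return m * n * 2.0 ** (-bits)

def p_value(E: float) -> float:
    """Probability of at least one HSP with score >= S: P = 1 - exp(-E)."""
    return -math.expm1(-E)   # same as 1 - exp(-E), but accurate for tiny E

# Illustrative values only (assumptions).
K, lam = 0.041, 0.267
m, n, S = 250, 5_000_000, 100

E_raw = e_value(S, m, n, K, lam)
E_bits = e_value_from_bits(bit_score(S, K, lam), m, n)
P = p_value(E_raw)
```

Both routes give the same E-value, which is why bit scores can be compared across searches without knowing K and λ. For small E, P and E are almost equal, so the two are often quoted interchangeably for strong hits.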
Large-scale proteome comparisons
The expected number of HSPs with score at least S is given by $E = K m n e^{-\lambda S}$, where m and n are the sequence and database lengths.
Systematic Analysis of Completely Sequenced Organisms: In silico species-specific comparisons; Degree of ancestral duplication and of ancestral conservation between pairs of species; Families of paralogs (Partition-MCL); Families of orthologs (Partition-MCL); Determination of the protein dictionary (orthologs); Determination of protein conservation profiles.
Homologs - Paralogs - Orthologs. Homologs: A1, B1, A2, B2. Paralogs: A1 vs B1 and A2 vs B2. Orthologs: A1 vs A2 and B1 vs B2. [Figure: in the ancestor, a duplication produces genes A and B; after speciation and evolution, Species-1 (S1) carries A1 and B1 and Species-2 (S2) carries A2 and B2; the relationships are inferred by sequence analysis.]
Orthologs - inparalogs - outparalogs. Sequence similarities between orthologs and between in-paralogs should be larger than those between out-paralogs; orthology assignments should be consistent among several genome pairs; orthologues should be present in syntenic order. Kuzniar A, van Ham RC, Pongor S, Leunissen JA. (2008). The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24(11): Review.
Altenhoff AM, Dessimoz C. (2009). Phylogenetic and functional assessment of orthologs inference projects and methods. PLoS Comput Biol. 5(1):e
Kuzniar A, van Ham RC, Pongor S, Leunissen JA. (2008). The quest for orthologs: finding the corresponding gene across genomes. Trends Genet. 24(11): Review.
Gabaldon T. (2008). Large-scale assignment of orthology: back to phylogenetics? Genome Biol. 9(10):235.
Moreno-Hagelsieb G, Latimer K. (2008). Choosing BLAST options for better detection of orthologs as reciprocal best hits. Bioinformatics. 3 :
Chen F, Mackey AJ, Vermunt JK, Roos DS. (2007). Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS ONE. 2:e383.
Goodstadt L, Ponting CP. (2006). Phylogenetic reconstruction of orthology, paralogy, and conserved synteny for dog and human. PLoS Comput Biol. 2:e133.
Working Examples Comparing S. cerevisiae (SC) genome with C. elegans (CE) genome
Conclusion: Large-scale analyses of completely sequenced genomes allow a systematic view of genes, genome organization, and their macro- as well as micro-evolution. They are the starting step for the further evolutionary analyses that will be dealt with during this course.