Nothing in (computational) biology makes sense except in the light of evolution (after Theodosius Dobzhansky, 1970). Comparative genomics, genome context and genome annotation



Genome context analysis and genome annotation

Using information other than homologous relationships between individual genes/proteins for functional prediction (guilt by association).

Types of context analysis:
- phyletic patterns
- domain fusion (“Rosetta Stone” proteins)
- gene order conservation
- co-expression
- …
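A minimal sketch of guilt by association using phyletic patterns, under stated assumptions: each gene family is described by the set of genomes in which it occurs, and families that share exactly the same presence/absence pattern are flagged as candidates for a functional link. The family names, genome codes and data below are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: gene family -> genomes in which it is present.
presence = {
    "famA": {"Eco", "Bsu", "Mja"},
    "famB": {"Eco", "Bsu", "Mja"},
    "famC": {"Eco", "Hin"},
    "famD": {"Eco", "Bsu"},
}
genomes = ["Eco", "Hin", "Bsu", "Mja"]  # fixed column order for the patterns


def phyletic_pattern(present_in, genome_order):
    """Encode presence/absence as a string of 1s and 0s in a fixed genome order."""
    return "".join("1" if g in present_in else "0" for g in genome_order)


def group_by_pattern(presence, genome_order):
    """Group families that share exactly the same phyletic pattern."""
    groups = defaultdict(list)
    for family, present_in in sorted(presence.items()):
        groups[phyletic_pattern(present_in, genome_order)].append(family)
    return dict(groups)


groups = group_by_pattern(presence, genomes)
# Keep only patterns shared by two or more families.
linked = {pattern: fams for pattern, fams in groups.items() if len(fams) > 1}
print(linked)
```

Exact pattern identity is the strictest possible criterion; a real analysis would probably tolerate a few mismatches.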

COGs

Goals:
- Using gene sets from complete genomes, delineate families of orthologs and paralogs: Clusters of Orthologous Groups (of genes) (COGs)
- Using COGs, develop an engine for functional annotation of new genomes
- Apply COGs for analysis of phylogenetic patterns

COG: a group of homologous proteins such that all proteins from different species are orthologs (all proteins from the same species in a COG are paralogs)

CONSTRUCTION OF COGs FOR 8 COMPLETE GENOMES

Input: complete set of proteins from the analyzed genomes
1. Full self-comparison (BLASTPGP, no cut-off)
2. Collapse obvious paralogs
3. Detect all interspecies Best Hits (BeTs) between individual proteins or groups of paralogs
4. Detect all triangles of consistent BeTs
5. Merge triangles with common edges
6. Detect groups with multidomain proteins and isolate domains
Repeat steps 4 and 5
Output: COGs
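The BeT detection step can be sketched as follows. This is an illustration under assumptions, not the original COG software: the score table, the protein identifiers and the `species_of` naming scheme are hypothetical, and no score cut-off is applied, in line with the slide.

```python
# Hypothetical BLAST-like scores: (query, subject) -> score; higher is better.
scores = {
    ("eco_1", "hin_1"): 250.0, ("eco_1", "hin_2"): 90.0,
    ("eco_1", "bsu_1"): 180.0,
    ("hin_1", "eco_1"): 245.0, ("hin_1", "bsu_1"): 170.0,
    ("bsu_1", "eco_1"): 175.0, ("bsu_1", "hin_1"): 168.0,
    ("hin_2", "eco_1"): 88.0,
}


def species_of(protein_id):
    """Assumed naming scheme: the species code precedes the first underscore."""
    return protein_id.split("_", 1)[0]


def best_hits(scores):
    """Return {(query, target_species): subject}, the top-scoring subject per species."""
    best = {}
    for (query, subject), score in scores.items():
        target = species_of(subject)
        if target == species_of(query):
            continue  # BeTs are interspecies hits only
        key = (query, target)
        if key not in best or score > scores[(query, best[key])]:
            best[key] = subject
    return best


bets = best_hits(scores)
for (query, target), subject in sorted(bets.items()):
    print(f"{query} -> {subject} (best hit in {target})")
```

Note that `hin_2` has `eco_1` as its best hit while the reverse is not true; BeTs are not required to be symmetric at this stage.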

A TRIANGLE OF BeTs IS A MINIMAL, ELEMENTARY COG
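A small sketch of triangle detection, assuming a simplification that the slide does not state: a BeT in either direction is treated as an undirected edge, and a triangle is any three proteins from three different species that are pairwise connected. The identifiers and the `bets` table are hypothetical.

```python
from itertools import combinations

# Hypothetical BeTs: query protein -> its best hits in other species.
bets = {
    "eco_1": ["hin_1", "bsu_1"],
    "hin_1": ["eco_1", "bsu_1"],
    "bsu_1": ["eco_1"],
    "mja_1": ["eco_1"],
}


def species_of(protein_id):
    """Assumed naming scheme: the species code precedes the first underscore."""
    return protein_id.split("_", 1)[0]


def undirected_edges(bets):
    """Treat a BeT in either direction as an undirected edge (a simplification)."""
    return {frozenset((q, s)) for q, subjects in bets.items() for s in subjects}


def find_triangles(bets):
    """Return all triples of pairwise-connected proteins from three species."""
    edges = undirected_edges(bets)
    proteins = sorted({p for edge in edges for p in edge})
    triangles = []
    for a, b, c in combinations(proteins, 3):
        if len({species_of(a), species_of(b), species_of(c)}) < 3:
            continue  # a triangle must span three different species
        if all(frozenset(pair) in edges for pair in ((a, b), (a, c), (b, c))):
            triangles.append((a, b, c))
    return triangles


triangles = find_triangles(bets)
print(triangles)
```

The brute-force scan over all triples is cubic in the number of proteins, which is fine for a toy example but not for complete genomes.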

A RELATIVELY SIMPLE COG PRODUCED BY MERGING ADJACENT TRIANGLES
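Merging triangles that share an edge amounts to finding connected components among triangles. The sketch below uses a union-find structure, which is a choice made here for illustration rather than a description of the original implementation; the triangles are hypothetical.

```python
from itertools import combinations

# Hypothetical triangles of BeTs.
triangles = [
    ("eco_1", "hin_1", "bsu_1"),
    ("eco_1", "hin_1", "mja_1"),  # shares the edge eco_1-hin_1 with the first
    ("eco_2", "hin_2", "bsu_2"),  # shares nothing with the others
    ("eco_1", "syn_1", "aae_1"),  # shares only a vertex, so it is not merged
]


def merge_triangles(triangles):
    """Merge triangles that have a common edge into clusters of proteins."""
    parent = list(range(len(triangles)))

    def find(i):
        # Find the representative of i, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}  # edge -> index of the first triangle containing it
    for idx, tri in enumerate(triangles):
        for edge in combinations(sorted(tri), 2):
            if edge in edge_owner:
                parent[find(idx)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = idx

    clusters = {}
    for idx, tri in enumerate(triangles):
        clusters.setdefault(find(idx), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())


cogs = merge_triangles(triangles)
for members in cogs:
    print(members)
```

Requiring a shared edge rather than a shared vertex keeps unrelated families apart when a single protein happens to be a best hit for several of them.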

A COMPLEX COG WITH MULTIPLE PARALOGS

Current status of the COGs

Prokaryotes: 11 Archaea + 1 unicellular eukaryote + 46 bacteria = 58 complete genomes
149,321 proteins; 105,861 proteins in 4075 COGs (71%)

Eukaryotes: 4 animals + 1 plant + 2 fungi + 1 microsporidium = 8 complete genomes
142,498 proteins; 74,093 proteins in 4822 COGs (52%)
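The coverage percentages follow directly from the protein counts on the slide. A short check, with function and variable names chosen here for illustration; the average COG size is a derived figure, not one stated in the source:

```python
def cog_coverage(proteins_in_cogs, total_proteins):
    """Percentage of all proteins that were assigned to a COG."""
    return 100.0 * proteins_in_cogs / total_proteins


# Counts taken from the slide.
prokaryotic = cog_coverage(105_861, 149_321)
eukaryotic = cog_coverage(74_093, 142_498)
print(f"prokaryotic set: {prokaryotic:.0f}%")
print(f"eukaryotic set: {eukaryotic:.0f}%")

# Derived here: average number of proteins per COG in each set.
prokaryotic_cog_size = 105_861 / 4075
eukaryotic_cog_size = 74_093 / 4822
print(f"{prokaryotic_cog_size:.1f} and {eukaryotic_cog_size:.1f} proteins per COG")
```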

COGnitor...

…IN ACTION

The Universal COGs

Search for genomic determinants of hyperthermophily

Search for unique archaeo-eukaryotic genes

A complementary pattern: search for unique bacterial genes

Essential function… but holes in the phyletic pattern
Strict complementary pattern
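One way to read a strict complementary pattern is that every genome contains exactly one of two COGs, so that the holes in one phyletic pattern are filled by the other. A minimal sketch under that assumption, with patterns written as hypothetical strings of 1s and 0s over a fixed genome order:

```python
def is_strictly_complementary(pattern_a, pattern_b):
    """True if every genome has exactly one of the two families."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same genomes")
    return all(a != b for a, b in zip(pattern_a, pattern_b))


# Hypothetical phyletic patterns over six genomes.
cog_x = "110100"
cog_y = "001011"  # present exactly where cog_x is absent
cog_z = "011011"  # overlaps with cog_x in the second genome

print(is_strictly_complementary(cog_x, cog_y))
print(is_strictly_complementary(cog_x, cog_z))
```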

Relaxed complementary pattern
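The slide does not define how the pattern is relaxed. One plausible reading, offered only as an assumption, is to tolerate a small number of genomes that contain both COGs or neither. The thresholds `max_both` and `max_neither` below are hypothetical parameters.

```python
def complementarity_counts(pattern_a, pattern_b):
    """Count genomes that have both families and genomes that have neither."""
    pairs = list(zip(pattern_a, pattern_b))
    both = sum(a == "1" and b == "1" for a, b in pairs)
    neither = sum(a == "0" and b == "0" for a, b in pairs)
    return both, neither


def is_relaxed_complementary(pattern_a, pattern_b, max_both=1, max_neither=0):
    """Complementary up to a tolerated number of overlaps and gaps (assumed criterion)."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same genomes")
    both, neither = complementarity_counts(pattern_a, pattern_b)
    return both <= max_both and neither <= max_neither


# Hypothetical phyletic patterns over six genomes.
cog_x = "110100"
cog_z = "011011"  # both families are present in the second genome

print(complementarity_counts(cog_x, cog_z))
print(is_relaxed_complementary(cog_x, cog_z))
print(is_relaxed_complementary(cog_x, cog_z, max_both=0))
```

Setting both thresholds to zero recovers the strict criterion.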

Relaxed complementary pattern with extra restrictions

Conservation of gene order in bacterial species of the same genus: M. genitalium vs M. pneumoniae

Conservation of gene order in closely related bacterial genera: C. trachomatis vs C. pneumoniae

Lack of gene order conservation, even in “closely related” bacteria of the same Proteobacterial subdivision: P. aeruginosa vs E. coli

Genome Alignments - Method

- Protein sets from complete genomes
- BLAST cross-comparison
- Table of Hits
- Pairwise Genome Alignment: local alignment algorithm Lamarck (gap opening penalty, gap extension penalty); statistics with Monte Carlo simulations
- Template-Anchored Genome Alignment
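To make the notion of a conserved gene string concrete, here is a deliberately simplified sketch. It is not the Lamarck algorithm: it allows no gaps, uses no penalties and does no Monte Carlo statistics. It only reports maximal runs of adjacent genes in one genome whose counterparts are also adjacent, in a consistent direction, in the other genome. The gene orders and the ortholog table are hypothetical.

```python
def conserved_strings(order1, order2, orthologs, min_len=2):
    """Find gapless runs of genes in order1 whose orthologs are adjacent in order2."""
    pos2 = {gene: i for i, gene in enumerate(order2)}
    strings = []
    current, direction = [], 0

    for gene in order1:
        p = pos2.get(orthologs.get(gene))  # None if the gene has no mapped ortholog
        if current and p is not None:
            step = p - pos2[orthologs[current[-1]]]
            # Extend the run only if the partner is adjacent and the direction is kept.
            if step in (1, -1) and direction in (0, step):
                current.append(gene)
                direction = step
                continue
        # The run is broken: store it if long enough, then start a new one.
        if len(current) >= min_len:
            strings.append(current)
        current, direction = ([gene], 0) if p is not None else ([], 0)

    if len(current) >= min_len:
        strings.append(current)
    return strings


# Hypothetical gene orders and ortholog assignments (a4 has no ortholog).
order1 = ["a1", "a2", "a3", "a4", "a5", "a6", "a7"]
order2 = ["b3", "b2", "b1", "b9", "b5", "b6", "b7"]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3",
             "a5": "b5", "a6": "b6", "a7": "b7"}

strings = conserved_strings(order1, order2, orthologs)
print(strings)
```

The first string is inverted in the second genome and the second is collinear; both count as conserved under this definition.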

Genome Alignments - Statistics

Distribution of conserved gene string lengths

Genome Alignments - Statistics

Pairwise alignments   No. strings   No. genes   % in Gen1   % in Gen2
all homologs
  ecoli-hinf                                                33%
  ecoli-bsub          89            322         8%          8%
  ecoli-mjan          10            30          1%          2%
probable orthologs
  ecoli-hinf                                                28%
  ecoli-bsub          34            168         4%          4%
  ecoli-mjan          12            33          1%          2%

Genome Alignments - Statistics

Breakdown of genes in the genome:
- Not in gene strings
- In non-conserved gene strings (directons)
- In conserved gene strings
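A sketch of how such a breakdown could be tallied, under an assumption the slide does not spell out: a directon is taken here to be a maximal run of at least two adjacent genes on the same strand. The genome, the strands and the set of genes in conserved strings are hypothetical.

```python
from collections import Counter
from itertools import groupby

# Hypothetical genome: (gene, strand) in chromosomal order.
genome = [("g1", "+"), ("g2", "+"), ("g3", "+"), ("g4", "-"),
          ("g5", "+"), ("g6", "+"), ("g7", "-"), ("g8", "-")]
conserved = {"g1", "g2"}  # genes already found in conserved gene strings (assumed given)


def breakdown(genome, conserved, min_run=2):
    """Count genes in each of the three categories shown on the slide."""
    labels = {}
    # groupby splits the genome into maximal runs of genes on the same strand.
    for _, run in groupby(genome, key=lambda item: item[1]):
        genes = [gene for gene, _ in run]
        for gene in genes:
            if gene in conserved:
                labels[gene] = "in conserved gene strings"
            elif len(genes) >= min_run:
                labels[gene] = "in non-conserved gene strings (directons)"
            else:
                labels[gene] = "not in gene strings"
    return Counter(labels.values())


counts = breakdown(genome, conserved)
for category, n in counts.most_common():
    print(f"{category}: {n} of {len(genome)} genes")
```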

Genome Alignments - Statistics

Fraction of the genome in conserved gene strings, from template-anchored alignments:

Minimum   Synechocystis sp.        5%
          Aquifex aeolicus         10%
          Archaeoglobus fulgidus   13%
          Escherichia coli         14%
          Treponema pallidum       17%
Maximum   Thermotoga maritima      23%
          Mycoplasma genitalium    24%

Context-Based Prediction of Protein Functions

A Novel Translation Factor (COG0536)
Figure labels: L21; L27; GTPase?; GTP-binding translation factor

Context-Based Prediction of Protein Functions

A Novel Translation Factor (COG0012)
Figure labels: TGS domain containing GTPase?; Peptidyl-tRNA hydrolase; GTP-binding translation factor
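The reasoning behind these context-based predictions can be sketched as a simple vote: an uncharacterized gene inherits the functional category most common among its annotated neighbors in conserved gene strings. The gene strings, the annotations and the equal-weight voting scheme below are hypothetical and meant only to illustrate the idea.

```python
from collections import Counter

# Hypothetical conserved gene strings from several genomes.
gene_strings = [
    ["L21", "L27", "unknown_gtpase"],
    ["L21", "L27", "unknown_gtpase", "other_gene"],
    ["unknown_gtpase", "L27"],
    ["L21", "other_gene"],  # does not contain the query, so it is ignored
]
# Hypothetical functional categories of the annotated genes.
annotation = {
    "L21": "translation",
    "L27": "translation",
    "other_gene": "coenzyme metabolism",
}


def predict_by_context(query, gene_strings, annotation):
    """Rank functional categories by how often they occur next to the query gene."""
    votes = Counter()
    for string in gene_strings:
        if query not in string:
            continue
        for gene in string:
            if gene != query and gene in annotation:
                votes[annotation[gene]] += 1
    return votes.most_common()


ranking = predict_by_context("unknown_gtpase", gene_strings, annotation)
print(ranking)
```

A prediction of this kind is a hypothesis about the process a gene takes part in, which is why the slide labels carry question marks.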