Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)


Genome analysis
Bioinformatics

Contents
- Genome annotation
- Comparative genomics
  - Phylogenetic profiles
  - Gene fusion analysis
  - Phylogenetic footprinting

From sequences to genomes

From sequences to genomes
Before the 1990s, DNA sequencing required a considerable investment of human work: a PhD student could spend a significant fraction of a thesis sequencing a single gene.
Genome projects stimulated the development of automated sequencing methods and led to important technological improvements.
There are currently (2008) several hundred publicly available, fully sequenced genomes.
- The NCBI genome distribution (ftp://ftp.ncbi.nih.gov/genomes/) contains
  - >650 prokaryotes (Bacteria and Archaea)
  - Insects (Drosophila melanogaster, Apis mellifera)
  - Plants (Arabidopsis thaliana, rice, maize)
  - A worm (Caenorhabditis elegans)
  - Some fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, ...)
  - Some mammals (Homo sapiens, Mus musculus, Rattus norvegicus)
- Other genome centres give access to other genomes.
  - ENSEMBL maintains many vertebrate genomes
  - UCSC (http://genome.ucsc.edu/) maintains metazoan genomes, including insects
  - Sanger Institute
  - Integr8: ~800 genomes
Many other genomes were sequenced by commercial companies and are not available to the public.

Gene organization Source: Mount (2000)

Gene function
After genes have been localized on the sequence, their function has to be predicted.
Some genes were already characterized before the genome project, but these are generally a minority of the genes found in the genome.
For the majority of genes, function is predicted from similarities between the sequence of the newly sequenced gene and previously characterized genes (function assignment by sequence similarity).
Example: yeast genome (1996): there are still 2500 genes (39%) whose function is completely unknown. However
- Yeast is among the best-known model organisms (genetics, molecular biology).
- The full genome has been available since 1996.
When the first draft of the human genome was published, 60% of the predicted genes were of unknown function.

Example of a similarity match (alignment output; column spacing was lost in the transcript):
>PHO4,SPBC428.03C : THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR : Q01682;Q9UU70;
Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query: 9 ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
+LAAS+V+AG S + + LG Y+ P G + PESC +KQ
Sbjct: 10 LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query: 69 VQMVGRHGERYPT VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
V ++ RHG R PT VS A+ I KL N G S+ + F T
Sbjct: 63 VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query: 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ
E S + G + R +Y Y T+ R D+A+
Sbjct: 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query: 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
+F G+ GD + + E +SAGAN+L+ ++SCP ++D+ D+ + +
Sbjct: 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query: 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
+L IA RLNK + G NLT SD + + C YEI R SD C++FT E + F Y D
Sbjct: 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query: 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
L+ Y GP + ++G N L++ + D+KV+L+FTHD+ I+ +G
Sbjct: 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query: 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
D +T EH +P +N F S +VP + TE F CS N YVR+++N V P
Sbjct: 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query: 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW
C GP + CE N ++ST +T ++
Sbjct: 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463
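The "Identities" and "Gaps" figures reported with such a match can be recomputed from the two aligned rows. The sketch below is only an illustration: the function name `alignment_stats` and the toy sequences are assumptions, not part of the original slides, and the "Positives" count is not reproduced because it depends on a substitution matrix.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identities and gaps in a pairwise alignment.

    Both strings must be the aligned rows (same length), with '-' marking a gap.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(query)
    # A column is an identity when both residues are equal and neither is a gap.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A column counts as a gap when either row holds a gap character.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length if length else 0.0,
    }


# Toy example (hypothetical sequences): 4 identical columns out of 6, 1 gap.
stats = alignment_stats("ACD-EF", "ACDGQF")
```

With the numbers from the match above, 138 identical columns out of 473 give 100 * 138 / 473, which rounds to the reported 29%.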

Some milestones

Genes and genome size
- In prokaryotes, the number of genes increases linearly with genome size.
- In eukaryotes, this is not the case: the genome size increases faster than the number of genes.

Genes and genome size
Beware: the axes are logarithmic. This plot represents the same data as the previous one, but on a logarithmic scale, so that mammals are visible as well.

Gene spacing
Gene spacing increases considerably with the complexity of the organisms.
Note: the X axis is logarithmic, not the Y axis -> the increase appears roughly exponential.

Proportion of intergenic regions Beware: the X axis is logarithmic. The proportion of intergenic regions increases with the complexity of an organism. In addition (not shown here), introns represent an increasing fraction of the genome. For example, the exonic fraction represents <5% of the human genome.

Protein size versus genome size
- Protein sequences are shorter in prokaryotes than in eukaryotes.
- Among eukaryotes, the increase in genome size is not correlated with an increase in protein size -> higher eukaryotes have a much larger genome than fungi, without an increase in protein size.

Genome annotation

Gene prediction
Starting from a completely sequenced genome, predict the positions of genes.
Elements of prediction
- Open Reading Frames
  - Start and stop codons, separated by a continuous set of non-stop codons.
- Region content
  - Hexanucleotide composition
  - Codon adaptation index (CAI).
- Signals
  - In prokaryotes: Shine-Dalgarno boxes.
  - In eukaryotes: intron/exon boundary elements (splicing signals).
- Similarity with known genes.
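The first element, open reading frames, can be illustrated with a minimal scan. This is a sketch under simplifying assumptions, not a real gene predictor: it only scans the three forward frames, uses ATG as the sole start codon and the standard stop codons, ignores introns, and the names `find_orfs` and `min_codons` are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"                  # assumption: alternative starts are ignored


def find_orfs(sequence: str, min_codons: int = 2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons is the minimal number of codons before the stop (start included).
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, pos + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (hypothetical): ATG AAA TTT TAG starts at position 2, in frame 2.
orfs = find_orfs("CCATGAAATTTTAGGG")
```

A complete scan would also process the reverse complement, and in practice a length threshold much larger than the default used here is needed to limit false predictions.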

Gene prediction - limitations
Typical problems:
- Gene prediction programs are trained for a specific organism and can give very poor results with other organisms (e.g., the first rounds of annotation of A. thaliana were done with programs trained for mammals).
- Any gene prediction program will unavoidably predict false genes and miss some true genes.
- The prediction of intron/exon boundaries is particularly difficult.
- For prokaryotes, the predicted start codons are sometimes imprecise.
Example: genome of the yeast Saccharomyces cerevisiae
- For the yeast genome, the gene detection protocol used in 1996 was over-predictive.
- The program essentially relied on ORFs and predicted 6400 genes.
- Some researchers estimated that ~1,000 ORFs might be false predictions.
- Since 1996, the reality of the predicted genes has been tested by combining several methods of functional genomics (expression studies, mutant phenotypes, comparative genomics between closely related species, ...).
- A few hundred of the initially predicted genes have been removed from the annotations.

Non-coding genes
There are many types of non-coding genes
- tRNA: transfer RNA
- rRNA: ribosomal RNA
- snRNA: small nuclear RNA (elements of the spliceosome)
- snoRNA: small nucleolar RNA (methylation guides)
- ...
Detection of non-coding RNA
- Non-coding RNA genes are generally transcribed by RNA polymerases I and III and have different promoters.

Annotation of gene function
Once a genomic region has been predicted to contain a gene, the next step is to predict the function of this gene.
The translated product is compared with all known proteins, and a putative function can be assigned on the basis of high-similarity matches.
Problems
- Sequence similarity is not always sufficient to infer the same function.
- Where to put the threshold?
- Some proteins might have similar functions with different sequences (convergent evolution).
- Once a gene has been assigned a putative function, this annotation will be used to assign the same function to other genes -> propagation of errors.
We should thus be aware that gene annotations have to be taken with caution.
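The threshold question can be made concrete with a small sketch of annotation transfer from the best similarity match. Everything here is an assumption chosen for illustration: the function name `transfer_annotation`, the hit format, and the default cut-offs (E-value 1e-10, 40% identity) are arbitrary and not recommended values.

```python
from typing import Optional


def transfer_annotation(hits, max_evalue: float = 1e-10,
                        min_identity: float = 40.0) -> Optional[str]:
    """Assign a putative function from the best acceptable similarity hit.

    hits: list of dicts with keys 'function', 'evalue', 'identity' (percent).
    Returns None when no hit passes both thresholds, in which case the gene
    would remain a "hypothetical protein".
    """
    accepted = [h for h in hits
                if h["evalue"] <= max_evalue and h["identity"] >= min_identity]
    if not accepted:
        return None
    best = min(accepted, key=lambda h: h["evalue"])  # lowest E-value wins
    # "putative" marks the annotation as inferred, not experimentally verified.
    return "putative " + best["function"]


# A hit with the statistics of the acid phosphatase match shown earlier
# (E-value 1e-40, 29% identity).
hits = [{"function": "acid phosphatase", "evalue": 1e-40, "identity": 29.0}]
strict = transfer_annotation(hits)                      # fails the identity cut-off
relaxed = transfer_annotation(hits, min_identity=25.0)  # passes
```

The same hit is rejected or accepted depending on the identity cut-off, which is exactly the ambiguity raised above. Keeping the "putative" label also helps to avoid treating a transferred annotation as evidence when annotating further genes.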

Genes with unknown function
When genomes of model organisms were sequenced, about 40% of the predicted genes could not be associated with any known function.
These genes are annotated as "hypothetical proteins".
Note
- In the yeast genome, many of these hypothetical proteins have been removed from the annotations since 1996, because they were false predictions.

Comparative genomics

Phylogenetic footprinting
One of the main reasons for sequencing the mouse genome was to detect regions conserved between mouse and human, which reveal exons and regulatory regions.
- The fact that an unknown gene is found in different genomes gives more confidence in the existence of this gene.
Another important goal was to detect conserved regions in non-coding regions.
- On the basis of a few known cases, it has been shown that conserved non-coding regions contain a high concentration of regulatory elements.
- The detection of conserved non-coding sequences thus gives indications about regions potentially involved in regulation.
- Such conserved regions are called phylogenetic footprints.
[Figure: Genome 1 aligned with Genome 2, showing a conserved non-coding region and a conserved exon]
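One simple way to look for conserved regions is to slide a window along a pairwise alignment of orthologous regions and keep the windows with a high fraction of identical columns. The sketch below assumes the two sequences are already aligned; the function name, the window size and the identity threshold are illustrative choices, not values from the slides.

```python
def conserved_windows(aln1: str, aln2: str, window: int = 10,
                      min_identity: float = 0.8):
    """Return the start positions of windows with enough identical columns.

    aln1 and aln2 are the two rows of a pairwise alignment ('-' marks a gap).
    A gap column never counts as a match.
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have the same length")
    matches = [a == b and a != "-" for a, b in zip(aln1.upper(), aln2.upper())]
    starts = []
    for i in range(len(matches) - window + 1):
        # Fraction of identical columns inside the current window.
        if sum(matches[i:i + window]) / window >= min_identity:
            starts.append(i)
    return starts


# Toy alignment (hypothetical): 10 conserved columns followed by 10 mismatches.
genome1 = "ACGTACGTAC" + "TTTTTTTTTT"
genome2 = "ACGTACGTAC" + "GGGGGGGGGG"
footprints = conserved_windows(genome1, genome2)
```

Overlapping windows that pass the threshold would normally be merged into a single conserved region; this step is left out to keep the sketch short.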

Phylogenetic profiles
- For each gene of the query genome (e.g. E. coli), orthologs are searched in all the sequenced genomes.
- Each gene is characterized by a profile of presence/absence in all the sequenced genomes.
- Groups of genes having similar phylogenetic profiles are likely to be functionally related.
Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8).
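The profile comparison can be sketched as follows, assuming the ortholog search has already been done and summarized as one presence/absence vector per gene. The gene names, the five-genome profiles and the use of the Hamming distance are assumptions made for this illustration; other similarity measures are possible.

```python
from itertools import combinations


def hamming(p, q):
    """Number of genomes in which two presence/absence profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def similar_profile_pairs(profiles: dict, max_distance: int = 0):
    """Return pairs of genes whose profiles differ in at most max_distance genomes."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        if hamming(p1, p2) <= max_distance:
            pairs.append((g1, g2))
    return pairs


# Hypothetical profiles over five reference genomes (1 = ortholog present).
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 0, 1, 0, 0),
}
identical = similar_profile_pairs(profiles)                 # exact same profile
close = similar_profile_pairs(profiles, max_distance=1)     # one difference allowed
```

Genes present in all genomes, or in none, have uninformative profiles and would usually be excluded before the comparison.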

Gene fusion analysis
- It is quite frequent to observe that two genes of a given organism are fused into a single gene in another organism.
- Fusions between more than 2 genes are occasionally observed.
- Fused genes are likely to be functionally related.
[Figure, example 1: query genome E. coli, 5 components (A, B, C, D, E); reference genome yeast, 1 composite (C^D^A^B^E)]
[Figure, example 2: query genome E. coli, 2 components (A, B); reference genomes B. subtilis and H. pylori, 1 composite each (A^B)]
References
- Marcotte, et al. (1999). Science 285(5428).
- Marcotte, et al. (1999). Nature 402(6757).
- Enright, et al. (1999). Nature 402(6757).
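A minimal sketch of the detection step: two distinct query genes are reported as components when they match non-overlapping segments of the same reference protein (the composite). The hit format, the function name `find_fusions` and the example coordinates are assumptions; the cited papers apply additional filters that are not reproduced here.

```python
from itertools import combinations


def find_fusions(hits):
    """Detect candidate gene fusions from similarity hits.

    hits: list of (query_gene, reference_protein, ref_start, ref_end), with
    1-based inclusive coordinates of the matched segment on the reference.
    Returns (component1, component2, composite) tuples.
    """
    by_ref = {}
    for query, ref, start, end in hits:
        by_ref.setdefault(ref, []).append((query, start, end))
    fusions = []
    for ref, matches in sorted(by_ref.items()):
        for (q1, s1, e1), (q2, s2, e2) in combinations(sorted(matches), 2):
            # Distinct genes matching non-overlapping parts of the reference.
            if q1 != q2 and (e1 < s2 or e2 < s1):
                fusions.append((q1, q2, ref))
    return fusions


# Hypothetical hits: A and B cover separate halves of one reference protein,
# whereas C and D overlap on another one (more likely paralogs than a fusion).
hits = [
    ("A", "AB_composite", 1, 200),
    ("B", "AB_composite", 210, 450),
    ("C", "other", 1, 100),
    ("D", "other", 50, 150),
]
fusions = find_fusions(hits)
```

Requiring non-overlapping segments is the key design choice: two query genes that match the same region of a reference protein are similar to each other, which does not indicate a fusion.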

Conclusion

The genome challenge
Despite the availability of several hundred genomes, we are far from understanding the organization and function of a single genome.
In particular, a lot of work remains to be done to decipher the genomes of higher organisms.
The genome sequence by itself is far from sufficient for this.
Since 1997, several high-throughput methods have been invented to give complementary information about gene function (see courses on transcriptome, proteome and interactome).

Some milestones