Bioinformatics as Hard Disk Investigation Assuming you can read all the bits on a 1000 year old hard drive Can you figure out what does what? - Distinguish.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

The Central Dogma DNA  RNA  Protein  Function Replication
Biological Motivation Gene Finding
© Wiley Publishing All Rights Reserved. Using Nucleotide Sequence Databases.
An Introduction to Bioinformatics Finding genes in prokaryotes.
PREDetector : Prokaryotic Regulatory Element Detector Samuel Hiard 1, Sébastien Rigali 2, Séverine Colson 2, Raphaël Marée 1 and Louis Wehenkel 1 1 Bioinformatics.
CSCE555 Bioinformatics Lecture 3 Gene Finding Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Ab initio gene prediction Genome 559, Winter 2011.
Genome analysis and annotation. Genome Annotation Which sequences code for proteins and structural RNAs ? What is the function of the predicted gene products.
Hidden Markov Models in Bioinformatics Example Domain: Gene Finding Colin Cherry
Finding Eukaryotic Open reading frames.
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
Gene Finding. Finding Genes Prokaryotes –Genome under 10Mb –>85% of sequence codes for proteins Eukaryotes –Large Genomes (up to 10Gb) –1-3% coding for.
1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.
Gene Identification Lab
Computational Biology, Part 4 Protein Coding Regions Robert F. Murphy Copyright  All rights reserved.
Lecture 12 Splicing and gene prediction in eukaryotes
Eukaryotic Gene Finding
Gene Finding Genome Annotation. Gene finding is a cornerstone of genomic analysis Genome content and organization Differential expression analysis Epigenomics.
Biological Motivation Gene Finding in Eukaryotic Genomes
Finding prokaryotic genes and non intronic eukaryotic genes
Hidden Markov Models In BioInformatics
Gene Structure and Identification
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.
Applications of HMMs Yves Moreau Overview Profile HMMs Estimation Database search Alignment Gene finding Elements of gene prediction Prokaryotes.
Basic Introduction of BLAST Jundi Wang School of Computing CSC691 09/08/2013.
Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar.
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Doug Raiford Lesson 3.  Have a fully sequenced genome  How identify the genes?  What do we know so far? 10/13/20152Gene Prediction.
BME 110L / BIOL 181L Computational Biology Tools February 19: In-class exercise: a phylogenetic tree for that.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Genomics: Gene prediction and Annotations Kishor K. Shende Information Officer Bioinformatics Center, Barkatullah University Bhopal.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop January 31, 2012.
Advancing Science with DNA Sequence Finding the genes in microbial genomes Natalia Ivanova MGM Workshop May 15, 2012.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation.
From Genomes to Genes Rui Alves.
Eukaryotic Gene Prediction Rui Alves. How are eukaryotic genes different? DNA RNA Pol mRNA Ryb Protein.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
How can we find genes? Search for them Look them up.
ORF Calling. Why? Need to know protein sequence Protein sequence is usually what does the work Functional studies Crystallography Proteomics Similarity.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Applied Bioinformatics
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
Annotation of eukaryotic genomes
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Pairwise Sequence Alignment
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
Bacterial infection by lytic virus
ORF Calling.
bacteria and eukaryotes
Bacterial infection by lytic virus
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
Introduction to Bioinformatics II
Reading Frames and ORF’s
Microbial gene identification using interpolated Markov models
Presentation transcript:

Bioinformatics as Hard Disk Investigation Assuming you can read all the bits on a 1000 year old hard drive Can you figure out what does what? - Distinguish program section (gene?) - Distinguish overwritten fragments (junk dna?) - Uncompress compressed data (???) - Detect “clever” programmer tricks (???)

That’s too easy! How do you read the bits of the hard drive? How do you know to read bits and in what order? A more accurate analogy requires the hard drive to incorporate information about the computer, enough to enable reproduction.

Further Complications Are all the programs active? Under what circumstances do they become active? Can some programs control other programs? (promoters/suppressors) Can some programs modify other programs? Can some programs change the rules of interpretation?

A Summary of Bioinformatics Given a genome - Figure out what parts do what - What are the rules? - What changes what? Under what circumstances? - What changes the rules? How? Why? Are there any steadfast rules? - The laws of physics - The laws of chemistry

Gene Identification Lab Shuba Gopal Biology Department Rochester Institute of Technology and Rhys Price Jones Computer Science Department Rochester Institute of Technology

Gene Identification involves: Locating genes within long segments of genomic sequence. Demarcating the initiation and termination sites of genes. Extracting the relevant coding region of each gene. Identifying a putative function for the coding region.

Outline of Session Quick review of genes, transcription and translation Gene finding in prokaryotes Some prokaryotic gene finders Improving on ORF finding Gene finding in eukaryotes Some eukaryotic gene finders

Defining the Gene What is the unit we call a gene? - A region of the genome that codes for a functional component such as an RNA or protein. We'll focus on protein-coding genes for the remainder of this session. A gene can be further divided into sequence elements with specific functions. Genes are regulated and expressed as a result of interactions between sequence elements and the products of other genes.

Schematic of a gene

Finding Genes in Genomes Gene = Coding region What defines a coding region? - A coding region is the region of the gene that will be translated into protein sequence. Is there such a thing as a canonical coding region? Objective : Identify coding regions computationally from raw genomic sequence data.

Coding Regions as Translation Regions Translation utilizes a trinucleotide coding system: codons. Translation begins at a start codon. Translation ends at a stop codon.

Some Important Codons Most organisms use ATG as a start codon. - A few bacteria also GTG and TTG - Regardless of codon used, the first amino acid in every translated peptide chain is methionine. However, in most proteins, this methionine is cleaved in later processing. So not all proteins have a methionine at the start. Almost all organisms use TAG, TGA and TAA as stop codons. - The major exception are the mycoplasmas.

The Degenerate Code Of the other 60 triplet combinations, multiple codons may encode the same amino acid. - E.g. TTT and TTC both encode phenylalanine Organisms preferentially use some codons over others. This is known as codon usage bias. - The age of a gene can be determined in part by the codons it contains. Older genes have more consistent codon usage than genes that have arrived recently in a genome.

Identifying Genes in Genomes Organisms utilize a variety of mechanisms to control the transcription and expression of their genes. Manipulating gene structure is one such method of control. - Coding regions can be in contiguous segments, or - They may be divided by non-coding regions that can be selectively processed.

Understanding the Tree of Life There are three major branches of the tree: - Bacteria (prokaryote) - Archaea (prokaryote) - Eukaryotes

Coding Regions in Prokaryotes In bacteria and archaea, the coding region is in one continuous sequence known as an open reading frame (ORF).

Coding Regions in Prokaryotes DNA: ATG-GAA-GAG-CAC-CAA-GTC-CGA-TAG Protein: MET-GLU- GLU -HIS -GLN-VAL-ARG-Stop

Where's Waldo (the Gene)? Time for some fun - design your own prokaryote gene finder. Follow the lab exercises to identify regions of the E. coli genome that might contain ORFs.

Some Gene Finders in Prokaryotes Because the translation region is contiguous in prokaryotes, gene finding focuses primarily on identifying ORFs. - ORF-finder takes a syntactic approach to identifying putative coding regions. ORF-finder is available from NCBI. - GLIMMER 2.0 is a more sophisticated program that attempts to model codon usage, average gene length and other features before identifying putative coding regions. GLIMMER 2.0 is available from TIGR.

ORF-Finder Approach - Identify every stop codon in the genomic sequence. - Scan upstream to the farthest, in-frame start codon. Will locate ORFs that begin with ATG as well as GTG and TTG - Label this an ORF. Output - List all ORFs that exceed a minimum length constraint.

ORF-Finder The black lines represent each of the three reading frames possible on one strand of DNA. The gray boxes each represent a putative ORF.

ORF-Finder Advantages - Can identify every possible ORF. - Minimum length constraint ensures that many false positives are discarded prior to human review. Disadvantages - Does not eliminate overlapping ORFs. - Even with a length constraint, there are often many false positives. - Cannot take into account organism- specific idiosyncrasies

ORF-Finder Example In this example, there are seven possible ORFs. However, only ORF D and G are likely to be coding. The others may be eliminated because they are: - Too small ORFs A, C and E - Overlap with other ORFs, ORFs B, C and F - Have extremely unusual codon composition.

Glimmer 2.0 Approach - Build an Interpolated Markov Model (IMM) of the canonical gene from a set of known genes for the organism of interest. - The model includes information about: Average length of coding region Codon usage bias (which codons are preferentially used) Evaluates the frequency of occurrence of higher order combinations of nucleotides from 2 through 8 nucleotide combinations.

Glimmer 2.0 Output - For each ORF, GLIMMER assigns a likelihood score or probability that the ORF resembles a known gene. - High scoring ORFs that overlap significantly with other high scoring ORFs are reported but highlighted. GLIMMER 2.0 is reported to be 98% accurate on prokaryotic genomes.

Glimmer 2.0 Advantages: - Fewer false positives because ORFs are evaluated for likelihood of coding. - Organism-specific because model is built on known genes. - User can modify many parameters during search phase. Disadvantages: - Requires approximately 500+ known genes for proper training. - Genuine coding regions with unusual codon composition will be eliminated. - Reported accuracy difficult to reproduce.

Other features of prokaryotic genes While the ORF is the defining feature of the coding region, there are other features we can use to identify true coding regions. We can improve accuracy by: - Identifying control regions Promoters Ribosome binding sites - Characterizing composition CpG islands Codon usage

Schematic of a gene

Characterizing Promoters A promoter is the DNA region upstream of a gene that regulates its expression. - Proteins known as transcription factors bind to promoter sequences. - Promoter sequences tend to be conserved sequences (strings) with variable length linker regions. - Ab initio identification of promoters is difficult computationally. A database of known, experimentally characterized promoters is available however.

Ribosome binding sites The ribosome binding site (RBS) determines, in part, the efficiency with which a transcript is translated. Ribosome binding sites in prokaryotes are relatively short, conserved sequences and have been characterized to some extent. - Eukaryotic ribosome binding sites are more variable and not as well characterized. - They may also not be conserved from one organism to another.

E. coli RBS Consensus Sequence

Genomic Jeopardy! Compare your list of predicted ORFs from the E. coli genome with the verified set from GenBank. - How well did your gene finder perform? - Follow the lab exercises to evaluate your gene finder.

Characterizing composition Codon usage (preferential use of certain codons over others) can be modelled given sufficient data on known genes. - This is part of Glimmer's approach to gene identification. Gene rich regions of the genome tend to be associated with CpG islands. - Regions high in G+C content - Multiple occurrences of CG dinucleotides. - These can be modelled as well.

Summary: Prokaryote Gene Finding Prokaryotic coding regions are in one contiguous block known as an open reading frame (ORF). Identifying an ORF is just the first step in gene finding. The challenge is to discriminate between true coding regions and non-coding ORFs. - Using information from promoter analysis, RBS identification and codon usage can facilitate this process.

Coding Regions in Eukaryotes In eukaryotes, the coding regions are not always in one block.

Coding Regions in Eukaryotes DNA: ATG-GAA-GAG-CAC- *GTTAACACTACGCATACAG* -CAA-GTC-CGA-TAG Protein: MET-GLU-GLU-HIS-GLN-VAL-ARG-Stop

Gene Finders in Eukaryotes Tools for finding genes in eukaryotes - Genie uses information from known genes to guess what regions of the genome are likely to contain new genes. - Fgenes is very good at finding exons and reasonably accurate at determining gene structure. - Genscan is one of the most sophisticated and most accurate.

Genie Approach - Apply a pre-built Generalized Hidden Markov Model (GHMM) of the canonical eukaryotic (mammalian) gene. - The model includes information about: Average length of exons and introns. Compositional information about exons and introns. A neural-net derived model of splice junctions and consensus sequences around splice junctions. - Splice junction information can be further improved by including results of homology searches.

Genie Output - Likelihood scores for individual exons - The set of exons predicted to be associated with any given coding region. - Information regarding alignment of the predicted coding region to known proteins from homology searching. Genie is approximately 60-75% accurate on eukaryotic genomes.

Genie Example Actual gene structure: Initial Prediction by Genie:

Genie Example Sequence homology alignments: Corrected prediction:

Genie Advantages: - Extraneous predicted exons can be eliminated based on evidence from homology searches. - Likelihood scores provided for each predicted exon. Disadvantages: - No organism-specific training is possible. - Works best on mammalian genomes, not other eukaryotes. - Reliance on homology evidence can result in oversight of novel genes unique to the organism of interest.

Fgenes Approach - Identifies putative exons and introns. - Scores each exon and intron based on composition. - Uses dynamic programming to find the highest scoring path through these exons and introns. The best-scoring path is constrained by several factors including that exons must be in frame with each other and ordered sequentially.

Fgenes Output - Gene structure derived from best path through putative exons and introns. - Alternative structures with high scores. Fgenes is about 70% accurate in most mammalian genomes.

Fgenes Example Actual gene structure: Initial predicted exons and scores:

Fgenes Example Initial gene structure prediction: Final gene structure prediction:

Fgenes Advantages: - Alternative gene structures are reported. - Also attempts to identify putative promoter and poly-A sites. Disadvantages: - User cannot train models. - Only human model- based version is available for unrestricted public use.

Genscan Approach - Models for different states (GHMMs) State 1 and 2: Exons and Introns - Length - Composition - State 3: Splice junctions Weight matrix based array to identify consensus sequences Weight matrix to identify promoters, poly-A signals and other features.

Genscan Output - Gene structure - Promoter site - Translation initiation exon - Internal exons - Terminal exon (translation termination) - Poly-adenylation site Genscan is 80% accurate on human sequences.

Genscan Advantages: - Most accurate of available tools. - Excellent at identifying internal and terminal exons - Provides some assistance in identifying putative promoters Disadvantages: - User cannot train models nor tweak parameters. - Identification of initial exons is weaker than other kinds of exons. - Promoter identification can be mis-leading.

Summary for Eukaryote Gene Finding Eukaryotic gene structures can be quite complex. The best approaches to gene finding in eukaryotes combine probabilistic methods with heuristics to yield reasonable accuracy. - But even in the best case scenario, accuracy is only about 80%.

Resources for Gene Finding For the most recent comparison of gene finding tools, check the Banbury Cross pages: - fr/igs/banbury/ Other resources are available at: - NCBI - TIGR - Sanger Institute