

Definitions of Annotation
- Interpreting raw sequence data into useful biological information
- Information attached to genomic coordinates with a start and end point; it can occur at different levels
- Addition of as much reliable and up-to-date information as possible to describe a sequence
- Identification, structural description, and characterization of putative protein products and other features in primary genomic sequence

Genome annotation
Two main levels:
1. Structural annotation (nucleotide- and protein-level annotation): finding genes and other biologically relevant sites, thus building up a model of the genome as objects with specific locations
2. Functional annotation: the objects are used in database searches (and experiments); the aim is to attribute biologically relevant information to the whole sequence and to individual objects
In large-scale genome analysis projects, annotation is the rate-limiting step.

Structural annotation: Gene prediction
This step consists of identifying the coding genes in the DNA sequence. Coding genes have numerous properties that can be used to detect them in a genomic sequence.

Gene Prediction Programs
Based on two factors:
- Compositional bias found in protein-coding regions
- Similarity with known sequences
Neither is accurate enough without cDNA sequence data, so a prediction remains highly hypothetical.

Three types of information are used in predicting gene structures
a) Signal measures: "signals" in the sequence, such as splice sites.
- Searching by signal: the analysis of sequence signals that are potentially involved in gene specification.
- The most important features to identify are the splice junctions: the donor and acceptor sites.
- Other signals, such as TATA boxes, transcription factor (TF) binding sites, and CpG islands, are also taken into consideration for accurate gene prediction.
- Poly(A) addition signals are also sometimes used to identify the 3' end of a gene.

b) Content measures: "content" statistics, such as codon bias.
- "Content" statistics help to distinguish coding from noncoding regions.
- Searching by content: the analysis of regions showing a compositional bias that has been correlated with coding regions.
c) Similarity measures: similarity to known genes.
- A region of genomic DNA that is significantly similar to a known sequence will usually have the same, or a very similar, function.
- Methods are based on comparing the genomic sequence with known coding sequences, e.g. BLASTx (Gish & States 1993).
- Useful for ORFs in prokaryotic genomes.

d) Homolog-based gene prediction: comparing sequences of interest against known coding sequences.
e) Comparative gene prediction: comparing sequences of interest against anonymous genomic sequences.
Both are examples of extrinsic or look-up gene prediction, in which gene structure is predicted through comparison with other sequences whose characteristics are already known.
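The signal measures described under a) can be illustrated with a very small scan for candidate splice sites. This is only a sketch, assuming the canonical GT donor and AG acceptor dinucleotides; real gene finders score a longer window around each site. The function name `find_splice_signals` and the test sequence are hypothetical.

```python
def find_splice_signals(seq):
    """Return 0-based positions of candidate donor (GT) and acceptor (AG) dinucleotides."""
    seq = seq.upper()
    # Every GT is a candidate donor site, every AG a candidate acceptor site.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Toy sequence, invented for illustration only.
donors, acceptors = find_splice_signals("ATGGTAAGTCCTTTCAGGA")
print(donors, acceptors)
```

Most of the positions found this way are false positives, which is why signal measures are combined with content and similarity measures.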

Prokaryotic genes
- Defined by single open reading frames (ORFs)
- Usually found adjacent to one another
Eukaryotic genes
- Coding sequences (the exons) are interrupted by large, noncoding introns
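Because a prokaryotic gene is a single ORF, a first-pass gene finder can simply scan for ATG ... stop stretches. The following is a minimal sketch under simplifying assumptions: forward strand only, ATG as the only start codon, and a hypothetical `min_codons` length cut-off. The function name and test sequence are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs (0-based, end exclusive) of ATG...stop ORFs
    in the three forward reading frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a new ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return orfs


orfs = find_orfs("CCATGACAGATTACTAGTT")
print(orfs)
```

A realistic scan would also cover the reverse complement and use a much larger minimum length to suppress ORFs that occur by chance.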

Algorithms for gene prediction
Most gene prediction programs use statistics-based measures to identify protein-coding regions. However, these programs generate an enormous number of candidate gene models even for a short DNA sequence. This problem has been resolved through the use of dynamic programming, which finds the highest-scoring gene model without examining all possible ones. Hidden Markov model (HMM) based methods have become more popular for several reasons: the models have an intuitive analogy to the things being modelled (in this case, gene structures), and they have a consistent mathematical formalism that allows for rigorous analysis.

Basic steps followed in gene prediction methods

Gene Prediction in Eukaryotes
1. Identifying and scoring suitable splice sites and start and stop signals along the query sequence.
2. Predicting candidate exons, as deduced through the detection of these signals.
3. Scoring these exons as a function of both the signals used to detect them and the coding statistics computed on the putative exon sequence itself. In homology-based and comparative methods, exon scores also factor in the quality of the alignment between the query sequence and either known coding sequences or anonymous genomic sequences.
4. Assembling a subset of these candidates into a predicted gene structure that maximizes a particular scoring function, which depends on the scores of the individual exon candidates that make up the predicted gene structure.
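The assembly step (step 4) can be sketched with dynamic programming over candidate exons. The sketch below assumes the simplest possible scoring function, the sum of individual exon scores over a set of non-overlapping exons, and ignores reading-frame and splice-site compatibility, which real programs also enforce. The function name `best_exon_chain` and the candidate exons are hypothetical.

```python
from bisect import bisect_right


def best_exon_chain(exons):
    """exons: list of (start, end, score) with end exclusive.
    Returns (best_total_score, chain) for the best set of non-overlapping exons."""
    exons = sorted(exons, key=lambda e: e[1])   # order by end coordinate
    ends = [e[1] for e in exons]
    # best[k] holds the optimal (score, chain) using only the first k exons.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end at or before this exon's start
        j = bisect_right(ends, start, 0, k)
        take_score = best[j][0] + score
        if take_score > best[k][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[k])                # skipping this exon is better
    return best[-1]


# Invented candidate exons: (start, end, score)
candidates = [(0, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 260, 6.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

Each candidate is examined once, so the best chain is found without enumerating every possible combination of exons.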

Prediction of Genes Through
1. Coding Statistics
2. Statistics-based (ab initio) Methods
3. Homology-based Methods
4. Combination Tools
5. Comparative Approaches

Coding statistics
Coding regions of a sequence have different properties from non-coding regions (the non-random properties of coding regions):
- GC content
- Codon bias (CODON FREQUENCY)
- Third base composition: every third base in a coding region tends to be the same one much more often than by chance alone (TESTCODE)
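Two of these statistics are easy to compute directly. The following sketch shows GC content and the base composition at every third position for a given frame; it is not the TESTCODE algorithm itself, only the raw counts such a method builds on. The function names and the example sequence are hypothetical.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def third_position_counts(seq, frame=0):
    """Count which base occupies the third position of each codon in a frame."""
    seq = seq.upper()
    return Counter(seq[i + 2] for i in range(frame, len(seq) - 2, 3))


example = "ATGGCCGCGGCT"          # invented 4-codon sequence
gc = gc_content(example)
third = third_position_counts(example)
print(gc, third)
```

In practice these values are computed in a sliding window along the genome and compared with the values expected for non-coding sequence.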

Statistics-based (ab initio) Methods
The ab initio ("from the beginning") approach predicts genes directly from the computational properties of exons, introns, and other features in the genomic sequence, without reference to experimental data.
Approaches: hidden Markov models, neural networks, decision trees, and integration of various statistical approaches.
Programs: GenScan, Genie, Genemark, Veil, HMMgene, GeneID, Grail II, GrailEXP_Perceval, FGENESH, MZEF, MZEF-SPC.

Ab initio Gene Prediction
- Good at predicting coding nucleotides (> 90%)
- Moderately good at finding exon boundaries (70-75% correct per exon; lower for initial and final exons)
- Poor at predicting complete gene structures: fewer than 50% of predicted genes correspond to actual genes
- The difficulty is locating intron-exon boundaries
- Easier for simpler organisms (prokaryotes)
- Accuracy can be improved by combining methods: different methods often predict different elements of an actual gene and could complement each other, yielding a better prediction

GenScan (J. Mol. Biol., 268, 78-94, 1997)
GENSCAN is a program designed to predict complete gene structures, including exons, introns, promoter and polyadenylation signals, in genomic sequences. It differs from the majority of existing gene finding algorithms in that it allows for partial genes as well as complete genes and for the occurrence of multiple genes in a single sequence, on either or both DNA strands. Program versions suitable for vertebrate, nematode (experimental), maize and Arabidopsis sequences are currently available. The vertebrate version also works fairly well for Drosophila sequences.

GenScan (J. Mol. Biol., 268, 78-94, 1997)
GenScan is substantially more general than earlier tools (e.g., Genemark). It handles:
- Single-exon as well as multi-exon genes
- Promoters, polyadenylation signals, and intergenic sequences
- Genes occurring on either or both strands

GenScan (J. Mol. Biol., 268, 78-94, 1997)
Limitations of GenScan
- High level of overprediction (~50%)
- No capacity to detect alternative splicing
- Organism: more accurate for human/vertebrate sequences
- Internal exons are predicted more accurately than initial or terminal exons, and exons are predicted more accurately than polyadenylation or promoter signals

GRAIL
- Gene finder for human, mouse, Arabidopsis, Drosophila, and E. coli
- Based on neural networks
- Masks human and mouse repetitive elements
- Incorporates pattern-based searches for several types of promoters and simple repeats
- Accuracy in the 75-95% range

Homology-based Methods
The homology-based approach identifies genes with the aid of experimental data. It exploits the alignment between the genomic sequence and a database of known cDNA (or protein) sequences.
1. Local alignment methods (BLAST-based)
2. Pattern-based alignment methods
Programs: AAT, GAIA, INFO Flash, ICE, CRASA

Difficulties of Homology-based Methods
- Time consuming (CPU time)
- Storage
- Noise removal
- Exon boundaries

Combination Tools
These tools combine sequence similarity and ab initio gene-finding approaches. They predict genes by producing a spliced alignment between a genomic sequence and a candidate amino acid sequence.
1. Procrustes
2. GeneWise
3. GenomeScan
4. FGENESH+ & FGENESH++
5. GrailEXP_Gawain and _GALAHAD

Combination Tools
1. Procrustes
2. GeneWise
3. GenomeScan
4. FGENESH+ & FGENESH++
5. AAT (+GSA2)
[Diagram; labels: Genomic Seq., Protein or cDNA DB, BLAST alignment, Target protein, ab initio prediction, Predicted genes]

Difficulties of Combination Tools
All the difficulties of homology-based methods:
- Time consuming (CPU time)
- Storage
- Noise removal
- Exon boundaries
Drawbacks:
- Not every BLAST hit represents true homology (it may be a BLAST false positive or a pseudogene)
- Requires a set of thresholds

Comparative Approaches
These tools predict that genomic sequences conserved in other vertebrates are likely to be genes. Gene features (e.g., splice sites) that are conserved in both species can be given special credence, and partial gene models (e.g., pairs of adjacent exons) that fail to have counterparts in both species can be filtered out.
- TWINSCAN (BlastN + GENSCAN)
- SLAM
- DOUBLESCAN
- SGP-1 and SGP-2 (GENEID + TBlastX)
- ROSETTA program
- CEM program
- Ka/Ks ratio test (MegaBlast + nonsynonymous vs. synonymous)
- PSEP

Comparative Methods
Big challenge of these methods: distinguishing coding from non-coding sequence. The conserved sequences include both coding and non-coding regions.

Hidden Markov Models (HMM) for gene prediction
What is an HMM?
- An HMM describes the probability of transition between the hidden states of a model.
- The probability that one base pair is in a particular state depends on the state of the previous base pair.
- The transition probability to another state depends on the appearance of a transition signal (splice site) and/or the average number of bp in a certain hidden state (size of exons/introns).
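To make the idea of hidden states concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. It is a sketch only: the states ("N" for non-coding, "C" for coding), all probabilities, and the sequence are invented for illustration and are not taken from any real gene finder, whose models have many more states (exons, introns, splice sites, and so on).

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for the observed bases."""
    # V[t][s] = best log-probability of any path that ends in state s at position t
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Choose the best previous state to transition from.
            prev, score = max(
                ((p, V[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda x: x[1],
            )
            V[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy parameters (assumptions): the coding state is GC-rich, the non-coding state AT-rich.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
    "C": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4},
}
path = viterbi("AATTAAGCGCGCCG", states, start_p, trans_p, emit_p)
print("".join(path))
```

The low N-to-C and C-to-N transition probabilities discourage frequent state switches, which is how an HMM encodes the expectation that exons and introns have a typical length.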

Neural Networks for gene prediction (1)
What are neural networks?
- A neural network is a computer program that, given a training set of data exhibiting a certain pattern, learns to recognize that pattern.
- The name derives from the fact that they were originally intended to imitate the human brain.
- Like brain cells, a neural network consists of a central decision-making unit connected to other units with the same topology.

Assessing performance: Sensitivity and Specificity
- Testing of predictions is performed on sequences where the gene structure is known.
- Sensitivity is the fraction of known genes (or bases or exons) correctly predicted.
- Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes.
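Using the definitions above, both measures can be computed from two sets of features. The sketch below works at the exon level and assumes an exon counts as correct only if both coordinates match exactly; the function name and the coordinates are hypothetical.

```python
def sensitivity_specificity(predicted, actual):
    """Sensitivity = correct / actual; specificity = correct / predicted
    (as defined for gene prediction above)."""
    predicted, actual = set(predicted), set(actual)
    correct = len(predicted & actual)
    sn = correct / len(actual) if actual else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp


# Invented exon coordinates (start, end)
actual_exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted_exons = [(100, 200), (300, 400), (450, 480)]
sn, sp = sensitivity_specificity(predicted_exons, actual_exons)
print(sn, sp)
```

Here two of the four known exons are recovered (sensitivity 0.5) and two of the three predictions are real (specificity about 0.67), which shows why both numbers must be reported together.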

Some examples of the methods used:
1) Similarity searches: GRAIL-II, GENQUEST
2) Statistical / compositional bias: SorFind, HEXON, XPOUND, MZEF, GRAIL-II
3) Heuristic rule based systems: GeneID
4) Linguistic methods: GenLang
5) Linear discriminant analysis (LDA): HEXON, FGENEH
6) Decision tree: MORGAN
7) Dynamic programming: GeneParser, MORGAN, GREAT, GENVIEW, GAP-III
8) Markov models: ECOPARSE, VEIL, GENIE, GENSCAN, GENEMARK
9) Spliced alignment: PROCRUSTES
10) Quadratic discriminant analysis: MZEF

FGENESH
FGENES = "Find genes"
- 1st version: Solovyev et al
- Uses linear discriminant analysis to identify splice sites, exons, and promoter elements
- Filtered exons are assembled using a dynamic programming algorithm that searches paths of compatible exons, with the goal of maximizing the final gene score
FGENESH
- HMM-based version
- FGENESH+ and FGENESH-C incorporate protein and cDNA homology
- These perform better than the ab initio FGENESH

PROCRUSTES
PROCRUSTES is based on the spliced alignment algorithm, which explores all possible exon assemblies and finds the multi-exon structure with the best fit to a related protein. PROCRUSTES successfully recognizes genes with short exons as well as complicated genes with more than 20 exons. Test results demonstrate that the spliced alignment algorithm provides 99% accurate recognition of a mammalian gene if a related gene from another mammalian species is known.