Applied Bioinformatics

Applied Bioinformatics Week 6

Theoretical Part I: Gene Prediction

What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find out:
- Which region codes for a protein?
- Which DNA strand is used to encode the gene?
- Which reading frame is used in that strand?
- Where does the gene start and end?
- Where are the exon-intron boundaries in eukaryotes?
- (optionally) Where are the regulatory sequences for that gene?
Sanja Rogic, CS Department, UBC: Computational Gene Finding

Prokaryotic vs. Eukaryotic Gene Finding
Prokaryotes:
- small genomes, 0.5 – 10·10^6 bp
- high coding density (>90%)
- no introns
- Gene identification relatively easy, with success rate ~99%
- Problems: overlapping ORFs, short genes, finding TSS and promoters
Eukaryotes:
- large genomes, 10^7 – 10^10 bp
- low coding density (<50%)
- intron/exon structure
- Gene identification a complex problem, gene level accuracy ~50%
- Problems: many

Gene Structure

Gene Finding: Different Approaches
- Similarity-based methods (extrinsic): use similarity to annotated sequences (proteins, cDNAs, ESTs)
- Comparative genomics: aligning genomic sequences from different species
- Ab initio gene-finding (intrinsic)
- Integrated approaches

Similarity-based methods
- Based on sequence conservation due to functional constraints
- Use local alignment tools (Smith-Waterman algorithm, BLAST, FASTA) to search protein, cDNA, and EST databases
- Will not identify genes that code for proteins not already in databases (can identify ~50% new genes)
- Limits of the regions of similarity not well defined
- protein: databases SwissProt or PIR; cannot delimit UTRs; not all domains present
- cDNA: most relevant for determining gene structure; best if derived from the same organism
- EST: poor sequence quality; chimeric; contamination with primers; wrong orientation

Comparative Genomics
- Based on the assumption that coding sequences are more conserved than non-coding
- Two approaches: intra-genomic (gene families) and inter-genomic (cross-species)
- Alignment of homologous regions
- Difficult to define limits of higher similarity
- Difficult to find optimal evolutionary distance (pattern of conservation differs between loci)

Summary for Extrinsic Approaches
Strengths:
- Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions
Weaknesses:
- Limited to pre-existing biological data
- Errors in databases
- Difficult to find limits of similarity

Ab initio Gene Finding, Part 1
Input: A DNA string over the alphabet {A,C,G,T}
Output: An annotation of the string showing for every nucleotide whether it is coding or non-coding
AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG
Gene finder
AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG

Ab initio Gene Finding, Part 2
- Using only sequence information
- Identifying only coding exons of protein-coding genes (transcription start site, 5’ and 3’ UTRs are ignored)
- Integrates coding statistics with signal detection

Coding Statistics, Part 1
- Unequal usage of codons in the coding regions is a universal feature of genomes:
  - uneven usage of amino acids in existing proteins
  - uneven usage of synonymous codons (correlates with the abundance of corresponding tRNAs)
- We can use this feature to differentiate between coding and non-coding regions of the genome
- Coding statistic: a function that, for a given DNA sequence, computes a likelihood that the sequence is coding for a protein

Coding Statistics, Part 2
Many different ones:
- codon usage
- hexamer usage
- GC content
- compositional bias between codon positions
- nucleotide periodicity
- …
Hexamer usage has been shown to be the most discriminative, and the majority of current algorithms use it

An Example of Coding Statistics, Part 1
For each codon, the table displays:
- the frequency of usage of that codon (per thousand) in human (first column)
- the relative frequency of that codon among synonymous codons (second column)
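
The two columns described above can be computed from any set of coding sequences. The following is a minimal sketch, assuming the standard genetic code and an in-frame coding sequence; the function name `codon_usage` and the toy sequence are illustrative and do not come from the slides.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Return {codon: (per_thousand, relative_among_synonymous)} for an in-frame CDS."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    total = sum(counts.values())

    # Total count per amino acid, needed for the synonymous-codon column.
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODON_TABLE[codon]] += n

    return {
        codon: (1000.0 * n / total, n / aa_totals[CODON_TABLE[codon]])
        for codon, n in counts.items()
    }


if __name__ == "__main__":
    # Toy sequence (hypothetical): ATG GCT GCT GCC TAA
    for codon, (per_thousand, relative) in sorted(codon_usage("ATGGCTGCTGCCTAA").items()):
        print(codon, CODON_TABLE[codon], round(per_thousand, 1), round(relative, 2))
```

A real table would be built from many annotated genes of one organism; a single short sequence gives very noisy frequencies.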

Computing Coding Statistics in Practice
- Usually, the value of a coding statistic is computed using sliding windows, which gives a coding profile of the sequence
- Sliding window = successive overlapping windows
- Larger windows are required to detect a clear signal (50 – 200 bp)
- Small exons might be missed
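
The sliding-window idea can be sketched as follows. GC content stands in for the coding statistic only because it is the simplest one to compute; the function names are illustrative, and the default window of 120 with a step of 10 is just one plausible setting within the range mentioned above.

```python
def gc_content(window):
    """Fraction of G and C bases in a window (a very simple coding statistic)."""
    return sum(base in "GC" for base in window) / len(window)


def coding_profile(seq, score=gc_content, window=120, step=10):
    """Score successive overlapping windows; returns a list of (start, score) pairs."""
    seq = seq.upper()
    return [
        (start, score(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Tiny illustration with a short window so the result is easy to follow.
    print(coding_profile("AAAAGGGG", window=4, step=2))
```

Any other statistic, for example a codon-usage or hexamer log-likelihood, can be passed in through the `score` parameter. A sequence shorter than the window yields an empty profile, which is one way small exons get missed.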

Coding Profile of β-globin gene
- Window size: 120
- Distance between overlapping windows: 10
- LP computed for all three reading frames

Signal Sensors, Part 1
Signal: a string of DNA recognized by the cellular machinery

Signal Sensors, Part 2
Various pattern recognition methods are used for identification of these signals:
- consensus sequences
- weight matrices
- weight arrays
- decision trees
- Hidden Markov Models (HMMs)
- neural networks
- …

Example of Consensus Sequence
- Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest
- Leads to loss of information and can produce many false positive or false negative predictions
Sites: TACGAT TATAAT GATACT TATGAT TATGTT
Consensus sequence: TATAAT
Consensus (IUPAC): TATRNT
IUPAC: a set of symbols encoding each subset of the four nucleotides (R = purine, N = any)
Analogy with words: MELON, MANGO, HONEY, SWEET, COOKY give the consensus MONEY
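
A majority-vote consensus takes only a few lines of code. This is a minimal sketch; the function name `majority_consensus` is illustrative, and ties are resolved in favour of the symbol seen first in the column, which is an arbitrary choice.

```python
from collections import Counter


def majority_consensus(aligned):
    """Pick the most frequent symbol in each column of equal-length sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all sequences must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]   # ties: first symbol seen wins
        for column in zip(*aligned)
    )


if __name__ == "__main__":
    print(majority_consensus(["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]))
    print(majority_consensus(["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]))
```

On the word example this returns MONEY, a word that is not in the input, which illustrates the loss of information. On the five DNA sites as transcribed here, a strict majority vote returns TATGAT, because G outnumbers A at position 4 by three to two. TATAAT is the textbook consensus of the bacterial -10 promoter box.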

Example of (Positional) Weight Matrix
- Computed by measuring the frequency of every element at every position of the site (weight)
- Score for any putative site is the sum of the matrix values (converted to probabilities) for that sequence (log-likelihood score)
- Disadvantages: cut-off value required; assumes independence between adjacent bases
Sites: TACGAT TATAAT GATACT TATGAT TATGTT
Matrix layout: columns are positions 1 to 6, rows are A, C, G, T (values not preserved in this transcript)
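
A weight matrix for the five sites above can be built and used for scoring as sketched below. The pseudocount of 1 and the uniform background of 0.25 per base are assumptions made for the illustration, not values from the slides, and the function names are hypothetical.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2(frequency / background)."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def score_site(pwm, candidate):
    """Sum the per-position log-odds values; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, candidate))


if __name__ == "__main__":
    pwm = build_pwm(SITES)
    for candidate in ("TATAAT", "TATGAT", "CCCCCC"):
        print(candidate, round(score_site(pwm, candidate), 2))
```

Deciding whether a score is high enough to call a site still needs a cut-off, and the sum over positions is exactly where the independence assumption enters.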

Examples of Gene Finders
- FGENES: linear discriminant function (DF) for content and signal sensors, and dynamic programming (DP) for finding the optimal combination of exons
- GeneMark: HMMs enhanced with ribosomal binding site recognition
- Genie: neural networks for splicing, HMMs for coding sensors, overall structure modeled by HMM
- Genscan: weight matrices (WM), weight arrays (WA) and decision trees as signal sensors, HMMs for content sensors, overall HMM
- HMMgene: HMM trained using conditional maximum likelihood
- Morgan: decision trees for exon classification, also Markov models
- MZEF: quadratic DF, predicts only internal exons

Ab initio Gene Finding is Difficult
- Genes are separated by large intergenic regions
- Genes are not continuous, but split into a number of (small) coding exons, separated by (larger) non-coding introns
  - in humans, coding sequence comprises only a few percent of the genome and an average of 5% of each gene
- Sequence signals that are essential for elucidation of a gene structure are degenerate and highly unspecific
- Alternative splicing
- Repeat elements (>50% in humans); some contain coding regions
- It is almost impossible to distinguish signals that are truly processed by the cell from those that are apparently non-functional

Problems with Ab initio Gene Finding
- No biological evidence
- In long genomic sequences, many false positive predictions
- Prediction accuracy high, but not sufficient

End Theory I
Mind mapping
10 min break

Practice I

Prokaryotic DNA
- Finding protein coding regions
- Finding ORFs
- Go to NCBI and find the entry for M68521, gi|147118
- Get the FASTA sequence
- Keep the GenBank entry visible
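
Before running a dedicated gene finder, it can help to see what a naive ORF scan does. The sketch below scans all six reading frames for stretches that run from an ATG to the first in-frame stop codon. The function name `find_orfs`, the minimum length and the toy sequence are illustrative choices; real prokaryotic gene finders also consider alternative start codons and coding statistics.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, frame, start, end) tuples for ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive on the strand that was scanned;
    the length check counts the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


if __name__ == "__main__":
    # Toy sequence (hypothetical), short minimum length for illustration.
    print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```

For the exercise, the FASTA sequence of the GenBank entry would be read into a single string (header line removed) and passed to `find_orfs`; the result can then be compared with the annotated CDS features.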

GeneMark
http://exon.gatech.edu/GeneMark
- Paste the sequence and run the prediction
- Compare the GenBank entry and the predicted sequence
- Which one would you trust, and why?

Eukaryotes?
- Prokaryotes are easy ... or are they? Look at AE000141
- Eukaryotes are significantly more difficult
- Due to exon-intron structure we need much more sequence to predict a gene
- Sometimes more than a complete bacterial genome for just one human gene

Theoretical Part II: Evaluation of Gene Prediction

Evaluation of Gene Finding Programs
- Calculating accuracy of programs’ predictions
- Several evaluation studies:
  - Burset and Guigó, 1996 (vertebrate sequences)
  - Pavy et al., 1999 (Arabidopsis thaliana)
  - Rogic et al., 2001 (mammalian sequences)

Measures of Prediction Accuracy, Part 1
Nucleotide level accuracy
Each nucleotide is classified as TP, FP, TN or FN by comparing PREDICTION with REALITY
Sensitivity = TP / (TP + FN)
Specificity = TP / (TP + FP)
At the exon level, the same measures are counted per exon:
Sensitivity = number of correct exons / number of actual exons
Specificity = number of correct exons / number of predicted exons
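
A nucleotide-level comparison can be sketched as below. It assumes the convention common in gene-finding evaluations, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP). The encoding of annotations as strings of 'C' (coding) and 'N' (non-coding) and the function name are illustrative.

```python
def nucleotide_accuracy(actual, predicted):
    """Compare two equal-length annotation strings made of 'C' and 'N'.

    Returns (sensitivity, specificity) with
    sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP).
    """
    if len(actual) != len(predicted):
        raise ValueError("annotations must cover the same sequence")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)   # coding, predicted coding
    fn = sum(a == "C" and p == "N" for a, p in pairs)   # coding, missed
    fp = sum(a == "N" and p == "C" for a, p in pairs)   # non-coding, predicted coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    print(nucleotide_accuracy("NNCCCCNN", "NCCCNNNN"))
```

Note that this "specificity" is what other fields call precision; true negatives do not enter either measure, which matters because non-coding positions vastly outnumber coding ones in eukaryotic sequence.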

Measures of Prediction Accuracy, Part 2
Exon level accuracy
Figure: REALITY and PREDICTION tracks, with exons labelled WRONG EXON, CORRECT EXON and MISSING EXON
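
The exon-level categories can be counted as in the following sketch. It assumes that a correct exon matches both boundaries exactly, a missing exon is an actual exon with no overlapping prediction, and a wrong exon is a predicted exon with no overlapping actual exon. Exons are given as inclusive (start, end) pairs; the function names and coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_accuracy(actual, predicted):
    """Exon-level sensitivity/specificity plus counts of missing and wrong exons."""
    actual_set, predicted_set = set(actual), set(predicted)
    correct = actual_set & predicted_set            # both boundaries match exactly
    missing = [e for e in actual_set
               if not any(overlaps(e, p) for p in predicted_set)]
    wrong = [p for p in predicted_set
             if not any(overlaps(p, e) for e in actual_set)]
    return {
        "sensitivity": len(correct) / len(actual_set) if actual_set else 0.0,
        "specificity": len(correct) / len(predicted_set) if predicted_set else 0.0,
        "missing_exons": len(missing),
        "wrong_exons": len(wrong),
    }


if __name__ == "__main__":
    actual = [(100, 200), (300, 400), (500, 600)]
    predicted = [(100, 200), (310, 400), (700, 800)]
    print(exon_accuracy(actual, predicted))
```

In the example, the second prediction overlaps a real exon but has a shifted start, so it is neither correct nor wrong in this scheme. That is why exon-level scores are usually much lower than nucleotide-level scores.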

Evaluation Results

Eukaryotes
- Prokaryotes are easy ... or are they? Look at AE000141
- Eukaryotes are significantly more difficult
- Due to exon-intron structure we need much more sequence to predict a gene
- Sometimes more than a complete bacterial genome for just one human gene

Gene Finding
- Content based: codon usage
- Site based: start, splice sites, regulatory elements, binding sites, polyadenylation signals, …
- Comparative: homology

Eukaryotic Gene Structure
- Whereas control elements for bacterial promoters tend to be located nearby, eukaryotic control elements can be located up to 50 kb upstream or downstream of the gene. They can also be inside the gene.
- While most Pol II genes have a TATAA box, some don't

Raw Nucleotide Sequences
- Most sequences are raw nucleotide sequences
- How do we know whether it is a gene?
- There are certain measures which indicate that there may be a gene

Finding Genes
http://rulai.cshl.org/tools/genefinder/human.htm
- Get AF018429 from GenBank
- Enter the FASTA sequence and predict the gene
- Double check with http://genes.mit.edu/genomescan

More Gene Finding Tools
- Large collection: http://www.nslij-genetics.org/gene/programs.html
- Genscan: http://genes.mit.edu/GENSCAN.html
- HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
- GeneBuilder: http://zeus2.itb.cnr.it/~webgene/genebuilder.html

Gene Finding Fails
- When similar genes have not been encountered before (e.g., NTT, IPW)
- When some of the signals are missing
- When the “wrong” gene finding tool is used
- When the gene is small and/or has many introns

Practical Part II

Finding Genes
http://rulai.cshl.org/tools/genefinder/human.htm
- Get AF018429 from GenBank
- Enter the FASTA sequence and predict the gene
- Double check with http://genes.mit.edu/genomescan.html

More Gene Finding Tools
- Genscan: http://genes.mit.edu/GENSCAN.html
- HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
- Gene Prediction Software List: http://en.wikipedia.org/wiki/List_of_gene_prediction_software

Comparison
- Use 4 tools to perform gene prediction
- Store start and end positions for all exons in Excel for the 4 different results
- Add the annotation from GenBank to the results
- Sketch a plot showing the predictions on the genome
- Compare the results

Result Evaluation
- Is the number of exons different?
- Are start and end positions shifted? By how much?
- Are exons missing in some predictions? How many?
- Which result gives you the correct gene structure?
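
The evaluation questions can also be answered programmatically once the exon coordinates are collected. The sketch below pairs each annotated exon with the predicted exon that overlaps it most and reports how far the start and end are shifted. The function name and all coordinates are hypothetical placeholders, not results for AF018429.

```python
def compare_to_annotation(annotated, predicted):
    """For each annotated exon, report the best-overlapping prediction and its shifts.

    Exons are inclusive (start, end) pairs. Each report row is
    (annotated_exon, predicted_exon, start_shift, end_shift); the last three
    entries are None when no prediction overlaps the annotated exon.
    """
    report = []
    for a_start, a_end in annotated:
        best, best_overlap = None, 0
        for p_start, p_end in predicted:
            overlap = min(a_end, p_end) - max(a_start, p_start) + 1
            if overlap > best_overlap:
                best, best_overlap = (p_start, p_end), overlap
        if best is None:
            report.append(((a_start, a_end), None, None, None))   # missing exon
        else:
            report.append(((a_start, a_end), best,
                           best[0] - a_start, best[1] - a_end))
    return report


if __name__ == "__main__":
    annotation = [(100, 200), (300, 400)]            # hypothetical GenBank exons
    tool_results = {                                  # hypothetical tool output
        "tool_A": [(90, 200), (500, 600)],
        "tool_B": [(100, 200), (300, 410)],
    }
    for tool, exons in tool_results.items():
        print(tool, len(exons), "exons predicted")
        for row in compare_to_annotation(annotation, exons):
            print("  ", row)
```

A shift of zero at both ends means the exon was predicted exactly; rows with `None` correspond to exons that a tool missed entirely.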