CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

BIOINFORMATICS GENE DISCOVERY BIOINFORMATICS AND GENE DISCOVERY Iosif Vaisman 1998 UNIVERSITY OF NORTH CAROLINA AT CHAPEL HILL Bioinformatics Tutorials.
An Introduction to Bioinformatics Finding genes in prokaryotes.
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Ab initio gene prediction Genome 559, Winter 2011.
1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.
Gene Finding BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
McPromoter – an ancient tool to predict transcription start sites
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
Gene prediction and HMM Computational Genomics 2005/6 Lecture 9b Slides taken from (and rapidly mixed) Larry Hunter, Tom Madej, William Stafford Noble,
Methods of identification and localization of the DNA coding sequences Jacek Leluk Interdisciplinary Centre for Mathematical and Computational Modelling,
BIOS816/VBMS818 Lecture 7 – Gene Prediction Guoqing Lu Office: E115 Beadle Center Tel: (402) Website:
Introduction to BioInformatics GCB/CIS535
Computational Biology, Part 4 Protein Coding Regions Robert F. Murphy Copyright  All rights reserved.
Lecture 12 Splicing and gene prediction in eukaryotes
Protein Synthesis.
Biological Motivation Gene Finding in Eukaryotic Genomes
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Gene Structure and Identification
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.
Chapter 11 DNA and Genes. Proteins Form structures and control chemical reactions in cells. Polymers of amino acids. Coded for by specific sequences of.
Transcription Transcription is the synthesis of mRNA from a section of DNA. Transcription of a gene starts from a region of DNA known as the promoter.
Gene Finding BIO337 Systems Biology / Bioinformatics – Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BIO337/Spring.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
BME 110L / BIOL 181L Computational Biology Tools February 19: In-class exercise: a phylogenetic tree for that.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Genomics: Gene prediction and Annotations Kishor K. Shende Information Officer Bioinformatics Center, Barkatullah University Bhopal.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Peptide Bond Formation Walk the Dogma RECALL: The 4 types of organic molecules… CARBOHYDRATES LIPIDS PROTEINS (amino acid chains) NUCLEIC ACIDS (DNA.
Exploring Alternative Splicing Features using Support Vector Machines Feature for Alternative Splicing Alternative splicing is a mechanism for generating.
Protein Synthesis Transcription and Translation. Protein Synthesis: Transcription Transcription is divided into 3 processes: –Initiation, Elongation and.
From Genomes to Genes Rui Alves.
Interpolated Markov Models for Gene Finding BMI/CS 776 Mark Craven February 2002.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
Transcription Vocabulary of transcription: transcription - synthesis of RNA under the direction of DNA messenger RNA (mRNA) - carries genetic message from.
Chapter 8: From DNA to Protein Section Transcription
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
DNA, chromosomes, genes What is a gene? Triplet code? Compare prokaryotic and eukaryotic DNA.
(H)MMs in gene prediction and similarity searches.
Starter What do you know about DNA and gene expression?
CFE Higher Biology DNA and the Genome Transcription.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Identification of Coding Sequences Bert Gold, Ph.D., F.A.C.M.G.
Unit 1: DNA and the Genome Structure and function of RNA.
Gene Expression : Transcription and Translation 3.4 & 7.3.
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
What is a Hidden Markov Model?
Chapter 10 How Proteins are Made.
Interpolated Markov Models for Gene Finding
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
More on translation.
Transcription & Translation.
Introduction to Bioinformatics II
Generalizations of Markov model to characterize biological sequences
CISC 667 Intro to Bioinformatics (Fall 2005) Hidden Markov Models (IV)
Profile HMMs GeneScan TMMOD
Microbial gene identification using interpolated Markov models
How Proteins are Made.
The Toy Exon Finder.
Presentation transcript:

CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation

CISC667, F05, Lec18, Liao2 Gene prediction strategies Content-based –Codon usage –Periodicity of repeats –Compositional complexity Site-based –Binding sites for transcription factors –polyA tracts, –Donor and acceptor splice sites –Start and stop codons Comparative –BLAST

CISC667, F05, Lec18, Liao3 Gene expression –Transcription: DNA → mRNA –Translation: mRNA → Protein

CISC667, F05, Lec18, Liao4 Kimball’s Biology page

CISC667, F05, Lec18, Liao5 Kimball’s Biology page

CISC667, F05, Lec18, Liao6 ACCUUAGCGUA Thr Leu Ala ACCUUAGCGUA Pro Stop Arg ACCUUAGCGUA Leu Ser Val Reading frame 1 Reading frame 2 Reading frame 3

CISC667, F05, Lec18, Liao7 Open Reading Frame (ORF)

CISC667, F05, Lec18, Liao8 Prokaryotic –Most regions of DNA are coding regions –No introns Eukaryotic –Introns and Exons

CISC667, F05, Lec18, Liao9 Fickett’s rule (1982) –In ORFs, every third base tends to be the same one more often than by chance alone. Regardless species, No knowledge of codon preference is required.

CISC667, F05, Lec18, Liao10 Codon Usage Index –There are 64 codons but 20 amino acids to code, therefore some AAs are coded by multiple codons. For example, 6 codons for Leu, and 4 for Ala, but only one for Try. For random DNA sequences, the frequency of having these three AAs would be 6/4/1 for Lue:Ala:Trp. In real protein sequences, ratio was found to be 6.9/6.5/1, which implies coding DNA sequence is not random; some codons are preferred (depending on species.)

CISC667, F05, Lec18, Liao11 Codon usage database:

CISC667, F05, Lec18, Liao12

CISC667, F05, Lec18, Liao13 ORFs as Markov chains Glimmer: Interpolated Markov models (IMM) –1 st order model: p(a|a), p(a|c), …, p(t|t). Probability of having a amino acid given its previous neighbor. –2 nd order model: p(a|xx) –Up to 8 th order model (0 th for random, 1 st to 6 th for 6 reading frames, why higher order? Why stop at 8 th ? E.g, 5 th order model, need 4 6 conditional probabilities p(a|xxxxx). In a genome of 1.8Mb, for each 6mer, we can observe about 1.8Mb/4096 samples. But the higher k, the less number of samples for kmers. –Interpolation (linear combination of models of different orders) P(S|M) = ∑ x=1 n IMM 8 (S x ) where S x is the oligomer ending at position x and n is the sequence length. The interpolated Markov model score is IMM k (S x ) = λ k (S x-1 ) P k (S x ) + [1- λ k (S x-1 ) ] IMM k-1 (S x ) where λ k (S x-1 ) is the numerical weight associated with the kmer ending at position x-1, and P k (S x ) is the probability of having S x, predicted by k-th order model.

CISC667, F05, Lec18, Liao14 Glimmer (cont’d) Results for H. Influenzae: modelFoundMissed New Glimmer th order

CISC667, F05, Lec18, Liao15 Self-identification (Audic and Claverie ‘98) Probability of sequence W of length L is generated by a k- th order Markov chain P(W|M) = P(S 0 ) ∏ i=k L-1 P(n i | S i-k ) eq(1) where S i is a kmer starting at position i in W. The model contains all the probabilities for any possible kmers to be followed by one of nucleotides A, C, G, or T. k = 5 is used. Which model is better? P(M j | W) = P(W|M j )P(M j ) / ∑ r= 1 to N P(W|M r )P(M r ) eq(2) where a priori probability P(M j ) is assume to be equiprobable for N models, i.e., is 1/N. If we have three models corresponding to coding, reverse coding, and noncoding, then the posterior probability tells what sequence W is more likely to be.

CISC667, F05, Lec18, Liao16 Model building (no training data is required) –How to build the three models? If we have regions labeled as coding, reverse coding, and noncoding, then we can count the frequencies to train the transition matrices. –Self consistent Randomly cut into nonoverlapping pieces of w bases long, and assign them randomly into three distinct subsets, and build three Markov models M1, M2, and M3 respectively. Scan genomic sequence using size w window. For each window segment, determine its class using eq(2). Slide the window by 5 bases, and repeat the process. If a region is covered by n (for 5n ≥ w) successive windows of M j type, it is qualified to be assigned into the j data set. After finishing one scan of the genomic sequence, the 3 subsets are updated, and new Markov models are built for each of the 3 subsets. Repeat the whole process until convergence is reached.

CISC667, F05, Lec18, Liao17 Results (k = 5, w = 100) –Convergence is reached after 50 iterations –H. pylori Correct rate: 95%, 94%, 93.8% for coding, reverse coding and noncoding respectively.

CISC667, F05, Lec18, Liao18 More markov based tools HMMgene –Hidden Markov model whole genes partial genes Cosmids or even longer sequences. GeneScan Genemark –

CISC667, F05, Lec18, Liao19

CISC667, F05, Lec18, Liao20 Neural Network Promoter Prediction Reese MG, ``Computational prediction of gene structure and regulation in the genome of Drosophila melanogaster'', PhD Thesis (PDF), UC Berkeley/University of Hohenheim.

CISC667, F05, Lec18, Liao21 Prediction Assessment ( David Mount, 2 nd ed, page 384 ) –TP (true positive), FP (false positive), TN (true negative), FN (false negative) –Actual positive, negative; Predicted positive, negative –TP + TN + FP + FN = N –Sensitivity = TP /(TP + FN) –Specificity = FP /(TP + FP) –Correlation Coefficient = (TP TN - FP FN)/  (PP PN AP AN ) –CC = [-1, 1]

CISC667, F05, Lec18, Liao22 Strategies for Gene finding Challenges remain: –partial genes, non-coding RNA genes, etc. MZEF best for single exons GENSCAN best for whole genes Shall try one more method for cross reference Shall resort to comparative method, e.g., run BLAST against dbEST and/or protein databases.

CISC667, F05, Lec18, Liao23

CISC667, F05, Lec18, Liao24