Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.


Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from

The Gene Prediction Problem
Given genome sequences, determine where the genes are
The problem is easier for prokaryotes (no introns)
The problem is significantly harder for eukaryotes (introns, alternative splicing)

Splicing Causes Problem…

Exons vs. Introns
Exon: a portion of the gene that appears in both the primary and the mature mRNA transcripts.
Intron: a portion of the gene that is transcribed but excised prior to translation.
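The exon/intron distinction can be illustrated with a minimal sketch. The function name `splice`, the toy sequence, and the 0-based half-open exon coordinates are all assumptions made for this example; they do not come from the lecture.

```python
def splice(pre_mrna, exons):
    """Join the exon substrings of a primary transcript, dropping the introns.

    exons is a list of (start, end) pairs, 0-based and half-open
    (a convention assumed for this sketch).
    """
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy primary transcript: two exons (upper case) around one intron (lower case).
pre_mrna = "AUGGCC" + "guaaguuuucag" + "AAAUGA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)  # AUGGCCAAAUGA
```

A gene finder faces the inverse task: it sees only the unspliced genomic sequence and has to recover the exon coordinates.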

Definition of a Gene
Regulatory regions: up to 50 kb upstream of the +1 site
Exons: protein-coding and untranslated regions (UTR)
  – 1 to 178 exons per gene (mean 8.8)
  – 8 bp to 17 kb per exon (mean 145 bp)
Introns: splice acceptor and donor sites, junk DNA
  – average 1 kb – 50 kb per intron
Gene size: largest 2.4 Mb (dystrophin); mean 27 kb

Different Views of a Gene
[Figure: the same gene shown as DNA (ATGCTTGCCAAAT…TCG…), as pre-mRNA with exons e1, e2, e3 separated by introns, as mRNA with e1, e2, e3 joined, and as protein (MSRTAQ…).]

Approaches to Gene Prediction
Similarity-based approaches:
  – Exploit the fact that many genes are conserved across species
  – Can be highly reliable
  – Only good for finding genes similar to known genes
Statistical approaches:
  – Exploit statistical characteristics of coding regions and non-coding regions and other knowledge about genes
  – Can potentially detect new genes
  – May not be reliable
They can/should be combined
  – Currently no principled approaches for doing this

Outline
The idea of the similarity-based approach to gene prediction
Exon Chaining Problem
Spliced Alignment Problem

Using Known Genes to Predict New Genes
Some organisms' genomes may be very well-documented, with many genes having been experimentally verified.
Closely-related organisms may have similar genes
Unknown genes in one species may be compared to genes in some closely-related species

Comparing Genes in Two Genomes Small islands of similarity corresponding to similarities between exons

Reverse Translation
Given a known protein, find a gene in the genome which codes for it
One might infer the coding DNA of the given protein by reversing the translation process
  – Inexact: most amino acids map to more than one codon
  – This problem is essentially reduced to an alignment problem
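To see how inexact reverse translation is, the sketch below counts how many distinct coding DNA sequences could encode a given protein. The per-residue codon counts are those of the standard genetic code; the function name `count_coding_sequences` is an assumption of this example.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code
# (61 sense codons in total; stop codons are not included).
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_coding_sequences(protein):
    """Number of distinct DNA sequences that translate to this protein."""
    return prod(CODON_COUNT[aa] for aa in protein)


# Even a six-residue peptide has over a thousand possible encodings.
print(count_coding_sequences("MSRTAQ"))  # 1152
```

Because the count grows multiplicatively with protein length, enumerating candidate DNA sequences is hopeless; this is why the problem is recast as aligning the protein against the genome.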

Comparing Genomic DNA Against mRNA
[Figure: a portion of the genome containing exon1, intron1, exon2, intron2, exon3, aligned against the mRNA (codon sequence) in which the three exons are contiguous.]

Using Similarities to Find the Exon Structure
The known frog gene is aligned to different locations in the human genome
Find the “best” path to reveal the exon structure of the human gene
[Figure: alignment plot of Frog Gene (known) against Human Genome]

Finding Local Alignments
Use local alignments to find all islands of similarity
[Figure: alignment plot of Frog Genes (known) against Human Genome]

Chaining Local Alignments
Find substrings that match a given gene sequence (candidate exons)
Define each candidate exon as a triple (l, r, w): left end, right end, and weight (the score of its local alignment)
Look for a maximum-weight chain of substrings
  – Chain: a set of non-overlapping, nonadjacent intervals.

Exon Chaining Problem
Locate the beginning and end of each interval (2n points)
Find the “best” path

Exon Chaining Problem: Formulation
Exon Chaining Problem: given a set of putative exons, find a maximum-weight set of non-overlapping putative exons
Input: a set of weighted intervals (putative exons)
Output: a maximum-weight chain of intervals from this set

Exon Chaining: Graph Representation
This problem can be solved with dynamic programming in O(n) time, once the 2n interval endpoints are sorted.

Exon Chaining Algorithm
ExonChaining(G, n)   // graph, number of intervals
  for i ← 0 to 2n    // s_0 = 0 is the base case
    s_i ← 0
  for i ← 1 to 2n
    if vertex v_i in G corresponds to the right end of an interval I
      j ← index of the vertex for the left end of interval I
      w ← weight of interval I
      s_i ← max {s_j + w, s_(i-1)}
    else
      s_i ← s_(i-1)
  return s_2n
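The pseudocode can be sketched in Python as follows. This is one possible rendering, not the lecture's own code: the function name `exon_chaining`, the (left, right, weight) tuple layout, the traceback that recovers the chain, and the tie-breaking rule (a left end is ordered before a right end at the same coordinate, so intervals sharing a point are never chained) are assumptions of this example. It sorts the endpoints itself, so the whole call runs in O(n log n).

```python
def exon_chaining(intervals):
    """Return (best_weight, chain) for a list of (left, right, weight) intervals."""
    # Build the 2n endpoint vertices: (coordinate, kind, interval index),
    # where kind 0 is a left end and kind 1 is a right end.
    events = []
    for idx, (left, right, _weight) in enumerate(intervals):
        events.append((left, 0, idx))
        events.append((right, 1, idx))
    events.sort()

    m = len(events)
    s = [0] * (m + 1)          # s[i]: best chain weight using vertices 1..i
    choice = [None] * (m + 1)  # interval selected at vertex i, if any
    left_pos = {}              # interval index -> vertex number of its left end

    for i, (_coord, kind, idx) in enumerate(events, start=1):
        s[i] = s[i - 1]
        if kind == 0:
            left_pos[idx] = i
        else:
            j = left_pos[idx]
            w = intervals[idx][2]
            if s[j] + w > s[i]:
                s[i] = s[j] + w
                choice[i] = idx

    # Trace back from the last vertex to recover one optimal chain.
    chain = []
    i = m
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            chain.append(intervals[choice[i]])
            i = left_pos[choice[i]]
    chain.reverse()
    return s[m], chain


putative_exons = [(1, 5, 5), (2, 3, 3), (4, 8, 6), (6, 12, 10), (9, 10, 1), (13, 14, 2)]
print(exon_chaining(putative_exons))
# (17, [(1, 5, 5), (6, 12, 10), (13, 14, 2)])
```

The array `s` plays the role of s_0 … s_2n in the pseudocode; only the `choice` bookkeeping is extra, and it is there so the chain itself, not just its weight, is returned.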

Exon Chaining: Deficiencies
  – Poor definition of the putative exon endpoints
  – The optimal chain of intervals may not correspond to any valid alignment
      The first interval may correspond to a suffix of the known gene, whereas the second interval may correspond to a prefix
      A combination of such intervals is not a valid alignment

Spliced Alignment
Proposed in 1996 by Mikhail Gelfand and colleagues
Goal: use a protein from one genome to reconstruct the exon-intron structure of a (related) gene in another genome.
Method
  – Begin by selecting either all putative exons between potential acceptor and donor sites, or all substrings similar to the target protein (as in the Exon Chaining Problem)
  – Find a chain of putative exons that has the highest similarity to the target protein

Spliced Alignment Problem: Formulation
Goal: find a chain of blocks in a genomic sequence that best fits a target sequence
Input: genomic sequence G, target sequence T, and a set of candidate exons (blocks) B
Output: a chain of exons Γ such that the global alignment score s(Γ*, T) is maximum among all chains of blocks from B, where Γ* is the string formed by concatenating the strings in Γ
Essentially an alignment problem…
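To make the formulation concrete, here is a brute-force sketch that simply tries every chain of blocks and keeps the one whose concatenation aligns best with the target. It is exponential in the number of blocks and is NOT the efficient dynamic programming solution; it only illustrates the objective. The function names, the +1/-1/-1 scoring scheme, the 0-based half-open block coordinates, and the toy sequences are all assumptions of this example.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b (linear gap penalty, row-by-row DP)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def spliced_alignment_brute_force(genome, target, blocks):
    """Try every chain of non-overlapping blocks; return (best score, chain)."""
    best_score, best_chain = float("-inf"), ()
    ordered = sorted(blocks)
    for k in range(1, len(ordered) + 1):
        for chain in combinations(ordered, k):
            # Skip chains in which a block starts before the previous one ends.
            if any(nxt[0] < cur[1] for cur, nxt in zip(chain, chain[1:])):
                continue
            spliced = "".join(genome[start:end] for start, end in chain)
            score = global_alignment_score(spliced, target)
            if score > best_score:
                best_score, best_chain = score, chain
    return best_score, list(best_chain)


genome = "ATGGCCGTAAGTTTTCAGAAATGA"   # exon, intron, exon (toy data)
target = "ATGGCCAAATGA"
blocks = [(0, 6), (3, 9), (12, 24), (18, 24)]
print(spliced_alignment_brute_force(genome, target, blocks))
# (12, [(0, 6), (18, 24)])
```

Unlike exon chaining, the score here depends on the whole concatenated string Γ*, so a chain whose pieces each look good in isolation but do not fit the target in order is penalized.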

Lewis Carroll Example

The solution to the spliced alignment problem will be discussed later when we talk about sequence alignment…

What You Should Know
Why splicing causes difficulty in gene prediction
The formulation and algorithm for Exon Chaining
Why Spliced Alignment is a better formulation than Exon Chaining