Dynamic Programming (cont’d) CS 466 Saurabh Sinha.

Slides:



Advertisements
Similar presentations
Gene Prediction: Similarity-Based Approaches
Advertisements

An Introduction to Bioinformatics Finding genes in prokaryotes.
Protein Targetting Prokaryotes vs. Eukaryotes Mutations
Dynamic Programming: Sequence alignment
Ab initio gene prediction Genome 559, Winter 2011.
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
Gene Prediction: Similarity-Based Approaches (selected from Jones/Pevzner lecture notes)
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
The Molecular Genetics of Gene Expression
Sequencing and Sequence Alignment
Sequence Alignment.
Gene Finding Charles Yan.
CSE182-L12 Gene Finding.
Gene Prediction: Statistical Approaches Lecture 22.
Gene Finding. Biological Background The Central Dogma Transcription RNA Translation Protein DNA.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Lecture 12 Splicing and gene prediction in eukaryotes
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Biological Motivation Gene Finding in Eukaryotic Genomes
Dynamic Programming II
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Similarity-Based Approaches.
Sequencing a genome and Basic Sequence Alignment
Central Dogma First described by Francis Crick
Motif finding: Lecture 1 CS 498 CXZ. From DNA to Protein: In words 1.DNA = nucleotide sequence Alphabet size = 4 (A,C,G,T) 2.DNA  mRNA (single stranded)
Chapter 6 Gene Prediction: Finding Genes in the Human Genome.
Sequence Alignment.
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
Chapter 2 Genes Encode RNAs and Polypeptides
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Intelligent Systems for Bioinformatics Michael J. Watts
Assignment 2: Papers read for this assignment Paper 1: PALMA: mRNA to Genome Alignments using Large Margin Algorithms Paper 2: Optimal spliced alignments.
Introduction to Bioinformatics Algorithms Sequence Alignment.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Pairwise Sequence Alignment (II) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 27, 2005 ChengXiang Zhai Department of Computer Science University.
Dynamic Programming: Sequence alignment CS 466 Saurabh Sinha.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Sequencing a genome and Basic Sequence Alignment
Chapter 13. The Central Dogma of Biology: RNA Structure: 1. It is a nucleic acid. 2. It is made of monomers called nucleotides 3. There are two differences.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
From Genomes to Genes Rui Alves.
Genome Annotation Haixu Tang School of Informatics.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Splicing Exons: A Eukaryotic Challenge to Gene Prediction Ian McCoy.
(H)MMs in gene prediction and similarity searches.
D.N.A Describe how you would go about genetically engineering a bacterium to produce human epidermal growth factor (EGF), a protein used in treating burns.
Finding genes in the genome
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Dynamic Programming (cont’d) CS 466 Saurabh Sinha.
Features of the genetic code: Triplet codons (total 64 codons) Nonoverlapping Three stop or nonsense codons UAA (ocher), UAG (amber) and UGA (opal)
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
Sequence Alignment.
Chapter 10 How Proteins are Made.
Ab initio gene prediction
Recitation 7 2/4/09 PSSMs+Gene finding
Sequence Alignment Using Dynamic Programming
Introduction to Bioinformatics II
Dynamic Programming (cont’d)
Gene Prediction: Statistical Approaches
Genes Code for Proteins
Genes Encode RNAs and Polypeptides
The Toy Exon Finder.
Presentation transcript:

Dynamic Programming (cont’d) CS 466 Saurabh Sinha

Affine Gap Penalties In nature, a series of k indels often come as a single event rather than a series of k single nucleotide events: Normal scoring would give the same score for both alignments This is more likely. This is less likely. ATA__GGC ATGATCGC ATA_G_GC ATGATCGC

Accounting for Gaps Gaps- contiguous sequence of spaces in one of the rows Score for a gap of length x is: -(ρ + σx) where ρ >0 is the penalty for introducing a gap: gap opening penalty ρ will be large relative to σ: gap extension penalty because you do not want to add too much of a penalty for extending the gap.

Affine gap penalty in DP When computing s i,j, need to look at s i,j-1, s i,j-2, s i,j-3,…. and s i-1,j, s i-2,j, … Each cell needs O(n) time for update O(n 2 ) cells Therefore, O(n 3 ) algorithm We can still do this in O(n 2 ) time

Affine Gap Penalty Recurrences s i,j = s i-1,j - σ max s i-1,j –(ρ+σ) s i,j = s i,j-1 - σ max s i,j-1 –(ρ+σ) s i,j = s i-1,j-1 + δ (v i, w j ) max s i,j s i,j Continue Gap in w (deletion) Start Gap in w (deletion): from middle Continue Gap in v (insertion) Start Gap in v (insertion):from middle Match or Mismatch End deletion: from top End insertion: from bottom

Optional Reading Section 6.10 (J & P) Multiple Alignment

Gene Prediction

Gene: A sequence of nucleotides coding for protein Gene Prediction Problem: Determine the beginning and end positions of genes in a genome Gene Prediction: Computational Challenge

The Genetic Code SOURCE:

In 1961 Sydney Brenner and Francis Crick discovered frameshift mutations Systematically deleted nucleotides from DNA –Single and double deletions dramatically altered protein product –Effects of triple deletions were minor –Conclusion: every triplet of nucleotides, each codon, codes for exactly one amino acid in a protein Codons

In 1964, Charles Yanofsky and Sydney Brenner proved colinearity in the order of codons with respect to amino acids in proteins As a result, it was incorrectly assumed that the triplets encoding for amino acid sequences form contiguous strips of information. Great Discovery Provoking Wrong Assumption

Exons and Introns In eukaryotes, the gene is a combination of coding segments (exons) that are interrupted by non-coding segments (introns) This makes computational gene prediction in eukaryotes even more difficult Prokaryotes don’t have introns - Genes in prokaryotes are continuous

Splicing exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding Batzoglou

Gene prediction More difficult in eukaryotes than in prokaryotes (due to introns). In human genome, ~3% of DNA sequence is genes Lot of “junk” DNA between genes, and even inside genes (between exons). Gene prediction must deal with this.

Gene prediction: broadly speaking Statistical approaches: look for features than appear frequently in genes and infrequently elsewhere Similarity based approaches: a newly sequenced gene may be similar to a known gene. –even this is not so simple. The exon structures may be different between otherwise similar genes

Statistical approaches

Let us consider gene prediction in prokaryotes (no introns) Detect potential coding regions by looking at ORFs –A region of length n is comprised of (n/3) codons –Stop codons break genome into segments between consecutive Stop codons –The subsegments of these that start from the Start codon (ATG) are ORFs Genomic Sequence Open reading frame ATGTGA Open Reading Frames (ORFs)

ORFs 6 reading frames in any given sequence –6 ways to map the DNA sequence to codon sequence (+1,+2,+3,-1,-2,-3) –3 on either strand Look at all 6 reading frames for ORFs

Long open reading frames may be a gene –At random, we should expect one stop codon every (64/3) ~= 21 codons –However, genes are usually much longer than this A basic approach is to scan for ORFs whose length exceeds certain threshold –This is naïve because some genes (e.g. some neural and immune system genes) are relatively short Long vs.Short ORFs

Codon usage In a given sequence (e.g., an ORF), compute frequency distribution of codons (64 element array): codon usage array Codon usage array for coding sequences is different from that for non-coding sequences If the codon usage array for an ORF is much more similar to that of coding sequences than to that of non-coding sequences, the ORF could be a gene

Codon usage Codons coding for “Arg” in human: –CGU: 37%, CGC: 38%, CGA: 7%, CGG: 10%, AGA: 5%, AGG: 3% –In a coding sequence, codon CGC is 12 times more likely than codon AGG –An ORF preferring CGC over AGG is likely to be a gene

Codon Usage in Human Genome

Codon usage One way to test if an ORF is a gene is to compute –Pr(ORF sequence under a coding sequence model) –Pr(ORF sequence under a non-coding model) –Ratio of the two. These methods work best in prokaryotes The exon-intron trouble is not handled yet

Promoter Structure in Prokaryotes (E.Coli) Transcription starts at offset 0. Pribnow Box (-10) Gilbert Box (-30) Ribosomal Binding Site (+10)

Ribosomal Binding Site

Splicing Signals: an additional statistical clue, for eukaryotes Exons are interspersed with introns and typically flanked by GT and AG

Splice site detection 5’ 3’ Donor site Position % From lectures by Serafim Batzoglou (Stanford)

Consensus splice sites

Statistical approaches: summary Codon usage Promoter motifs Ribosome binding site Splicing sites

Similarity based approaches

Some genomes may be very well-studied, with many genes having been experimentally verified. Closely-related organisms may have similar genes Unknown genes in one species may be compared to genes in some closely- related species

The basic approach Given a protein sequence, and a genomic sequence, find a set of substrings of the genomic sequence whose concatenation best fits the protein sequence Deals with the exon-intron problem First cut: Find fragments in the genomic sequence that match portions of the protein sequence (local alignment) Then find the “optimal” subset of non-overlapping fragments

Exon chaining Each of the fragments of the genomic sequence that somewhat match the protein (locally) is a putative exon The “goodness” of the match is the “weight” assigned to this putative exon Thus, we have a set of weighted intervals (l,r,w): for a fragment from l to r, with weight w representing how well it matches (a portion of) the protein

Exon Chaining Problem Input: A set of weighted intervals (l,r,w) Output: A maximum weight chain of non-overlapping intervals from this set

Exon Chaining Problem: Graph Representation This problem can be solved with dynamic programming in O(n) time. 21 edge from every l i to r i edge between every two successive vertices

Assumptions No two intervals have a common boundary point. So the (l i,r i ) define 2n distinct points, if there are n intervals

Exon Chaining Algorithm ExonChaining (G, n) //Graph, number of intervals for i ← to 2n s i ← 0 for i ← 1 to 2n if vertex v i in G corresponds to right end of the interval I j ← index of vertex for left end of the interval I w ← weight of the interval I s i ← max {s j + w, s i-1 } else s i ← s i-1 return s 2n

Not very helpful A chain is a set of non-overlapping exons in order (left to right) But the matching protein portions may not be in the same order !

Spliced Alignment Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem). This set is further filtered in a such a way that attempt to retain all true exons, with some false ones. Then find the chain of exons such that the sequence similarity to the target protein sequence is maximized

Spliced Alignment Problem: Formulation Input: Genomic sequences G, target sequence T, and a set of candidate exons (blocks) B. Output: A chain of exons Γ such that the global alignment score between Γ* and T is maximized Γ* - concatenation of all exons from chain Γ

Dynamic programming Genomic sequence G = g 1 g 2 …g n Target sequence T = t 1 t 2 …t m As usual, we want to find the optimal alignment score of the i-prefix of G and the j-prefix of T Problem is, there are many i-prefixes possible (since multiple blocks may include position i)

Idea Find the optimal alignment score of the i-prefix of G and the j-prefix of T assuming that this alignment uses a particular block B at position i S(i, j, B) For every block B that includes i

Recurrence If i is not the starting vertex of block B: S(i, j, B) = max { S(i – 1, j, B) – indel penalty S(i, j – 1, B) – indel penalty S(i – 1, j – 1, B) + δ(g i, t j ) } If i is the starting vertex of block B: S(i, j, B) = max { S(i, j – 1, B) – indel penalty max all blocks B’ preceding block B S(end(B’), j, B’) – indel penalty max all blocks B’ preceding block B S(end(B’), j – 1, B’) + δ(g i, t j ) }