Recitation 7 2/4/09 PSSMs+Gene finding

Comp. Genomics. Partially based on slides by Irit Gat-Viks and Metsada Pasmanik-Chor.

Biological Motifs
Biological units with common functions frequently exhibit similarities at the sequence level. These include very short “motifs”, such as:
- Gene splice sites
- DNA regulatory binding sites (bound by transcription factors)
Often it is desirable to model such motifs, to enable searching for new ones. Probabilistic models are very useful. Today we deal with the PSSM, the simplest.

E. coli Promoters

Regulation of Genes
(Figure; labels: transcription factor (protein), regulatory element, DNA, gene, RNA polymerase, new protein.)

Motif Logo
Motifs can mutate on less important bases.
Position: 1234567
          TGGGGGA
          TGAGAGA
          TGAGGGA
The example motifs differ at positions 3 and 5. Representations called motif logos illustrate the conserved regions of a motif.
http://weblogo.berkeley.edu
http://fold.stanford.edu/eblocks/acsearch.html

Example: Calmodulin-Binding Motif (calcium-binding proteins)

PSSM Starting Point
A gap-less MSA of known instances of a given motif. The motif is represented by either:
- Consensus
- Position Specific Scoring Matrix (PSSM)
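A minimal sketch of both representations, built from a gap-less alignment. The function names, the use of a pseudocount and its value are assumptions made for illustration; the slides do not specify how zero counts are handled.

```python
from collections import Counter

ALPHABET = "ACGT"

def build_pssm(instances, pseudocount=1.0):
    """Return one {base: probability} dict per alignment column.

    `instances` is a gap-less alignment: equal-length strings over ACGT.
    The pseudocount (an assumption, not from the slides) avoids zero
    probabilities for bases never observed in a column.
    """
    length = len(instances[0])
    if any(len(s) != length for s in instances):
        raise ValueError("all motif instances must have the same length")
    total = len(instances) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(length):
        counts = Counter(s[pos] for s in instances)
        pssm.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return pssm

def consensus(pssm):
    """Most probable base at each position."""
    return "".join(max(col, key=col.get) for col in pssm)

# The three 7-mers from the motif logo slide.
instances = ["TGGGGGA", "TGAGAGA", "TGAGGGA"]
pssm = build_pssm(instances)
print(consensus(pssm))
```

The consensus keeps only the majority base per column, whereas the PSSM retains the full per-position distribution, which is what allows graded scoring of candidate sites.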

Usage of a PSSM
For a putative k-mer, e.g. GTGC, multiply the per-position probabilities: p1(G)·p2(T)·p3(G)·p4(C). This gives the probability of the k-mer under the PSSM model.
(Figure: TATA box motif.)
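The product above can be sketched as follows. The 4-column matrix is a made-up example, not the TATA box PSSM from the slide, and `best_hit` is a hypothetical helper that slides the model along a longer sequence.

```python
# Hypothetical 4-column PSSM; each column sums to 1.
PSSM = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.2, "C": 0.1, "G": 0.6, "T": 0.1},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
]

def kmer_probability(pssm, kmer):
    """p1(x1) * p2(x2) * ... * pk(xk) for a k-mer x1..xk."""
    if len(kmer) != len(pssm):
        raise ValueError("k-mer length must equal the PSSM length")
    prob = 1.0
    for column, base in zip(pssm, kmer):
        prob *= column[base]
    return prob

def best_hit(pssm, seq):
    """Return (probability, start) of the highest-scoring window in seq."""
    k = len(pssm)
    return max(
        (kmer_probability(pssm, seq[i:i + k]), i)
        for i in range(len(seq) - k + 1)
    )

print(kmer_probability(PSSM, "GTGC"))
print(best_hit(PSSM, "AAGTGCAA"))
```

For long motifs the product becomes very small, so implementations commonly sum log-probabilities instead; the plain product is kept here to mirror the slide.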

Gene finding
Only part of the genome encodes proteins: 80-90% in bacteria, about 2% in humans.
Goal: given a genome sequence, identify gene boundaries.

The genetic code
A protein-coding gene corresponds to an open reading frame (ORF): it begins with an ATG and ends with one of the three stop codons (TAA, TAG or TGA).

Prokaryotic genes
The ‘easy’ problem. Difficulty: not all possible ORFs are actually genes. In E. coli there are 6500 ORFs but only 4290 genes. Additional “handles” are needed.

Handle #1: Long ORFs
In random DNA, a stop codon occurs once every 64/3 ≈ 21 codons on average. The average protein is ~300 codons long, so search for long ORFs.
Problems:
- Short genes
- Overlapping long ORFs on opposite strands
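A rough sketch of the long-ORF filter on a single strand. The function name, the default threshold of 100 codons and the coordinate convention (0-based start, exclusive end that includes the stop codon) are assumptions chosen for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_long_orfs(seq, min_codons=100):
    """Return (start, end, frame) for ORFs with at least `min_codons`
    codons between the first ATG and the next in-frame stop.

    Only the given strand is scanned; the opposite strand would need
    the reverse complement.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs

# Toy sequence: ATG + 5 codons + stop, placed in frame 2.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_long_orfs(toy, min_codons=3))
print(find_long_orfs(toy, min_codons=10))
```

Raising `min_codons` removes spurious ORFs at the cost of missing short genes, which is the first problem listed above.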

Handle #2: Codon frequencies
Coding DNA is not random:
- In random DNA, expect a Leu : Ala : Trp ratio of 6 : 4 : 1
- In real proteins, the ratio is 6.9 : 6.5 : 1
Frequencies differ between species.
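The 6 : 4 : 1 expectation follows from the number of codons assigned to each amino acid in the standard genetic code, assuming all 64 codons are equally likely in random DNA. The short check below builds the standard table in TCAG order; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# With uniform random codons, the expected amino acid ratio equals
# the ratio of codon counts per amino acid.
codons_per_aa = Counter(CODON_TABLE.values())
print(codons_per_aa["L"], codons_per_aa["A"], codons_per_aa["W"])
```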

Using Codon Frequencies/Usage
Assume each codon is independent. For each codon abc, calculate its frequency f(abc) in coding regions. Given a sequence a1b1c1, …, an+1bn+1cn+1, calculate for each of the three reading frames i the probability that frame i is the coding one.
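The formula itself is not present in this transcript. One common formulation, assumed here, multiplies the codon frequencies along each frame to get p1, p2, p3 and then normalises, so that the probability of frame i is pi / (p1 + p2 + p3). The sketch below follows that assumption; the function name, the `floor` value for unseen codons and the toy frequency table are all hypothetical.

```python
import math

def frame_probabilities(seq, codon_freq, floor=1e-6):
    """Return [P1, P2, P3], the normalised likelihood of each frame.

    Assumes P_i = p_i / (p_1 + p_2 + p_3), where p_i is the product of
    f(codon) over the codons read in frame i. Every frame uses the same
    number of codons so the products are comparable. Unseen codons get
    the small frequency `floor` (an assumption).
    """
    seq = seq.upper()
    n = (len(seq) - 2) // 3          # codons available in every frame
    log_p = []
    for frame in range(3):
        total = 0.0
        for j in range(n):
            codon = seq[frame + 3 * j: frame + 3 * j + 3]
            total += math.log(codon_freq.get(codon, floor))
        log_p.append(total)
    # Normalise in log space to avoid underflow on long sequences.
    peak = max(log_p)
    weights = [math.exp(lp - peak) for lp in log_p]
    norm = sum(weights)
    return [w / norm for w in weights]

# Toy table: GCT is frequent in coding regions, its shifted versions rare.
toy_freq = {"GCT": 0.5, "CTG": 0.01, "TGC": 0.01}
probs = frame_probabilities("GCT" * 4 + "GC", toy_freq)
print(probs)
```

In practice the calculation would be applied in a sliding window along the genome, with f estimated from known genes of the same species.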

Handle #3: G+C content
C+G content (“isochore”) has a strong effect on gene density, gene length, etc.
- < 43% C+G: 62% of genome, 34% of genes
- > 57% C+G: 3-5% of genome, 28% of genes
Gene density in C+G-rich regions is 5 times higher than in moderate C+G regions and 10 times higher than in A+T-rich regions. The amount of intronic DNA is 3 times higher in A+T-rich regions (both intron length and number). Etc…
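As a simple illustration, G+C content can be computed per window and binned using the 43% and 57% cut-offs quoted above. The function names, window sizes and class labels are assumptions; real isochore analyses use much larger windows.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq, window=1000, step=500):
    """Return (start, gc_fraction) for each full window along seq."""
    return [
        (i, gc_content(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]

def gc_class(fraction):
    """Bin a G+C fraction using the slide's 43% / 57% thresholds."""
    if fraction < 0.43:
        return "low"
    if fraction > 0.57:
        return "high"
    return "moderate"

for start, gc in windowed_gc("GGGGCCCCAAAATTTTGCAT", window=8, step=4):
    print(start, round(gc, 2), gc_class(gc))
```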

Handle #4: Promoter motifs
Transcription depends on regulatory regions. A common regulatory region is the promoter: RNA polymerase binds tightly to a specific DNA sequence in the promoter.

Gene prediction programs
Scan the sequence in all 6 reading frames:
- Start and stop codons
- Long ORFs
- Codon usage
- GC content
- Gene features: promoter, terminator, poly-A sites, exons and introns, …
(Figure: frames +1, +2, +3.)
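The six reading frames are the three codon offsets on the given strand plus the three offsets on its reverse complement. A small sketch of enumerating them follows; the frame labels "+1" to "-3" and the function names are illustrative conventions, not taken from any particular program.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence over ACGT."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Return {label: list of codons} for frames +1..+3 and -1..-3."""
    frames = {}
    strands = (("+", seq.upper()), ("-", reverse_complement(seq)))
    for sign, strand in strands:
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = [
                strand[i:i + 3]
                for i in range(offset, len(strand) - 2, 3)
            ]
    return frames

for label, codons in six_frames("ATGAAATAG").items():
    print(label, codons)
```

Each frame's codon list can then be passed to the individual checks listed above (start and stop codons, ORF length, codon usage).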

Moving to eukaryotes
Less of the genome is protein coding, and introns are a (very) serious headache.

Eukaryote gene structure
- Gene length: 30kb; coding region: 1-2kb
- Binding site: ~6bp, ~30bp upstream of the TSS
- Average of 6 exons, 150bp long
Huge variance:
- Dystrophin: 2.4Mb long
- Blood coagulation factor: 26 exons, 69bp to 3106bp; intron 22 contains another unrelated gene

Splicing
Splicing: the removal of the introns. Performed by complexes called spliceosomes, containing both proteins and snRNA. The snRNA recognizes the splice sites through RNA-RNA base-pairing.
Recognition must be precise: a 1nt error can shift the reading frame, making nonsense of its message.
Many genes have alternative splicing, which changes the protein created.
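The effect of a 1nt splice-site error can be illustrated with a toy transcript. The sequence, exon coordinates and helper names below are invented for the example, and the DNA alphabet (T rather than U) is used for simplicity.

```python
def splice(pre_mrna, exons):
    """Join exon slices given as (start, end), 0-based, end-exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)

def to_codons(seq):
    """Split seq into complete codons, reading from position 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]

# Toy pre-mRNA: exon 1 (6 nt) + intron (12 nt) + exon 2 (9 nt).
pre_mrna = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGGTAA"

correct = splice(pre_mrna, [(0, 6), (18, 27)])   # exact intron removal
shifted = splice(pre_mrna, [(0, 6), (19, 27)])   # acceptor site off by 1 nt

print(to_codons(correct))
print(to_codons(shifted))
```

With the exact boundaries the second exon stays in frame and the stop codon is read; with the boundary off by one base, every codon downstream of the junction changes and the stop codon is no longer in frame.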

Splice Sites