From Genomes to Genes Rui Alves.


From Genomes to Genes Rui Alves

How to make sense of genome sequences? How do I know where genes are? …atgattattggcggaatcggcggtgcaaggacacaaacaggactcagattcgaagaacgtacagacttacgaaagttgtttgaagaaattcc…

Predicting ORFs is easy, predicting genes is hard. An ORF is a sequence of nucleotides that runs from a start codon (ATG, GTG, …) to an in-frame stop codon (TAA, TAG or TGA). Finding them is as easy as reading the DNA sequence. How do we know if an ORF is a gene?
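
To make "as easy as reading the DNA sequence" concrete, here is a minimal sketch of an ORF scan. It assumes the standard stop codons (TAA, TAG, TGA), treats ATG plus the minor start codons as possible starts, and only reads the forward strand. The function name `find_orfs` and the `min_codons` parameter are illustrative, not part of the slides.

```python
START_CODONS = {"ATG", "GTG", "TTG", "CTG"}  # ATG plus the minor start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code


def find_orfs(dna, min_codons=0):
    """Return (start, end, frame) for each ORF on the forward strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    ORFs with fewer than min_codons codons before the stop are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i  # first start codon seen in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF
    return orfs


print(find_orfs("ATGAAATAG"))
```

A real annotation pipeline would also scan the reverse complement; that is left out here to keep the sketch short.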

There are several ways to predict genes By homology

Homology predictions Sequence of known gene Homologue gene …Sequenced … Genome… Homologue gene

How are sequences aligned? Substitution probability table A C - … 1 0.001 …UUACAUUUCCCGUCCGCUCU… …GGGGUUAAUUUGCCCGUCCA… …UUACAUUUCCCGUCCGCUCU… …GGGGUUAAUUUGCCCGUCCA… S2>S1 S1

Problems of homology predictions: The genetic code NO HOMOLOGY!! …UUAAUUUCCCGUCCG… …CUUAUAAGUAGACCA… Yet, the code is for the same peptide …LISRP…

Solution for the redundancy of the genetic code: use synonymous substitutions when doing the DNA alignment. The problem with doing this: …UUAAUUUCCCGUCCG… …UUAAUUUCCCGUCCA… …UUAAUUUCCAGACCG… … …CUUAUAAGUAGACCA… Combinatorial explosion! Solutions? Not many: efficient algorithms, more computer power, patience.
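
The two points above (different RNA, same peptide, and the combinatorial explosion) can be checked with a short sketch. It builds the standard genetic code table, translates the two example sequences from the slides, and counts how many RNA sequences encode the same peptide. The helper names `translate`, `identity` and `count_encodings` are illustrative.

```python
from itertools import product
from math import prod

# Standard genetic code, codons ordered with bases U, C, A, G
# (first base varies slowest, third base fastest).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(rna):
    """Translate an RNA string in frame 0, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_encodings(peptide):
    """Number of distinct RNA sequences that encode the given peptide."""
    degeneracy = {}
    for aa in CODON_TABLE.values():
        degeneracy[aa] = degeneracy.get(aa, 0) + 1
    return prod(degeneracy[aa] for aa in peptide)


s1 = "UUAAUUUCCCGUCCG"
s2 = "CUUAUAAGUAGACCA"
print(translate(s1), translate(s2))   # both print LISRP
print(identity(s1, s2))               # nucleotide identity of the two RNAs
print(count_encodings("LISRP"))       # 6 * 3 * 6 * 6 * 4 = 2592
```

Even a five-residue peptide has thousands of possible encodings, which is why aligning at the protein level is usually cheaper than enumerating synonymous DNA variants.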

Homology predictions are most effective for closely related organisms. Thus, homology-based gene prediction works best when the genome of a closely related organism has been fully sequenced and annotated.

There are other ways to predict whether ORFs are genes: by homology, and by ab initio methods. Signal sensors: ATG sites, promoter element identification, regulatory element identification, Shine-Dalgarno sequence identification (i.e. ribosome binding sites), …

Using initiation and termination codons to identify ORFs. ATG is the start codon; GTG, CTG and TTG are minor start codons. If a termination codon is too close to the ATG, the ORF is unlikely to be a gene. atgaatgaatgctgccgaagatctctggcaccaaattttggagcggttgcag… atgaatgaatgctgccgaagatctctggcaccaaattttggagcggtgacag…

Using promoter sequences to identify ORFs. Many promoters have a known structure. Identifying promoters close to initiation codons increases the likelihood of an ORF being a gene. Example: the Lac promoter.

Using response elements to identify ORFs. Regulatory binding sites (RBS) have a known structure. Identifying an RBS close to an initiation codon increases the likelihood of the ORF being a gene.

Using ribosomal binding sequences to identify ORFs. Ribosome binding sites (Shine-Dalgarno sequences, SDS) have a known structure. Identifying an SDS close to an initiation codon increases the likelihood of the ORF being a gene. Consensus Shine-Dalgarno sequence: AGGAGG.
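
A signal sensor of this kind can be sketched as a scan for the AGGAGG consensus just upstream of a candidate start codon. The 20-base window and the function name `best_sd_match` are assumptions made for illustration; the slides only say the site should be "close to" the initiation codon.

```python
SD_CONSENSUS = "AGGAGG"  # consensus Shine-Dalgarno sequence from the slide


def best_sd_match(dna, start_pos, window=20, consensus=SD_CONSENSUS):
    """Best match to the consensus in the `window` bases upstream of start_pos.

    Returns (number_of_matching_bases, position_in_dna) for the best-scoring
    site, or None if the upstream region is shorter than the consensus.
    The window size is a hypothetical choice, not a value from the slides.
    """
    dna = dna.upper()
    lo = max(0, start_pos - window)
    upstream = dna[lo:start_pos]
    k = len(consensus)
    best = None
    for i in range(len(upstream) - k + 1):
        matches = sum(a == b for a, b in zip(upstream[i:i + k], consensus))
        if best is None or matches > best[0]:
            best = (matches, lo + i)
    return best


toy = "TTTAGGAGGTTTTTTATGAAA"  # made-up sequence with a perfect site
print(best_sd_match(toy, toy.index("ATG")))
```

Allowing mismatches (rather than requiring an exact match) reflects that AGGAGG is a consensus, so real sites often differ from it at one or two positions.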

There are several ways to predict genes: by homology, and by ab initio methods. Signal sensors: promoter element identification, regulatory element identification, Shine-Dalgarno sequence identification (i.e. ribosome binding sites), ATG sites, … Content sensors: codon usage, GC content, position asymmetry, CpG islands.

Using codon bias to predict expressed ORFs. The frequencies of synonymous codons in an organism are not uniform, and the frequencies of synonymous codons in coding sequences differ from those in non-coding sequences. This can be used to predict coding open reading frames. atgaatgcatgctgccgaagatctctggcaccaaattttggagcggttgcag… Average codon usage for Ile (ATT, ATC, ATA): known genes 0.34, 0.46, 0.20; RF1 0.34, 0.26, 0.40; RF2 0.40, 0.20 (only two values given); RF3 0.32, 0.42, 0.25. The third reading frame is the most likely to be a gene, because its codon usage is closest to the average codon usage of known genes.
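
The comparison on this slide can be sketched as follows: compute the relative usage of the Ile codons in a reading frame, then measure how far it is from the average usage in known genes. The distance measure (sum of absolute differences) and the function names are assumptions for illustration; the slides do not say which measure was used. RF2 is left out because only two of its three values are given.

```python
from collections import Counter

ILE_CODONS = ("ATT", "ATC", "ATA")


def codon_frequencies(dna, frame, codons=ILE_CODONS):
    """Relative usage of the given synonymous codons in one reading frame.

    Returns None if none of the codons occurs in that frame.
    """
    dna = dna.upper()
    counts = Counter(dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    total = sum(counts[c] for c in codons)
    if total == 0:
        return None
    return tuple(counts[c] / total for c in codons)


def usage_distance(observed, reference):
    """Sum of absolute differences between two usage vectors (an assumed metric)."""
    return sum(abs(o - r) for o, r in zip(observed, reference))


# Values taken from the slide (ATT, ATC, ATA).
reference = (0.34, 0.46, 0.20)
frames = {"RF1": (0.34, 0.26, 0.40), "RF3": (0.32, 0.42, 0.25)}
best = min(frames, key=lambda name: usage_distance(frames[name], reference))
print(best)  # RF3
```

In practice the comparison would use all 61 sense codons rather than a single amino acid, but the principle is the same.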

Using GC content to predict expressed ORFs. gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag… Number of G or C bases at the third codon position: Frame 1: 11, Frame 2: 9, Frame 3: 5. The G+C content of the third position of codons in coding sequences is biased: depending on the organism, genes have a very high (or very low) G+C content at the third position of the codons in the reading frame. Frame 1 (or Frame 3, if the bias is towards low G+C) is therefore more likely to be expressed. Not very useful for eukaryotes.
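
The frame counts on this slide can be reproduced with a few lines. This is a minimal sketch; the function name `gc3_counts` is illustrative.

```python
def gc3_counts(dna):
    """Count G or C at the third codon position for each of the three frames."""
    dna = dna.upper()
    counts = []
    for frame in range(3):
        third_positions = dna[frame + 2::3]  # third base of every complete codon
        counts.append(sum(base in "GC" for base in third_positions))
    return counts


seq = "gtgattagctctgccgaagatctctggcaccaaattttggagcggttgcag"
print(gc3_counts(seq))  # [11, 9, 5], matching Frame 1, 2 and 3 on the slide
```

Whether the highest or the lowest count points to the coding frame depends on the organism's G+C bias, so the counts have to be read against what is known for genes of that genome.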

Using position asymmetry to predict expressed ORFs. Coding sequences have a characteristic distribution of nucleotides in each of the three positions of codons. gtgaatgtatgctctgccgaagatctctggcaccaaattttggagcggttgcag… Nucleotide frequencies by codon position (columns A, T, C, G; several values are missing from the transcript). Average gene: Position 1: 0.20 0.22 0.40; Position 2: 0.38; Position 3: 0.30 0.24. RF1: Position 1: 0.19 0.24 0.38; Position 2: (no values); Position 3: 0.29. RF2: Position 1: 0.38 0.24 0.19; Position 2: (no values); Position 3: 0.25. RF3: Position 1: 0.45 0.15 0.25; Position 2: 0.20 0.18 0.30 0.32; Position 3: 0.11 0.36.

Using position asymmetry to predict expressed ORFs. Reading Frame 1 is the most likely because it has the highest similarity to the position asymmetry of known genes.
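
A position asymmetry comparison can be sketched like this: build a table of nucleotide frequencies at codon positions 1, 2 and 3 for each reading frame, then pick the frame whose table is closest to that of known genes. The distance measure and the toy reference below are assumptions for illustration; they are not the values or the method from the slides.

```python
def position_profile(dna, frame):
    """Return profile[pos][base]: frequency of each base at codon position pos."""
    dna = dna.upper()
    n_codons = (len(dna) - frame) // 3
    if n_codons <= 0:
        raise ValueError("sequence too short for this frame")
    profile = []
    for pos in range(3):
        column = [dna[frame + 3 * c + pos] for c in range(n_codons)]
        profile.append({b: column.count(b) / n_codons for b in "ATCG"})
    return profile


def profile_distance(p, q):
    """Sum of absolute frequency differences over all positions and bases."""
    return sum(abs(p[pos][b] - q[pos][b]) for pos in range(3) for b in "ATCG")


def most_gene_like_frame(dna, reference):
    """Index (0, 1 or 2) of the frame whose profile is closest to the reference."""
    return min(
        range(3),
        key=lambda f: profile_distance(position_profile(dna, f), reference),
    )


# Toy reference built from a made-up "known gene"; real references would be
# averaged over many annotated genes of the organism.
reference = position_profile("ATGATGATG", 0)
print(most_gene_like_frame("GATGATGATG", reference))  # 1 (the second frame)
```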

CpG islands are signals for transcription initiation. Near the promoter of known genes, the content of CG dinucleotides is higher than it is away from transcription initiation sites. Thus, an ATG preceded by a CpG island is more likely to mark a gene.
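
One common way to quantify "higher CG dinucleotide content" is the observed/expected CpG ratio, computed as (number of CG) × (length) / (number of C × number of G). The sketch below applies it to the region upstream of a candidate ATG. The 200-base window and the function names are assumptions for illustration; the slides do not give a window size or a threshold.

```python
def cpg_obs_exp(dna):
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0
    return dna.count("CG") * len(dna) / (c * g)


def upstream_cpg(dna, atg_pos, window=200):
    """CpG observed/expected ratio in the `window` bases before atg_pos.

    The window size is a hypothetical choice, not a value from the slides.
    """
    return cpg_obs_exp(dna[max(0, atg_pos - window):atg_pos])


print(cpg_obs_exp("CGCGCGCG"))  # 2.0: CG is twice as frequent as expected
print(cpg_obs_exp("CCGG"))      # 1.0: CG occurs as often as expected
```

A ratio well above the genome-wide background upstream of an ATG would count as supporting evidence for a gene, alongside the other sensors.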

Other asymmetry measures of gene likelihood: dinucleotide bias, hexanucleotide bias, …

Summary. Genes can be predicted by homology, content sensors and signal sensors. If you need to annotate a genome, go to e.g. TIGR.

How are eukaryotic genes different? DNA mRNA RNA Pol Ryb Protein

How are eukaryotic genes different? DNA mRNA RNA Pol Ryb Spliceosome Protein mRNA Correctly Identifying Splicing sites is not a trivial task

How do we predict splicing sites? By Homology Ab initio SS motifs Codon usage Exonic Splicing Enhancers Intronic Splicing Enhancers Exonic Splicing Silencers Intronic Splicing Silencers

Homology Splice Site Prediction Known spliced gene Predicted spliced gene

Splice Site Motifs

Exonic Splicing Enhancers

Exonic Splicing Silencers Genes & Development 18:1241-1250

Interaction between SE and SI

Rules for splicing: the 3’ end is the likely target for repression; distance between SE and 3’ end < 100 bp; splicing efficiency ∝ p(interaction SEC-3’ end).

Methods for splicing detection. Set of known spliced genes → training set of known spliced genes and test set of known spliced genes. Training set → GA, NN, HMM, Bayes, ME → algorithm. Test set → algorithm → predictions.

A genetic algorithm method. Motif DM1 … AMi … EM AM p(i) IM. Shuffle lines and columns k times, and each time calculate the probability of a given combination of motifs getting spliced. Select the m best combinations and continue to evolve the algorithm until it predicts the training set.

A Neural Net Method Sequences Predicted Splicing Corrected Weight Table for splice elements Weight Table for splice elements Hidden Nodes Predicted Splicing

Summary. Eukaryotic genes have exons. Biological rules combined with mathematical and statistical approaches can be used to predict the boundaries of the exons and to predict the splice variants.

How to find what genes a string of DNA contains Rui Alves

Simple steps. Go to a known gene prediction server (or google for one). Input the sequence and wait for the prediction. Get the prediction(s), either as cDNA or as a translated protein sequence, and do homology searches to identify them in a known database (e.g. NCBI or SWISSPROT).


Paper presentation: The human genome (Science) vs. The human genome (Nature). Nature: pages 875 to 901. Science: pages 1317-1337. Compare the differences in methods and results for the annotation. DO NOT SPEND TIME TALKING ABOUT THE SEQUENCING OR ASSEMBLY ITSELF. Do not go into the comparative genome analysis.