1 Gene Finding Charles Yan. 2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where.

Slides:



Advertisements
Similar presentations
Genomics: READING genome sequences ASSEMBLY of the sequence ANNOTATION of the sequence carry out dideoxy sequencing connect seqs. to make whole chromosomes.
Advertisements

Basics of Comparative Genomics Dr G. P. S. Raghava.
1 Computational Molecular Biology MPI for Molecular Genetics DNA sequence analysis Gene prediction Gene prediction methods Gene indices Mapping cDNA on.
Psi-BLAST, Prosite, UCSC Genome Browser Lecture 3.
6/10/2015Changhui (Charles) Yan1 Gene Finding Changhui (Charles) Yan.
Gene Prediction Methods G P S Raghava. Prokaryotic gene structure ORF (open reading frame) Start codon Stop codon TATA box ATGACAGATTACAGATTACAGATTACAGGATAG.
Gene Finding Charles Yan.
CSE182-L12 Gene Finding.
Eukaryotic Gene Finding
The Human Genome Project Public: International Human Genome Sequencing Consortium (aka HUGO) Private: Celera Genomics, Inc. (aka TIGR)
Eukaryotic Gene Finding
Relationship between Genotype and Phenotype
Biological Motivation Gene Finding in Eukaryotic Genomes
Genome organization Eukaryotic genomes are complex and DNA amounts and organization vary widely between species.
International Livestock Research Institute, Nairobi, Kenya. Introduction to Bioinformatics: NOV David Lynn (M.Sc., Ph.D.) Trinity College Dublin.
Transcription Transcription is the synthesis of mRNA from a section of DNA. Transcription of a gene starts from a region of DNA known as the promoter.
Bikash Shakya Emma Lang Jorge Diaz.  BLASTx entire sequence against 9 plant genomes. RepeatMasker  55.47% repetitive sequences  82.5% retroelements.
Genome Annotation BBSI July 14, 2005 Rita Shiang.
NCBI Review Concepts Chuong Huynh. NCBI Pairwise Sequence Alignments Purpose: identification of sequences with significant similarity to (a)
RNA Structure and Transcription Mrs. MacWilliams Academic Biology.
Genome Organization and Evolution. Assignment For 2/24/04 Read: Lesk, Chapter 2 Exercises 2.1, 2.5, 2.7, p 110 Problem 2.2, p 112 Weblems 2.4, 2.7, pp.
발표자 석사 2 년 김태형 Vol. 11, Issue 3, , March 2001 Comparative DNA Sequence Analysis of Mouse and Human Protocadherin Gene Clusters 인간과 마우스의 PCDH 유전자.
Grupo 5. 5’site 3’site branchpoint site exon 1 intron 1 exon 2 intron 2 AG/GT CAG/NT.
Chapter 10 Transcription RNA processing Translation Jones and Bartlett Publishers © 2005.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Gene finding and gene structure prediction M. Fatih BÜYÜKAKÇALI Computational Bioinformatics 2012.
Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
Transcription Packet #10 Chapter #8.
Gene Prediction: Similarity-Based Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 15, 2005 ChengXiang Zhai Department of Computer Science.
Mark D. Adams Dept. of Genetics 9/10/04
From Genomes to Genes Rui Alves.
Protein Synthesis. DNA is in the form of specific sequences of nucleotides along the DNA strands The DNA inherited by an organism leads to specific traits.
Gene, Proteins, and Genetic Code. Protein Synthesis in a Cell.
Transcription in Prokaryotic (Bacteria) The conversion of DNA into an RNA transcript requires an enzyme known as RNA polymerase RNA polymerase – Catalyzes.
Complexities of Gene Expression Cells have regulated, complex systems –Not all genes are expressed in every cell –Many genes are not expressed all of.
Genome annotation and search for homologs. Genome of the week Discuss the diversity and features of selected microbial genomes. Link to the paper describing.
Bioinformatics and Computational Biology
Eukaryotic Gene Structure. 2 Terminology Genome – entire genetic material of an individual Transcriptome – set of transcribed sequences Proteome – set.
Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre.
Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
How can we find genes? Search for them Look them up.
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Bioinformatics Workshops 1 & 2 1. use of public database/search sites - range of data and access methods - interpretation of search results - understanding.
GENE REGULATION RESULTS IN DIFFERENTIAL GENE EXPRESSION, LEADING TO CELL SPECIALIZATION Eukaryotic DNA.
RNA and Gene Expression BIO 224 Intro to Molecular and Cell Biology.
Applied Bioinformatics
Finding genes in the genome
CFE Higher Biology DNA and the Genome Transcription.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
Eukaryotic Gene Regulation
Genetic Code and Interrupted Gene Chapter 4. Genetic Code and Interrupted Gene Aala A. Abulfaraj.
Biological Motivation Gene Finding in Eukaryotic Genomes Rhys Price Jones Anne R. Haake.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
bacteria and eukaryotes
Relationship between Genotype and Phenotype
The Transcriptional Landscape of the Mammalian Genome
Eukaryotic Gene Structure
Basics of Comparative Genomics
Recitation 7 2/4/09 PSSMs+Gene finding
Introduction to Bioinformatics II
What do you with a whole genome sequence?
Pairwise Sequence Alignment
Basics of Comparative Genomics
Introduction to Alternative Splicing and my research report
Basic Local Alignment Search Tool
Gene Structure.
Relationship between Genotype and Phenotype
Gene Structure.
Presentation transcript:

1 Gene Finding Charles Yan

2 Gene Finding Genomes of many organisms have been sequenced. We need to translate the raw sequences into knowledge. Where are the genes? How the genes are regulated?

3 Genome

4 Human Genome Project (HGP) To determine the sequences of the 3 billion bases that make up human DNA 99% human DNA sequence finished to 99.99% accuracy (April 2003) To identify the approximate 100,000 genes in human DNA (The estimates has been changed to 20,000-25,000 by Oct 2004) 15,000 full-length human genes identified (March 2003) To store this information in databases To develop tools for data analysis

5 Model Organisms Finished genome sequences of E. coli, S. cerevisiae, C. elegans, D. melanogaster (April 2003)

6 Completely Sequenced Genomes

7 Gene Finding More than 60 eukaryotic genome sequencing projects are underway

8 Gene Finding There is still a real need for accurate and fast tools to analyze these sequences and, especially, to find genes and determine their functions.

9 Gene Finding Homology methods, also called `extrinsic methods‘ it seems that only approximately half of the genes can be found by homology to other known genes (although this percentage is of course increasing as more genomes get sequenced). Gene prediction methods or `intrinsic methods‘ (

10 Gene Finding Eukaryotes and Prokaryotes

11 Gene Finding Prokaryotes No introns The intergenic regions are small Genes may often overlap each other The translation starts are difficult to predict correctly

12 Genes Functionally, a eukaryotic gene can be defined as being composed of a transcribed region (coding region) and of regions (regulatory region) that cis-regulate the gene expression, such as the promoter region which controls both the site and the extent of transcription. The currently existing gene prediction software look only for the transcribed region (coding region) of genes, which is then called `the gene'.

13 Genes A gene is further divided into exons and introns, the latter being removed during the splicing mechanism that leads to the mature mRNA.

14 Functional sites (Signals) In the mature mRNA, the untranslated terminal regions (UTRs) are the non-coding transcribed regions, which are located upstream of the translation initiation (5 ’ -UTR) and downstream (3 ’ -UTR) of the translation stop. They are known to play a role in the post- transcriptional regulation of gene expression, such as the regulation of translation and the control of mRNA decay

15 Functional sites (Signals) Inside or at the boundaries of the various genomic regions, specific functional sites (or signals) are documented to be involved in the various levels of protein encoding gene expression. Transcription (transcription factor binding sites and TATA boxes) Splicing (donor and acceptor sites and branch points) Polyadenylation [poly(A) site], Translation (initiation site, generally ATG with exceptions, and stop codons)

16 Functional sites (Signals)

17 Gene Finding Two different types of information are currently used to try to locate genes in a genomic sequence. (i) Content sensors are measures that try to classify a DNA region into types, e.g. coding versus non-coding. (ii) Signal sensors are measures that try to detect the presence of the functional sites specific to a gene.

18 Gene Finding Content Sensors Extrinsic content sensors Base on similarity searching Intrinsic content sensors Prediction methods

19 Extrinsic Content Sensors Extrinsic content sensors The basic tools for detecting sufficient similarity between sequences are local alignment methods ranging from the optimal Smith-Waterman algorithm to fast heuristic approaches such as FASTA and BLAST

20 Extrinsic Content Sensors Similarities with three different types of sequences may provide information about exon/intron locations.

21 Extrinsic Content Sensors The first and most widely used are protein sequences that can be found in databases such as SwissProt or PIR. Pos: Almost 50% of the genes can be identified thanks to a sufficient similarity score with a homologous protein sequence. Neg: Even when a good hit is obtained, a complete exact identification of the gene structure can still remain difficult because homologous proteins may not share all of their domains. Neg: UTRs cannot be delimited in this way

22 Extrinsic Content Sensors The second type of sequences are transcripts, sequenced as cDNAs (a cDNA is a DNA copy of a mRNA) either in the classical way for targeted individual genes with high coverage sequencing of the complete clone or as expressed sequence tags (ESTs), which are one shot sequences from a whole cDNA library. Pos: ESTs and `classical' cDNAs are the most relevant information to establish the structure of a gene.

23 Extrinsic Content Sensors Finally, under the assumption that coding sequences are more conserved than non-coding ones, similarity with genomic DNA can also be a valuable source of information on exon/intron location. Intra-genomic comparisons can provide data for multigenic families, apparently representing a large percentage of the existing genes (e.g. 80% for Arabidopsis) (Paralogous genes) Inter-genomic (cross-species) comparisons can allow the identification of orthologous genes, even without any preliminary knowledge of them.

24 Extrinsic Content Sensors Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation. Paralogous: Homologous sequences in the same species caused by a gene duplication occurred in an ancestral species, leaving two copies in all descendants.

25 Extrinsic Content Sensors Disadvantages of genomic comparisons Distantly related: The similarity may not cover entire coding exons but be limited to the most conserved part of them. Closely related: It may sometimes extend to introns and/or to the UTRs and promoter elements. In both cases, exactly discriminating between coding and non-coding sequences is not an obvious task.

26 Extrinsic Content Sensors Advantages of Extrinsic Content Sensors An important strength of similarity-based approaches is that predictions rely on accumulated preexisting biological data (with the caveat mentioned later of possible poor database quality). They should thus produce biologically relevant predictions (even if only partial). Another important point is that a single match is enough to detect the presence of a gene

27 Extrinsic Content Sensors Disadvantages of Extrinsic Content Sensors Databases may contain information of poor quality Nothing will be found if the database does not contain a sufficiently similar sequence Even when a good similarity is found, the limits of the regions of similarity, which should indicate exons, are not always very precise and do not enable an accurate identification of the structure of the gene. Small exons are easily missed.

28 Gene Finding Content sensors Extrinsic content sensors Compare with protein sequences Compare with cDNA and ESTs Genomic comparisons Intrinsic content sensors Prediction methods Signal sensors