Genes, Genomes, and Genomics

Presentation transcript:

Genes, Genomes, and Genomics Bioinformatics in the Classroom June, 2003

Two. Again … Francis Collins, HGP. Craig Venter, Celera Inc.

What’s in a chromosome?

Hierarchical vs. Whole Genome

The value of genome sequences lies in their annotation. Annotation: characterizing genomic features using computational and experimental methods. Genes: four levels of annotation. Gene prediction: Where are genes? What do they look like? Domains: What do the proteins do? Role: Which pathway(s) are they involved in?

How many genes? Consortium: 35,000 genes? Celera: 30,000 genes? Affymetrix: 60,000 human genes on GeneChips? Incyte and HGS: over 120,000 genes? GenBank: 49,000 unique gene coding sequences? UniGene: > 89,000 clusters of unique ESTs?

Current consensus (in flux …) 15,000 known genes (similarity to previously isolated genes and expressed sequences from a large variety of different organisms) 17,000 predicted (GenScan, GeneFinder, GRAIL) Based on and limited to previous knowledge

What are genes? - 1 Complete DNA segments responsible for making functional products. Products: proteins; functional RNA molecules: RNAi (interfering RNA), rRNA (ribosomal RNA), snRNA (small nuclear), snoRNA (small nucleolar), tRNA (transfer RNA).

What are genes? - 2 Definition vs. dynamic concept Consider Prokaryotic vs. eukaryotic gene models Introns/exons Posttranscriptional modifications Alternative splicing Differential expression Genes-in-genes Genes-ad-genes Posttranslational modifications Multi-subunit proteins

Where do genes live? In genomes. Example: human genome. Ca. 3,200,000,000 base pairs. 25 chromosomes: 1-22, X, Y, mt. 28,000-45,000 genes (current estimate). Gene size ranges from 128 nucleotides (RNA gene) to 2,800 kb (DMD). Ca. 25% of the genome is genic (introns, exons). Ca. 1% of the genome codes for amino acids (CDS). 30 kb gene length (average). 1.4 kb ORF length (average). 3 transcripts per gene (average).

Sample genomes

Species           Size       Genes    Genes/Mb
H. sapiens        3,200 Mb   35,000   11
D. melanogaster   137 Mb     13,338   97
C. elegans        85.5 Mb    18,266   214
A. thaliana       115 Mb     25,800   224
S. cerevisiae     15 Mb      6,144    410
E. coli           4.6 Mb     4,300    934

List of 68 eukaryotes, 141 bacteria, and 17 archaea at http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links2a.html
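The Genes/Mb figures listed above are just the gene count divided by the genome size in megabases. A minimal sketch that recomputes them from the listed sizes and counts (the dictionary layout and function name are illustrative, not part of the original slides):

```python
# Genome size (Mb) and gene count for each sample genome listed above.
GENOMES = {
    "H. sapiens": (3200.0, 35000),
    # 13,338 genes: the listed density of 97 genes/Mb implies thousands.
    "D. melanogaster": (137.0, 13338),
    "C. elegans": (85.5, 18266),
    "A. thaliana": (115.0, 25800),
    "S. cerevisiae": (15.0, 6144),
    "E. coli": (4.6, 4300),
}


def genes_per_mb(size_mb, gene_count):
    """Gene density: number of genes per megabase of genome."""
    return gene_count / size_mb


for species, (size_mb, gene_count) in GENOMES.items():
    print(f"{species:16s} {genes_per_mb(size_mb, gene_count):7.1f} genes/Mb")
```

The roughly 85-fold spread between H. sapiens and E. coli is what the later slide title "So much DNA – so few genes" alludes to.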

How do we get to the genes?

Prokaryotic gene model: ORF-genes. “Small” genomes, high gene density: the Haemophilus influenzae genome is 85% genic. Operons: one transcript, many genes. No introns: one gene, one protein. Open reading frames: one ORF per gene; ORFs begin with a start codon and end with a stop codon (by definition). TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
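Because a prokaryotic gene is modeled as a single ORF, a basic ORF detector only has to walk each reading frame from a start codon to the next in-frame stop codon. The following is a minimal sketch under simplifying assumptions: it scans the forward strand only, treats ATG as the only start codon, and uses a hypothetical `min_codons` cutoff. Real ORF finders also scan the reverse complement and allow alternative start codons.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # open an ORF at the first in-frame ATG
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close the ORF and keep scanning
    return orfs


toy = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(toy):
    print(frame, toy[start:end])
```

On the toy sequence the only ORF is ATG AAA TTT TAG in frame 2; the TGA in frame 0 is ignored because no start codon precedes it in that frame.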

And this?

Eukaryotic gene model: spliced genes Posttranscriptional modification 5’-CAP, polyA tail, splicing Open reading frames Mature mRNA contains ORF All internal exons contain open “read-through” Pre-start and post-stop sequences are UTRs Multiple translates One gene – many proteins via alternative splicing

Expansions and Clarifications ORFs Start – triplets – stop Prokaryotes: gene = ORF Eukaryotes: spliced genes or ORF genes Exons Remain after introns have been removed Flanking parts contain non-coding sequence (5’- and 3’-UTRs)
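The relationship between exons, introns, UTRs, and the ORF can be made concrete with a toy transcript. This sketch assumes the exon coordinates are already known (in practice they are what gene finders try to predict), uses made-up sequences, and writes the transcript in the DNA alphabet (T rather than U) for simplicity.

```python
STOPS = {"TAA", "TAG", "TGA"}


def splice(pre_mrna, exons):
    """Join exon intervals (0-based, end-exclusive), dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


def first_orf(mrna):
    """Return the ORF from the first ATG to the next in-frame stop, or None."""
    start = mrna.find("ATG")
    if start < 0:
        return None
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOPS:
            return mrna[start:i + 3]
    return None


# Toy gene: exon 1, an intron that starts with GT and ends with AG, exon 2.
exon1 = "GGATGGCC"
intron = "GTAAGTTTTTCAG"
exon2 = "AAATAAGG"
pre_mrna = exon1 + intron + exon2
exons = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]

mature = splice(pre_mrna, exons)
print(mature)             # exons joined, intron removed
print(first_orf(mature))  # coding part; the flanking bases are UTRs
```

Note that the ORF spans the exon junction: neither exon alone contains both the start and the stop codon, which is why ORF scanning on unspliced eukaryotic DNA is not sufficient.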

So much DNA – so “few” genes …

Genomic sequence features. Repeats (“junk DNA”): transposable elements, simple repeats. Tool: RepeatMasker. Genes: vary in density, length, and structure. Identification depends on evidence and methods and may require concerted application of bioinformatics methods and lab research. Pseudogenes: look-alikes of genes that obstruct gene-finding efforts. Non-coding RNAs (ncRNA): tRNA, rRNA, snRNA, snoRNA, miRNA. Tools: tRNAscan-SE, COVE.

Gene identification Homology-based gene prediction Similarity Searches (e.g. BLAST, BLAT) Genome Browsers RNA evidence (ESTs) Ab initio gene prediction Gene prediction programs Prokaryotes ORF identification Eukaryotes Promoter prediction PolyA-signal prediction Splice site, start/stop-codon predictions

Gene prediction through comparative genomics. Highly similar (conserved) regions between two genomes are likely to be functional; otherwise they would have diverged. If the genomes are too closely related, all regions are similar, not just genes. If the genomes are too far apart, analogous regions may be too dissimilar to be found.
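One simple way to picture "conserved regions" is percent identity in fixed windows along a pairwise alignment. The sketch below assumes the two sequences are already aligned and of equal length; the window size and the 90% identity cutoff are arbitrary illustrative values, not thresholds from the slides.

```python
def window_identity(a, b, window, step):
    """Return (start, fraction_identical) for each window of an alignment."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, len(a) - window + 1, step):
        pairs = zip(a[start:start + window], b[start:start + window])
        matches = sum(x == y for x, y in pairs)
        results.append((start, matches / window))
    return results


def conserved_windows(a, b, window=10, min_identity=0.9):
    """Start positions of non-overlapping windows at or above the cutoff."""
    return [start
            for start, identity in window_identity(a, b, window, step=window)
            if identity >= min_identity]


# Toy alignment: identical block, fully diverged block, nearly identical block.
genome_a = "ACGTACGTAC" + "TTTTTTTTTT" + "GGCCGGCCGG"
genome_b = "ACGTACGTAC" + "AAAAAAAAAA" + "GGCCGGCAGG"
print(conserved_windows(genome_a, genome_b))
```

If the two genomes were nearly identical, every window would pass the cutoff and the signal would carry no information about genes, which is the "too closely related" caveat above.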

Genome Browsers Generic Genome Browser (CSHL) www.wormbase.org/db/seq/gbrowse NCBI Map Viewer www.ncbi.nlm.nih.gov/mapview/ Ensembl Genome Browser www.ensembl.org/ UCSC Genome Browser genome.ucsc.edu/cgi-bin/hgGateway?org=human Apollo Genome Browser www.bdgp.org/annot/apollo/

Gene discovery using ESTs. Expressed Sequence Tags (ESTs) represent sequences from expressed genes. If a region matches an EST with high stringency, the region is probably a gene or pseudogene. An EST overlapping an exon boundary gives an accurate prediction of that boundary.

Ab initio gene prediction Prokaryotes ORF-Detectors Eukaryotes Position, extent & direction: through promoter and polyA-signal predictors Structure: through splice site predictors Exact location of coding sequences: through determination of relationships between potential start codons, splice sites, ORFs, and stop codons

Tools
ORF detectors: NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
Promoter predictors: CSHL: http://rulai.cshl.org/software/index1.htm BDGP: fruitfly.org/seq_tools/promoter.html ICG: TATA-Box predictor
PolyA signal predictors: CSHL: argon.cshl.org/tabaska/polyadq_form.html
Splice site predictors: BDGP: http://www.fruitfly.org/seq_tools/splice.html
Start-/stop-codon identifiers: DNALC: Translator/ORF-Finder BCM: Searchlauncher

How it works I – Motif identification
Exon-intron borders = splice sites

      Exon  | Intron                        | Exon
~~gaggcatcag|GTttgtagac~~~~~~~~~~~tgtgtttcAG|tgcacccact~~
~~ccgccgctga|GTgagccgtg~~~~~~~~~~~tctattctAG|gacgcgcggg~~
~~tgtgaattag|GTaagaggtt~~~~~~~~~~~atatctccAG|atggagatca~~
~~ccatgaggag|GTgagtgcca~~~~~~~~~~~ttatttccAG|gtatgagacg~~
       Splice site                     Splice site

Motif extraction programs at http://www-btls.jst.go.jp/
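Motif identification of this kind amounts to aligning known sites and asking which columns are conserved. The sketch below takes the ten intronic bases at each border from the four example sequences above, counts bases per column, and reports the columns where all four agree. With only four sequences, agreement outside the GT and AG dinucleotides may be coincidental.

```python
from collections import Counter

# First ten intron bases (donor side) and last ten intron bases
# (acceptor side) of the four example sequences, upper-cased.
DONORS = ["GTTTGTAGAC", "GTGAGCCGTG", "GTAAGAGGTT", "GTGAGTGCCA"]
ACCEPTORS = ["TGTGTTTCAG", "TCTATTCTAG", "ATATCTCCAG", "TTATTTCCAG"]


def position_counts(sites):
    """Count the bases observed in each column of the aligned sites."""
    return [Counter(column) for column in zip(*sites)]


def invariant_positions(sites):
    """Map column index to base for columns where every site agrees."""
    return {i: next(iter(counts))
            for i, counts in enumerate(position_counts(sites))
            if len(counts) == 1}


print("donor   :", invariant_positions(DONORS))
print("acceptor:", invariant_positions(ACCEPTORS))
```

The per-column counts from `position_counts` are the raw material for a position weight matrix, which is how splice site predictors typically score candidate sites.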

How it works II - Movies Pribnow-Box Finder 0/1 Pribnow-Box Finder all

How it works III – The (ugly) truth

Gene prediction programs Rule-based programs Use explicit set of rules to make decisions. Example: GeneFinder Neural Network-based programs Use data set to build rules. Examples: Grail, GrailEXP Hidden Markov Model-based programs Use probabilities of states and transitions between these states to predict features. Examples: Genscan, GenomeScan
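To illustrate what "probabilities of states and transitions" means, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. It is not the model used by Genscan or GenomeScan, which use far richer state sets; the two states (N for non-coding, C for coding) and all probabilities are invented for illustration, with the coding state simply favoring G and C.

```python
import math

STATES = ("N", "C")  # N = non-coding, C = coding (toy labels)
START_P = {"N": 0.5, "C": 0.5}
TRANS_P = {"N": {"N": 0.9, "C": 0.1},
           "C": {"N": 0.1, "C": 0.9}}
EMIT_P = {"N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
          "C": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}}


def viterbi(seq):
    """Return the most probable state path for seq as a string of labels."""
    # Log-probability of the best path ending in each state at position 0.
    scores = {s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]])
              for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best previous state from which to enter state s.
            prev = max(STATES,
                       key=lambda p: scores[p] + math.log(TRANS_P[p][s]))
            new_scores[s] = (scores[prev] + math.log(TRANS_P[prev][s])
                             + math.log(EMIT_P[s][base]))
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


print(viterbi("AAAAAAGGGGGG"))
```

The low transition probability (0.1) penalizes switching states, so the decoder prefers long runs of one label and switches only when the emitted bases make it worthwhile.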

Uberbacher and Mural PNAS (1991)

Burge, C.B. and S. Karlin, Finding the genes in genomic DNA. Curr Opin Struct Biol, 1998. 8(3): p. 346-54 Burge, C. and S. Karlin, Prediction of complete gene structures in human genomic DNA. J Mol Biol, 1997. 268(1): p. 78-94

Evaluating prediction programs Sensitivity vs. Specificity Sensitivity How many genes were found out of all present? Sn = TP/(TP+FN) Specificity How many predicted genes are indeed genes? Sp = TP/(TP+FP)

Evaluation of Gene Prediction Algorithms Sn = Sensitivity = TP/(TP+FN) How many exons were found out of total present? Sp = Specificity = TP/(TP+FP) How many predicted exons were correct out of total exons predicted? http://www1.imim.es/courses/Lisboa01/slide5.2.html
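The two formulas can be applied directly at the exon level by comparing predicted exon coordinates with annotated ones. In this sketch an exon counts as a true positive only if both boundaries match exactly, which is an assumption about the scoring rule; the coordinates are made up. Note that Sp as defined on these slides, TP/(TP+FP), is the quantity called precision in other fields.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN); 0.0 if there is nothing to find."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tp, fp):
    """Sp = TP / (TP + FP), as defined above; 0.0 if nothing was predicted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def exon_level_accuracy(annotated, predicted):
    """Return (Sn, Sp), counting only exact (start, end) matches as correct."""
    annotated, predicted = set(annotated), set(predicted)
    tp = len(annotated & predicted)   # predicted and real
    fn = len(annotated - predicted)   # real but missed
    fp = len(predicted - annotated)   # predicted but not real
    return sensitivity(tp, fn), specificity(tp, fp)


annotated_exons = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted_exons = [(100, 200), (300, 400), (510, 650), (800, 900),
                   (1000, 1100)]
sn, sp = exon_level_accuracy(annotated_exons, predicted_exons)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Here the prediction (510, 650) misses the annotated exon (500, 650) by one boundary, so it counts both as a false positive and as a false negative. This strictness is one reason exon-level scores fall well below nucleotide-level scores.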

Gene prediction accuracies. Nucleotide level: 95% Sn, 90% Sp (lows less than 50%). Exon level: 75% Sn, 68% Sp (lows less than 30%). Gene level: 40% Sn, 30% Sp (lows less than 10%). Programs that combine statistical evaluations with similarity searches are the most powerful.

Common difficulties First and last exons difficult to annotate because they contain UTRs. Smaller genes are not statistically significant so they are thrown out. Algorithms are trained with sequences from known genes which biases them against genes about which nothing is known. Masking repeats frequently removes potentially indicative chunks from the untranslated regions of genes that contain repetitive elements.

The annotation pipeline Mask repeats using RepeatMasker. Run sequence through several programs. Take predicted genes and do similarity search against ESTs and genes from other organisms. Do similarity search for non-coding sequences to find ncRNA.

Annotation nomenclature Known Gene – Predicted gene matches the entire length of a known gene. Putative Gene – Predicted gene contains region conserved with known gene. Also referred to as “like” or “similar to”. Unknown Gene – Predicted gene matches a gene or EST of which the function is not known. Hypothetical Gene – Predicted gene that does not contain significant similarity to any known gene or EST.
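The four labels can be read as a decision cascade over the similarity-search evidence for a predicted gene. The sketch below is one possible reading: the three boolean inputs and the order in which they are checked are assumptions, since the slides do not say how overlapping cases are resolved.

```python
def annotation_label(full_length_match, conserved_region,
                     matches_uncharacterized):
    """Assign a nomenclature label to a predicted gene.

    full_length_match: matches the entire length of a known gene.
    conserved_region: contains a region conserved with a known gene.
    matches_uncharacterized: matches a gene or EST of unknown function.
    """
    if full_length_match:
        return "Known Gene"
    if conserved_region:
        return "Putative Gene"
    if matches_uncharacterized:
        return "Unknown Gene"
    return "Hypothetical Gene"


print(annotation_label(False, True, False))
print(annotation_label(False, False, False))
```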