Gene Structure and Identification

1 Gene Structure and Identification
Genes and Genomes
ORFs and more
Consensus Sequences
Gene Finding
Reading: sections 1.3,
BIO520 Bioinformatics, Jim Lund

2 Gene The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.

3 Gene-Informatics Genes are character strings embedded in much larger strings called the genome. A gene usually encodes a protein. Genes are composed of ordered elements associated with the fundamental genetic processes including transcription, splicing, and translation.

4 ACGT to Gene
Cells recognize genes from DNA sequence.

5 Protein Coding Genes; RNA genes
RNA genes: rRNA, tRNA, siRNA, miRNA, snRNA, snoRNA…
snRNA: small nuclear RNA (mRNA splicing); U1, U2, U4, U5, and U6.
snoRNA: small nucleolar RNA (functional/catalytic in rRNA maturation).
siRNA: short double-stranded RNA; the active element binds the RNA-induced silencing complex (RISC) and directs degradation of complementary RNAs. See the 2006 Nobel Prize in Physiology or Medicine.
miRNA: noncoding, regulatory, endogenous hairpin RNAs that bind to a gene's 3' UTR and block translation.
Good ref:

6 Genomes
Genome seq. has only limited use by itself: markers, SNPs, etc.
Functional annotation: identify proteins and their functions, and regulatory regions, etc.
Parts list: a source for understanding all biology, which ushers in the post-genomic age of biology.

7 Genomes
[Table of sequenced genomes with year and size in bp. Surviving entries: 3,100,000,000; 2002 Mus musculus ,700,000,000]

8 Characteristics of Protein Coding Genes
ORF: long (usually >100 aa); "known" proteins likely.
Basal signals: transcription, splicing, translation.
Regulatory signals: depend on organism (prokaryotes vs. eukaryotes; vertebrates vs. fungi).
E.g., in yeast, ~1% of genes have ORFs <100 aa.

9 Infer Gene Structure “Gene Model”
Promoter: strength, regulation
mRNA: exons, splicing, stability
ORF = protein

10 Genomes Gene Content
E. coli: 4000 genes x 1 kbp/gene = 4 Mbp
Genome = 4 Mbp! Gene-rich

11 Genomes Gene Content
Human: 27,148 genes x 2 kbp = 54 Mb mRNA
Introns = 300 Mb?
Regulatory regions = 300 Mb?
2,446 Mb = ?
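
The arithmetic on the two Gene Content slides can be checked with a short sketch. The gene counts and sizes are the slides' own figures; the 3,100 Mb total for the human genome is an assumption read off the Genomes table, and the function name `gene_sequence_mb` is hypothetical.

```python
def gene_sequence_mb(n_genes, kbp_per_gene):
    """Total sequence occupied by n_genes of a given average size, in Mb."""
    return n_genes * kbp_per_gene / 1000.0


# E. coli: 4000 genes of about 1 kbp fill essentially the whole 4 Mbp genome.
ecoli_mb = gene_sequence_mb(4000, 1)

# Human: mRNA sequence only, before adding introns and regulatory regions.
human_mrna_mb = gene_sequence_mb(27148, 2)

# Assumed total genome size (Mb); the slide's guesses for introns and
# regulatory regions are 300 Mb each.
human_genome_mb = 3100
unexplained_mb = human_genome_mb - round(human_mrna_mb) - 300 - 300

print(ecoli_mb, round(human_mrna_mb), unexplained_mb)
```

Under these assumptions the remainder matches the slide's open question: most of the human genome is not accounted for by genes, introns, or the guessed regulatory sequence.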

12 Complex Genome DNA
~10% highly repetitive (300 Mb): NOT GENES
~25% moderately repetitive (750 Mb): some genes
~10% exons and introns (354 Mb)
55% = ? Regulatory regions, intergenic regions: hard!!

13 Easy problem: Bacterial Gene Finding
Dense genomes
Short intergenic regions
Uninterrupted ORFs
Conserved signals
Abundant comparative information
Complete genomes

14 E. coli genome
4,415 genes.
Average distance between genes: 118 bp.
Average protein length: 318 aa; 57 proteins are longer than 1000 aa and 318 are shorter than 100 aa.
2,584 operons; 70% contain one gene.
1.5% repetitive DNA (mostly viral fragments).
E. coli was the first typical bacterium sequenced and among the first model organisms sequenced.

15 Prokaryotic Gene Expression
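[Figure: a promoter, followed by cistrons 1, 2, … N and a terminator. Transcription by RNA polymerase yields one mRNA (5' to 3') carrying cistrons 1 to N. Translation by ribosomes, tRNAs, and protein factors yields a separate polypeptide (N to C terminus) for each cistron.]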

16 Prokaryotic gene prediction
ORFs
Biased nucleotide distribution: periodicity of 3
Codon bias (codon usage statistics), often summarized by the Codon Adaptation Index (CAI)
Signal sequences
Homology
Other biological info: for E. coli, partial N-terminal protein sequences.

17 Prokaryotic signal sequences
Ribosome binding site (RBS)/Shine-Dalgarno element: 3-9 purines complementary to sequence at the 3' end of the 16S rRNA in the small subunit of the ribosome. Located 4-7 bp 5' of the AUG.
Promoter: -35 consensus site (TTGACA), -10 consensus site (TATAAT)
Signal peptides
Regulatory protein binding sites (4 to 8 bp)
Detect consensus sequences using a position probability matrix or a neural network algorithm.
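
As a rough illustration of the position probability matrix approach, the sketch below builds a matrix from a handful of aligned sites and scans a sequence with a log-odds score. The training sites, the pseudocount, the uniform 0.25 background, and all function names are assumptions made for the example, not data from the lecture.

```python
import math


def build_ppm(sites, pseudocount=1.0):
    """Build a position probability matrix from equal-length aligned sites."""
    ppm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        ppm.append({base: counts[base] / total for base in "ACGT"})
    return ppm


def log_odds(ppm, window, background=0.25):
    """Sum of log2(p_site / p_background) over the window."""
    return sum(math.log2(col[base] / background) for col, base in zip(ppm, window))


def best_hit(ppm, seq):
    """Return (score, position) of the highest-scoring window in seq."""
    width = len(ppm)
    scored = [(log_odds(ppm, seq[i:i + width]), i)
              for i in range(len(seq) - width + 1)]
    return max(scored)


# Hypothetical aligned -10 sites, invented for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT", "GATAAT", "TATACT"]
ppm = build_ppm(sites)
score, pos = best_hit(ppm, "GGCGCTATAATGCGC")
print(pos, round(score, 2))
```

A real matrix would be trained on many experimentally mapped sites, and the background would reflect the genome's base composition rather than 0.25 for every base.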

18 ORFs
Long open reading frames are relatively rare.
P(ORF) = (61/64)^n
P(20) = (61/64)^20 = 0.38
P(100) = 0.008
P(200) = 10^-4
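
The probabilities on this slide follow from treating each codon as an independent draw from 64 equally likely codons, 3 of which are stops. A minimal sketch under that assumption (the function name `p_open` is hypothetical):

```python
def p_open(n_codons, stop_codons=3, total_codons=64):
    """Probability that n random codons contain no stop codon."""
    return ((total_codons - stop_codons) / total_codons) ** n_codons


for n in (20, 100, 200):
    print(n, p_open(n))
```

Real genomes have skewed base composition, so the true chance of a long stop-free stretch differs from this uniform model, but the steep decay with length is what makes long ORFs informative.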

19 ORF finding tools
Artemis
ORF Finder (NCBI)
BCM Search Launcher
Analyze ORFs: Testcode (Fickett's), CodonPreference
NCBI ORF finder:

20 ORFs in E. coli
[Figure: ORF map of an E. coli region in all six reading frames (1, 2, 3, -1, -2, -3). ORFs are shown in blue.]
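
A six-frame ORF scan like the one pictured can be sketched in a few lines. This is a simplified illustration, not the algorithm of any tool named in the lecture: it only accepts ATG starts, reports the first ATG in each open stretch, and gives minus-strand coordinates relative to the reverse complement. All names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) for ATG...stop stretches in one frame.

    end is the index just past the stop codon; min_codons counts the
    codons from the ATG up to, but not including, the stop.
    """
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=100):
    """Return (frame, start, end) tuples; frames are 1, 2, 3, -1, -2, -3."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frame = offset + 1 if strand == "+" else -(offset + 1)
            for start, end in orfs_in_frame(s, offset, min_codons):
                found.append((frame, start, end))
    return found
```

For example, `six_frame_orfs(genome, min_codons=100)` applies the >100 aa rule of thumb from the earlier slide, where `genome` is whatever sequence string is being scanned.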

21 Codon Bias
Genetic code is degenerate
Codon usage varies: organism to organism, gene to gene
High bias correlates with high-level expression
Bias correlates with tRNA isoacceptors
Change bias or tRNAs, change expression

22 Codon Bias
Gly: GGG, GGA, GGT, GGC
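
Codon bias for one amino acid can be measured by counting how often each synonymous codon appears in a coding sequence. The sketch below does this for the four glycine codons; the example sequence is made up, and the function names are hypothetical.

```python
from collections import Counter

GLY_CODONS = ("GGG", "GGA", "GGT", "GGC")


def codon_counts(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def synonymous_fractions(counts, codons):
    """Fraction of each codon among the given synonymous set."""
    total = sum(counts[c] for c in codons)
    if total == 0:
        return {c: 0.0 for c in codons}
    return {c: counts[c] / total for c in codons}


# Hypothetical coding sequence: ATG GGT GGC GGT GGA GGT TAA
cds = "ATGGGTGGCGGTGGAGGTTAA"
gly_usage = synonymous_fractions(codon_counts(cds), GLY_CODONS)
print(gly_usage)
```

Run over all genes of an organism, the same counts give a usage table against which a candidate ORF can be compared.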

23 Codon Bias Gene Differences

24 Nucleotide Bias
Coding DNA vs. non-coding DNA: G+C content is often higher than bulk.
Empirical statistics (Fickett's TESTCODE): described by James Fickett in Nucleic Acids Research 10(17) (1982). Designed to work on a window of 200 bp or more.
Useful when: an ORF matches the "typical" organism bias; an ORF is obscured by STOP codons (DNA sequence errors?).
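
The simplest nucleotide-bias statistic mentioned here, G+C content, can be profiled along a sequence in windows. The sketch below is not TESTCODE, which uses a different set of positional statistics; it only illustrates the windowed approach, with the 200 bp default taken from the slide and the step size and function names chosen for the example.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=200, step=50):
    """Return (start, gc_fraction) for each full window along seq."""
    return [(i, gc_content(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]
```

Windows whose G+C fraction stands out from the genome-wide value are candidates for closer inspection with the other evidence listed in the lecture.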

25 We found ORFs: now what?
Work backwards:
Locate adjacent cistrons
Locate RBS
Locate promoter
Locate terminator
Locate regulatory sites

26 Operon Structure
Promoter? Determine based on:
1) Spacing of ORFs. If <50 bp apart, likely not a separate gene.
2) Promoter seq. If a gene has its own promoter, it is not part of the preceding operon.
3) Functional similarity. Operons often contain genes that function together.
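
The spacing criterion alone can be turned into a first-pass operon grouping. The sketch below uses the 50 bp threshold from this slide and adds one assumption of its own, that genes in an operon lie on the same strand; it ignores the promoter and functional-similarity criteria. The gene tuples and the function name are hypothetical.

```python
def group_into_operons(genes, max_gap=50):
    """Group genes into candidate operons by intergenic distance.

    genes: list of (name, start, end, strand), sorted by start.
    A gene joins the previous group when it is on the same strand and
    starts less than max_gap bp after the previous gene ends.
    """
    operons = []
    for gene in genes:
        _, start, _, strand = gene
        if operons:
            prev = operons[-1][-1]
            if prev[3] == strand and start - prev[2] < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Hypothetical annotations, invented for illustration.
genes = [
    ("a", 100, 400, "+"),
    ("b", 420, 900, "+"),
    ("c", 1100, 1500, "+"),
    ("d", 1510, 1800, "-"),
]
names = [[g[0] for g in operon] for operon in group_into_operons(genes)]
print(names)
```

Groups produced this way are only candidates; a promoter found between two closely spaced genes would split them again.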

27 Translation Ribosome Binding Site, Shine-Dalgarno Site
nnAGGAGGnnnnnATG…
The consensus is not always used; example E. coli gene: nnAaGAGGnnnnATG.
ATG is used >90% of the time; GTG or TTG are used infrequently, and fMet (formylmethionine) is still incorporated as the 1st aa.
(Better represented as a PSSM or an HMM.)
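
A pattern search is the crudest way to look for a Shine-Dalgarno site upstream of a start codon. The sketch below accepts the consensus AGGAGG and the AAGAGG variant from the example gene, a 4 to 7 nt spacer (the range given on the signal sequences slide), and the three start codons listed above. The regular expression and function name are illustrative choices; a PSSM or HMM would tolerate far more variation.

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
RBS_START = re.compile(r"(?=(A[AG]GAGG)([ACGT]{4,7}?)(ATG|GTG|TTG))")


def find_starts(seq):
    """Return (sd_pos, sd_seq, start_pos, start_codon) for each hit."""
    seq = seq.upper()
    return [(m.start(1), m.group(1), m.start(3), m.group(3))
            for m in RBS_START.finditer(seq)]


print(find_starts("CCAGGAGGTTACAATGAAA"))
print(find_starts("TTAAGAGGCCCCGTGA"))
```

Each hit still needs to be checked against the reading frame of a candidate ORF, since a matching pattern can occur by chance.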

28 Bacterial Promoter
-35: T82 T84 G78 A65 C54 A45 … (16-18 bp) … -10: T80 A95 T45 A60 A50 T96 … (A,G)
(Numbers give the percent occurrence of each consensus base.)
Prokaryotic sigma 70 promoters are usually used, but there are other sigma factors with different site preferences.
Alternate sigma factors: CCCTTGAA….CCCGATNT
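
The spacing constraint between the two promoter elements can be added to a simple consensus search. The sketch below looks for a -35 hexamer resembling TTGACA, then a -10 hexamer resembling TATAAT 16 to 18 bp downstream, allowing a fixed number of mismatches in each. The mismatch limit and function names are assumptions for the example; weighting positions by their conservation would be closer to how real predictors score sites.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sigma70_like(seq, max_mismatch=2, spacers=(16, 17, 18)):
    """Return (pos35, pos10, spacer, total_mismatches) for candidate promoters."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 6 + 1):
        m35 = mismatches(seq[i:i + 6], "TTGACA")
        if m35 > max_mismatch:
            continue
        for spacer in spacers:
            j = i + 6 + spacer
            box10 = seq[j:j + 6]
            if len(box10) < 6:
                continue
            m10 = mismatches(box10, "TATAAT")
            if m10 <= max_mismatch:
                hits.append((i, j, spacer, m35 + m10))
    return hits
```

With mismatches allowed, hexamers this short match often by chance, so hits are best treated as candidates to combine with ORF and RBS evidence.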

29 Terminators
Rho-independent: stem/loop (structural only) followed by a 3' U tail
Rho-dependent: C-rich, G-poor; "loose" consensus

30 Difficulties in gene prediction
Frame shifts: sequencing errors
Overlapping ORFs: rare (a few percent)
Short ORFs
Unusual genes: bp composition, signal sequences

31 Programs for prokaryotic gene prediction
Glimmer
ORPHEUS
GeneMark
90%+ sensitivity and specificity
GENSCAN
Links to many gene prediction programs at
