Presentation transcript: "Gene Structure and Identification"
1 Gene Structure and Identification. Genes and Genomes; ORFs and more; Consensus Sequences; Gene Finding. Reading: sections 1.3, BIO520 Bioinformatics. Jim Lund
2 Gene: The functional and physical unit of heredity passed from parent to offspring. Genes are pieces of DNA, and most genes contain the information for making a specific protein.
3 Gene-Informatics: Genes are character strings embedded in much larger strings called the genome. A gene usually encodes a protein. Genes are composed of ordered elements associated with the fundamental genetic processes, including transcription, splicing, and translation.
4 ACGT to Gene: Cells recognize genes from DNA sequence.
5 Protein Coding Genes; RNA genes: rRNA, tRNA, siRNA, miRNA, snRNA, snoRNA… snRNA: small nuclear RNA (mRNA splicing), e.g. U1, U2, U4, U5, and U6. snoRNA: small nucleolar RNA (functional/catalytic in rRNA maturation). siRNA: bp active element that binds the RNA-induced silencing complex (RISC) and degrades complementary RNAs; see the 2006 Nobel Prize in Physiology or Medicine. miRNA: noncoding, regulatory, endogenous hairpin RNAs that bind to a gene’s 3' UTR and block translation. Good ref:
6 Genomes: A genome sequence has only limited use by itself (markers, SNPs, etc.). Functional annotation: identify proteins and their functions, as well as regulatory regions, etc. Parts list: a source for understanding all biology, which ushers in the post-genomic age of biology.
8 Characteristics of Protein Coding Genes. ORF: long (usually >100 aa); “known” proteins: likely. Basal signals: transcription, splicing, translation. Regulatory signals. These depend on organism: prokaryotes vs. eukaryotes; vertebrates vs. fungi, e.g. In yeast, ~1% of genes have ORFs <100 aa.
14 E. coli genome: 4,415 genes. Average distance between genes: 118 bp. Average protein length: 318 aa. 57 proteins are longer than 1000 aa; 318 are shorter than 100 aa. 2,584 operons, 70% of which contain one gene. 1.5% repetitive DNA (mostly viral fragments). E. coli was the first typical bacterium sequenced and among the first model organisms sequenced.
16 Prokaryotic gene prediction: ORFs. Biased nucleotide distribution: periodicity of 3. Codon bias (codon usage statistics), often summarized by the Codon Adaptation Index (CAI). Signal sequences. Homology. Other biological info: for E. coli, partial N-terminal protein sequences.
17 Prokaryotic signal sequences. Ribosome binding site (RBS)/Shine-Dalgarno element: 3-9 purines complementary to the sequence at the 3’ end of the 16S rRNA in the small subunit of the ribosome; located 4-7 bp 5’ of the AUG. Promoter: -35 consensus site (TTGACA), -10 consensus site (TATAAT). Signal peptides. Regulatory protein binding sites (4 to 8 bp). Detect consensus sequences using a position probability matrix or a neural network algorithm.
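A position probability matrix for a short signal such as the -10 site could be built and applied roughly as follows. This is only a sketch: the aligned example sites, the pseudocount, and the uniform 0.25 background are assumptions made for illustration, not data from the lecture.

```python
import math

# Hypothetical aligned -10-like sites, invented for illustration only.
SITES = ["TATAAT", "TATGAT", "TACAAT", "TATACT", "GATAAT"]

def build_ppm(sites, pseudocount=1.0):
    """Return one dict of base probabilities per alignment column."""
    ppm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        ppm.append({base: counts[base] / total for base in "ACGT"})
    return ppm

def log_odds(ppm, window, background=0.25):
    """Sum of log2(probability / background) over the window."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(ppm, window))

def best_hit(ppm, seq):
    """Return (score, position) of the best-scoring window in seq."""
    width = len(ppm)
    return max((log_odds(ppm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

ppm = build_ppm(SITES)
score, position = best_hit(ppm, "GGGCCCTATAATGGGCCC")
print(position, round(score, 2))
```

A positive log-odds score means the window looks more like the aligned sites than like random sequence; where to set the reporting threshold is a separate choice that depends on the genome.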
18 ORFs: Long open reading frames are relatively rare. For a stretch of $n$ codons, $P(\mathrm{ORF}) = (61/64)^n$: $P(20) = (61/64)^{20} = 0.38$, $P(100) = 0.008$, $P(200) \approx 10^{-4}$.
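The figures on this slide can be checked with a few lines of code. The sketch below assumes random sequence in which all 64 codons are equally likely, so that 61 of 64 codons are non-stop; the function name is made up for this example.

```python
def p_open(n_codons, stop_codons=3, total_codons=64):
    """Probability that n_codons consecutive random codons contain no stop."""
    return ((total_codons - stop_codons) / total_codons) ** n_codons

for n in (20, 100, 200):
    print(n, f"{p_open(n):.2g}")
```

Real genomes have skewed base composition, so the stop-codon frequency, and with it the chance of a long open stretch, will differ from this uniform model.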
20 ORFs in E. coli. [Figure: ORFs, shown in blue, in the six reading frames 1, 2, 3, -1, -2, -3.]
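A six-frame scan of the kind shown in the figure might be sketched as follows. It reports stop-free stretches of at least `min_codons` codons and does not require an ATG start. For frames -1 to -3 the coordinates refer to the reverse-complemented sequence. Function names and the default threshold are choices made for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) of stop-free stretches in one frame; end is exclusive."""
    start = offset
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if (i - start) // 3 >= min_codons:
                yield start, i
            start = i + 3
    # Open stretch running to the end of the sequence.
    end = len(seq) - (len(seq) - offset) % 3
    if (end - start) // 3 >= min_codons:
        yield start, end

def six_frame_orfs(seq, min_codons=100):
    """Return (frame, start, end) tuples; frame is 1..3 or -1..-3."""
    seq = seq.upper()
    found = []
    for sign, strand_seq in ((1, seq), (-1, reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset, min_codons):
                found.append((sign * (offset + 1), start, end))
    return found

example = "ATG" + "GCT" * 5 + "TAA" + "CC"
print(six_frame_orfs(example, min_codons=5))
```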
21 Codon Bias. The genetic code is degenerate. Codon usage varies: organism to organism, gene to gene. High bias correlates with high-level expression. Bias correlates with tRNA isoacceptors. Change bias or tRNAs, change expression.
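Codon usage statistics start from simple counts. The sketch below tallies codons in a coding sequence and reports the relative usage within one synonymous family, here the six leucine codons. The toy sequence is invented for illustration.

```python
from collections import Counter

LEUCINE_CODONS = ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"]

def codon_counts(cds):
    """Count in-frame codons, ignoring any incomplete trailing codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))

def relative_usage(counts, synonymous):
    """Fraction of each codon among the given synonymous codons."""
    total = sum(counts[codon] for codon in synonymous)
    return {codon: (counts[codon] / total if total else 0.0)
            for codon in synonymous}

counts = codon_counts("CTGCTGCTGTTA")  # toy sequence, not a real gene
print(relative_usage(counts, LEUCINE_CODONS))
```

Comparing such tables between a candidate ORF and a set of trusted, highly expressed genes is the basis of codon-bias measures.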
24 Nucleotide Bias. Coding DNA vs. non-coding DNA: G+C content is often higher than bulk. Empirical statistics (Fickett’s TESTCODE): described by James Fickett in Nucleic Acids Research 10(17) (1982); designed to work on a window of 200 bp or more. Useful when: the ORF matches “typical” organism bias; the ORF is obscured by STOP codons (DNA sequence errors?).
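The simplest nucleotide-bias statistic, G+C content in a sliding window, can be sketched as below. This is not Fickett's TESTCODE, which uses position-specific base frequencies. The 200 bp default window merely echoes the window size mentioned on the slide, and the step size is an arbitrary choice.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, window=200, step=50):
    """Return (start, gc_fraction) for each full window along seq."""
    return [(i, gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq) - window + 1, step)]

toy = "GC" * 10 + "AT" * 10  # toy sequence, not real data
print(gc_windows(toy, window=20, step=20))
```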
25 We found ORFs, now what? Work backwards: locate adjacent cistrons; locate RBS; locate promoter; locate terminator; locate regulatory sites.
26 Operon Structure. Promoter? Determine based on: 1) Spacing of ORFs. If <50 bp apart, likely not a separate gene. 2) Promoter seq. If a gene has its own promoter, it is not part of the preceding operon. 3) Functional similarity. Operons often contain genes that function together.
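The spacing criterion alone can be turned into a rough first-pass grouping. In the sketch below, adjacent genes join the same candidate operon when the gap between them is under 50 bp. Requiring the same strand is an added assumption, and the gene tuples are hypothetical. Promoter evidence and functional similarity would still be needed to confirm any grouping.

```python
def group_operons(genes, max_gap=50):
    """Group genes into candidate operons by intergenic spacing.

    genes: iterable of (name, start, end, strand), with end exclusive.
    """
    operons = []
    for gene in sorted(genes, key=lambda g: g[1]):
        name, start, end, strand = gene
        if operons:
            previous = operons[-1][-1]
            same_strand = strand == previous[3]
            if same_strand and start - previous[2] < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons

# Hypothetical gene coordinates, for illustration only.
genes = [("a", 0, 300, "+"), ("b", 320, 900, "+"),
         ("c", 1100, 1500, "+"), ("d", 1510, 1800, "-")]
print([[g[0] for g in operon] for operon in group_operons(genes)])
```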
27 Translation. Ribosome Binding Site, Shine-Dalgarno Site: nnAGGAGGnnnnnATG… The consensus is not always used; example E. coli gene: nnAaGAGGnnnnATG. ATG is used >90% of the time; GTG or TTG are used infrequently, and fMet (formylmethionine) is still incorporated as the 1st aa. (Better represented as a PSSM or an HMM.)
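A mismatch-tolerant scan for a Shine-Dalgarno-like site upstream of a start codon might look like the sketch below. The mismatch limit and the 4-7 bp spacer range are assumptions chosen to cover the examples on the slides. A PSSM or HMM would weight positions instead of counting mismatches.

```python
START_CODONS = ("ATG", "GTG", "TTG")
SD_CONSENSUS = "AGGAGG"

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_rbs_starts(seq, max_mismatch=1, spacers=range(4, 8)):
    """Return (start_pos, start_codon, spacer, site) for each candidate."""
    seq = seq.upper()
    k = len(SD_CONSENSUS)
    hits = []
    for i in range(len(seq) - 2):
        codon = seq[i:i + 3]
        if codon not in START_CODONS:
            continue
        for gap in spacers:
            sd_start = i - gap - k
            if sd_start < 0:
                continue
            site = seq[sd_start:sd_start + k]
            if mismatches(site, SD_CONSENSUS) <= max_mismatch:
                hits.append((i, codon, gap, site))
    return hits

# Toy sequence modeled on the nnAaGAGGnnnnATG pattern above.
print(find_rbs_starts("CCAAGAGGCCCCATGAAA"))
```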
28 Bacterial Promoter. -35: T82 T84 G78 A65 C54 A45 … (16-18 bp) … -10: T80 A95 T45 A60 A50 T96 … (A,G). Prokaryotic sigma 70 promoters are usually used, but there are other sigma factors with different site preferences. Alternate sigma factors: CCCTTGAA….CCCGATNT
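A naive search for sigma 70-like promoters can combine the two consensus hexamers with the 16-18 bp spacer. The sketch below counts mismatches against TTGACA and TATAAT. The mismatch budget is an arbitrary assumption, and a weighted matrix built from the per-position frequencies would be the more faithful model.

```python
MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mismatch=2, spacers=(16, 17, 18)):
    """Return (pos_minus35, pos_minus10, total_mismatches) for each hit."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        miss35 = hamming(seq[i:i + 6], MINUS_35)
        if miss35 > max_mismatch:
            continue
        for gap in spacers:
            j = i + 6 + gap
            box = seq[j:j + 6]
            if len(box) < 6:
                continue
            total = miss35 + hamming(box, MINUS_10)
            if total <= max_mismatch:
                hits.append((i, j, total))
    return hits

# Toy sequence with a perfect -35 box, a 17 bp spacer, and a perfect -10 box.
toy = "GG" + MINUS_35 + "C" * 17 + MINUS_10 + "GG"
print(find_promoters(toy, max_mismatch=0))
```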